The new LiteRT NeuroPilot Accelerator from Google and MediaTek is a concrete step toward running real generative models on phones, laptops, and IoT hardware without shipping every request to a data center. It takes the existing LiteRT runtime and wires it directly into MediaTek's NeuroPilot NPU stack, so developers can deploy LLMs and embedding models with a single API surface instead of per-chip custom code.
What’s LiteRT NeuroPilot Accelerator?
LiteRT is the successor to TensorFlow Lite. It is a high-performance on-device runtime that runs models in the .tflite FlatBuffer format and can target CPU, GPU and now NPU backends through a unified hardware acceleration layer.
LiteRT NeuroPilot Accelerator is the new NPU path for MediaTek hardware. It replaces the older TFLite NeuroPilot delegate with a direct integration into the NeuroPilot compiler and runtime. Instead of treating the NPU as a thin delegate, LiteRT now uses a Compiled Model API that understands ahead-of-time (AOT) compilation and on-device compilation, and exposes both through the same C++ and Kotlin APIs.
On the hardware side, the integration currently targets MediaTek Dimensity 7300, 8300, 9000, 9200, 9300 and 9400 SoCs, which together cover a large part of the Android mid-range and flagship device space.
Why Developers Care: A Unified Workflow For Fragmented NPUs
Historically, on-device ML stacks have been CPU- and GPU-first. NPU SDKs shipped as vendor-specific toolchains that required separate compilation flows per SoC, custom delegates, and manual runtime packaging. The result was a combinatorial explosion of binaries and a lot of device-specific debugging.
LiteRT NeuroPilot Accelerator replaces that with a three-step workflow that is the same regardless of which MediaTek NPU is present:
- Convert or load a .tflite model as usual.
- Optionally use the LiteRT Python tools to run AOT compilation and produce an AI Pack that is tied to one or more target SoCs.
- Ship the AI Pack through Play for On-device AI (PODAI), then select Accelerator.NPU at runtime. LiteRT handles device targeting, runtime loading, and falls back to GPU or CPU if the NPU is not available.
For you as an engineer, the main change is that device-targeting logic moves into a structured configuration file and Play delivery, while the app code mostly interacts with CompiledModel and Accelerator.NPU.
AOT and on-device compilation are both supported. AOT compiles for a known SoC ahead of time and is recommended for larger models because it removes the cost of compiling on the user's device. On-device compilation is better suited to small models and generic .tflite distribution, at the cost of higher first-run latency. The blog shows that for a model such as Gemma-3-270M, pure on-device compilation can take more than 1 minute, which makes AOT the realistic option for production LLM use.
Gemma, Qwen, And Embedding Models On MediaTek NPU
The stack is built around open-weight models rather than a single proprietary NLU path. Google and MediaTek list explicit, production-oriented support for:
- Qwen3 0.6B, for text generation in markets such as mainland China.
- Gemma-3-270M, a compact base model that is easy to fine-tune for tasks like sentiment analysis and entity extraction.
- Gemma-3-1B, a multilingual text-only model for summarization and general reasoning.
- Gemma-3n E2B, a multimodal model that handles text, audio and vision for things like real-time translation and visual question answering.
- EmbeddingGemma 300M, a text embedding model for retrieval-augmented generation, semantic search and classification.
On the latest Dimensity 9500, running in a Vivo X300 Pro, the Gemma 3n E2B variant reaches more than 1600 tokens per second in prefill and 28 tokens per second in decode at a 4K context length when executed on the NPU.
For text generation use cases, LiteRT-LM sits on top of LiteRT and exposes a stateful engine with a text-in, text-out API. A typical C++ flow is to create ModelAssets, build an Engine with litert::lm::Backend::NPU, then create a Session and call GenerateContent per conversation. For embedding workloads, EmbeddingGemma uses the lower-level LiteRT CompiledModel API in a tensor-in, tensor-out configuration, again with the NPU selected through hardware accelerator options.
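As a rough illustration of that flow, here is a minimal C++ sketch of the LiteRT-LM path. The class names follow the steps described above (ModelAssets, Engine, Session, GenerateContent); the exact factory signatures, settings helpers and the model file name are assumptions rather than the definitive API.
// Hedged sketch of the LiteRT-LM text-in / text-out flow on the NPU.
// Names mirror the steps in the text; exact signatures are assumptions.
auto model_assets = ModelAssets::Create("gemma3-1b.litertlm");  // hypothetical model file

// Build a stateful engine that runs on the MediaTek NPU backend.
auto engine = Engine::CreateEngine(
    EngineSettings::CreateDefault(*model_assets, litert::lm::Backend::NPU));

// One Session per conversation; the session holds the generation state.
auto session = (*engine)->CreateSession(SessionConfig::CreateDefault());

// Plain text in, plain text out.
auto responses = (*session)->GenerateContent(
    {InputText("Summarize today's meeting notes in three bullet points.")});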
Developer Experience: C++ Pipeline And Zero-Copy Buffers
LiteRT introduces a new C++ API that replaces the older C entry points and is designed around explicit Environment, Model, CompiledModel and TensorBuffer objects.
For MediaTek NPUs, this API integrates tightly with Android's AHardwareBuffer and GPU buffers. You can construct input TensorBuffer instances directly from OpenGL or OpenCL buffers with TensorBuffer::CreateFromGlBuffer, which lets image processing code feed NPU inputs without an intermediate copy through CPU memory. That matters for real-time camera and video processing, where multiple copies per frame quickly saturate memory bandwidth.
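A hedged sketch of that zero-copy path follows. Only TensorBuffer::CreateFromGlBuffer is named in the text; the parameter list, the GetInputTensorType accessor and the CreateCameraOutputBuffer helper are illustrative assumptions, not confirmed API.
// Hedged sketch: wrap an existing OpenGL buffer as an NPU input tensor,
// avoiding a copy through CPU memory. Parameter order is an assumption.
GLuint camera_ssbo = CreateCameraOutputBuffer();      // hypothetical helper: GL SSBO written by the camera pipeline
size_t frame_bytes = 224 * 224 * 3 * sizeof(float);   // example RGB frame size
auto input_type = model->GetInputTensorType(0);       // assumed accessor for the input tensor's type

auto gl_input = TensorBuffer::CreateFromGlBuffer(
    *env, *input_type,
    GL_SHADER_STORAGE_BUFFER,   // GL target holding the frame
    camera_ssbo,                // GL buffer id
    frame_bytes,
    /*offset=*/0);

// gl_input can then be passed to CompiledModel::Run like any other TensorBuffer.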
A typical high-level C++ path on device looks like this, omitting error handling for readability:
// Load a model compiled for the NPU
auto model = Model::CreateFromFile("model.tflite");
auto options = Options::Create();
options->SetHardwareAccelerators(kLiteRtHwAcceleratorNpu);
// Create the compiled model
auto compiled = CompiledModel::Create(*env, *model, *options);
// Allocate buffers and run
auto input_buffers = compiled->CreateInputBuffers();
auto output_buffers = compiled->CreateOutputBuffers();
input_buffers[0].Write(input_span);
compiled->Run(input_buffers, output_buffers);
output_buffers[0].Read(output_span);
The same Compiled Model API is used whether you are targeting the CPU, GPU or the MediaTek NPU, which reduces the amount of conditional logic in application code.
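For instance, the GPU/CPU fallback mentioned earlier can be expressed in the same options object. This is a hedged sketch that assumes the hardware accelerator flags form a bitmask and can be combined; the article only confirms selecting the NPU and automatic fallback, not this exact combination syntax.
// Hedged sketch: prefer the NPU but keep GPU and CPU as fallbacks.
// Assumption: the accelerator flags are a bitmask that can be OR-ed together.
auto options = Options::Create();
options->SetHardwareAccelerators(kLiteRtHwAcceleratorNpu |
                                 kLiteRtHwAcceleratorGpu |
                                 kLiteRtHwAcceleratorCpu);
auto compiled = CompiledModel::Create(*env, *model, *options);
// Application code stays the same regardless of which backend actually runs.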
Key Takeaways
- LiteRT NeuroPilot Accelerator is the new, first-class NPU integration between LiteRT and MediaTek NeuroPilot, replacing the old TFLite delegate and exposing a unified Compiled Model API with AOT and on-device compilation on supported Dimensity SoCs.
- The stack targets concrete open-weight models, including Qwen3-0.6B, Gemma-3-270M, Gemma-3-1B, Gemma-3n-E2B and EmbeddingGemma-300M, and runs them through LiteRT and LiteRT-LM on MediaTek NPUs with a single accelerator abstraction.
- AOT compilation is strongly recommended for LLMs; for example, Gemma-3-270M can take more than 1 minute to compile on device, so production deployments should compile once in the pipeline and ship AI Packs via Play for On-device AI.
- On a Dimensity 9500 class NPU, Gemma-3n-E2B can reach more than 1600 tokens per second in prefill and 28 tokens per second in decode at 4K context, with measured throughput up to 12 times CPU and 10 times GPU for LLM workloads.
- For developers, the C++ and Kotlin LiteRT APIs provide a common path to select Accelerator.NPU, manage compiled models and use zero-copy tensor buffers, so CPU, GPU and MediaTek NPU targets can share one code path and one deployment workflow.
Check out the docs and technical details for more.