Generative AI

Google's LiteRT NeuroPilot Stack Turns MediaTek Dimensity NPUs into First-Class Targets for On-Device LLMs

The new LiteRT NeuroPilot accelerator from Google and MediaTek is a concrete step toward running real, production-grade models on phones, laptops, and IoT hardware without shipping every request to the data center. It wires LiteRT's existing runtime directly into MediaTek's NeuroPilot NPU stack, so developers can deploy LLMs and embedding models through a single API surface instead of chip-by-chip code.

What is the LiteRT NeuroPilot accelerator?

LiteRT is the successor to TensorFlow Lite. It is a high-performance on-device runtime that consumes models in the .tflite FlatBuffer format and can accelerate them on CPU, GPU, and now NPU backends through an integrated hardware accelerator layer.

The LiteRT NeuroPilot accelerator is the new NPU path for MediaTek hardware. It replaces the older TFLite NeuroPilot delegate with direct integration into the NeuroPilot compiler and runtime. Instead of treating the NPU as a thin add-on, LiteRT now offers a unified model for both ahead-of-time (AOT) and on-device compilation, and exposes both through the same APIs for C++ and Kotlin.

On the hardware side, the integration currently targets MediaTek Dimensity chipsets such as the 7300, 8300, 9000, 9300, and 9400 series, covering Android's mid-range and flagship space.

Why should developers care about unified NPU support?

Historically, ML stacks have been CPU- and GPU-first. NPU SDKs were delivered as vendor-specific tools, each with its own SoC-specific flows, custom operators, and runtime quirks. The result was a combinatorial explosion of binaries and per-device code paths.

The LiteRT NeuroPilot accelerator replaces that with three workflow steps that stay the same regardless of which MediaTek NPU is present:

  • Convert or author a .tflite model as usual.
  • Optionally use the LiteRT Python tools to run AOT compilation and generate an AI Pack bound to one or more SoCs.
  • Ship the AI Pack via Google Play for On-device AI (PODAI), then select Accelerator.NPU at initialization. LiteRT handles device targeting and runtime loading, and falls back to the GPU or CPU if the NPU is not available.

For you as a developer, the biggest change is that device-routing logic moves into configuration and delivery tooling, while the application code interacts mostly with CompiledModel and Accelerator.NPU.

Both AOT and on-device compilation are supported. AOT targets SoCs known ahead of time and is recommended for large models because it removes compilation cost from the user's device. On-device compilation suits small, generic models shipped as plain .tflite files, at the cost of first-run latency. The blog notes that for a model like Gemma-3-270M, pure on-device compilation can take more than a minute, which makes AOT the reasonable option for LLMs in production.

Gemma, Qwen, and embedding models on the MediaTek NPU

The stack is built around open-weight models rather than a single proprietary method. The list from Google and MediaTek is concrete, with out-of-the-box support for:

  • Qwen3-0.6B, for text generation in markets such as mainland China.
  • Gemma-3-270M, a compact base model that is easy to fine-tune for tasks such as sentiment analysis and similar task-specific workloads.
  • Gemma-3-1B, a multilingual text model for summarization and general reasoning.
  • Gemma-3n-E2B, a multimodal model that handles text, audio, and vision for things like real-time translation and visual understanding.
  • EmbeddingGemma-300M, a text-embedding model for retrieval-augmented generation, semantic search, and classification.

On the latest Dimensity 9500, running in the Vivo X300 Pro, the Gemma-3n-E2B variant reaches more than 1,600 tokens per second at a 4K context length when executed on the NPU.

For text-generation use cases, LiteRT-LM sits on top of LiteRT and exposes a turnkey engine for text output via its API. A standard C++ flow creates ModelAssets, builds an Engine with litert::lm::Backend::NPU, then creates a Session and drives GenerateContent for each conversation turn. For embedding workloads, EmbeddingGemma uses the lower-level LiteRT CompiledModel API in a tensor-in, tensor-out configuration, with NPU selected in the hardware accelerator options.

Developer experience: C++ API and zero-copy buffers

LiteRT introduces a new C++ API that replaces the old C APIs and is designed around a small set of transparent objects: Environment, Model, CompiledModel, and TensorBuffer.

With MediaTek NPUs, this API integrates tightly with Android's AHardwareBuffer and GPU buffers. You can create input TensorBuffer instances directly from OpenGL or OpenCL buffers with TensorBuffer::CreateFromGlBuffer, which lets image-processing output land directly in NPU input without an intermediate copy through CPU memory. This matters in real-time camera and video pipelines, where multiple copies per frame are a memory and latency drain.

The standard high-level C++ flow for on-device inference looks like this, with error handling omitted for clarity:

// Load model compiled for NPU
auto model = Model::CreateFromFile("model.tflite");
auto options = Options::Create();
options->SetHardwareAccelerators(kLiteRtHwAcceleratorNpu);

// Create compiled model (env is a LiteRT Environment created earlier)
auto compiled = CompiledModel::Create(*env, *model, *options);

// Allocate buffers and run
auto input_buffers = compiled->CreateInputBuffers();
auto output_buffers = compiled->CreateOutputBuffers();
input_buffers[0].Write(input_span);
compiled->Run(input_buffers, output_buffers);
output_buffers[0].Read(output_span);

The same CompiledModel API is used whether you target a CPU, GPU, or MediaTek NPU, which reduces the amount of conditional logic in application code.

Key takeaways

  1. The LiteRT NeuroPilot accelerator is a new, first-class NPU integration between LiteRT and MediaTek NeuroPilot, replacing the older TFLite delegate and presenting a unified compilation model with AOT and on-device paths on supported Dimensity SoCs.
  2. The stack targets concrete open-weight models, including Qwen3-0.6B, Gemma-3-270M, Gemma-3-1B, Gemma-3n-E2B, and EmbeddingGemma-300M, all usable through LiteRT and LiteRT-LM on MediaTek NPUs with a single accelerator selection.
  3. AOT compilation is strongly recommended for LLMs; even Gemma-3-270M can take more than a minute to compile on device, so production apps should compile once in the build pipeline and ship AI Packs via Play for On-device AI.
  4. With a Dimensity 9500-class NPU, Gemma-3n-E2B can exceed 1,600 tokens per second, compared with roughly 28 tokens per second on the CPU, and is reported at around 10x GPU throughput for LLM tasks.
  5. For developers, the C++ and Kotlin LiteRT APIs provide a standard way to select Accelerator.NPU, manage compiled models, and use zero-copy tensor buffers, so CPU, GPU, and MediaTek NPU targets can share a single code path and a single workflow.

Check out the documentation for full technical details; tutorials, code, and notebooks are available on the project's GitHub page.


Michal Sutter is a data scientist with a Master of Science in Data Science from the University of Padova. With a strong foundation in statistical analysis, machine learning, and data engineering, Michal excels at turning complex data into actionable findings.


