Google DeepMind Releases Gemma 4 QAT Checkpoints: Q4_0 and Mobile Format Cut On-Device Memory

Google DeepMind has released Quantization-Aware Training (QAT) checkpoints for the Gemma 4 family. The release targets local deployment on edge devices and consumer GPUs. It follows the launch of Gemma 4 in April and of the 12B model two days earlier.
We compared the available formats of the Gemma 4 model using only published numbers. The goal was simple. Show how much each level of precision costs in memory. Then show what QAT actually changes.
What QAT actually does
Quantization shrinks a model by reducing the precision of its weights. Standard Post-Training Quantization (PTQ) compresses a finished model. That tends to lower quality. QAT instead simulates quantization during training, so the model learns to compensate for the loss of precision.
Google's AI team claims that QAT delivers notably higher quality than standard PTQ. Google did not publish Gemma 4 QAT benchmark scores in the announcement. For context, Google reported that Gemma 3 QAT reduced the perplexity drop by 54% at Q4_0, measured with the llama.cpp perplexity evaluation. We cite that only as precedent from the previous generation.
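To make "simulating quantization" concrete, here is a minimal sketch of a quantize-dequantize round trip (often called fake quantization), the kind of operation QAT schemes commonly place in the forward pass. The function name `fake_quantize`, the symmetric rounding rule, and the block size of 32 are illustrative assumptions. This is not Google's training recipe and not the exact llama.cpp Q4_0 layout.

```python
def fake_quantize(weights, bits=4, block_size=32):
    """Snap each weight to a signed integer grid per block, then map it back to float.

    Assumption: symmetric quantization with one scale per block.
    """
    qmax = 2 ** (bits - 1) - 1          # 7 for 4-bit
    out = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        amax = max(abs(w) for w in block)
        scale = amax / qmax if amax > 0 else 1.0
        for w in block:
            q = max(-qmax - 1, min(qmax, round(w / scale)))  # integer code
            out.append(q * scale)                            # dequantized value
    return out


weights = [1.0, -0.5, 0.25, 0.0, 0.9, -0.33]
simulated = fake_quantize(weights, bits=4, block_size=4)
worst = max(abs(a - b) for a, b in zip(weights, simulated))
print(f"worst-case rounding error: {worst:.4f}")
```

During QAT the training loss is computed on the rounded values, so the optimizer sees the rounding error and can steer weights toward values that survive it. PTQ applies the same rounding only once, after training has finished.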
Comparison task
Compare Gemma 4 E2B and E4B across all three formats. The formats are BF16, Q4_0 QAT, and the new QAT mobile schema. Rank them on memory, quality preservation, and device accessibility. Use only published figures.
Memory footprint
| Format | E2B | E4B | Source |
|---|---|---|---|
| BF16 (16-bit) | 9.6 GB | 15 GB | Official Gemma 4 docs |
| Q4_0 (4-bit, QAT) | 3.2 GB | 5 GB | Official Gemma 4 docs |
| Mobile (QAT, E2B) | ~1 GB | – | QAT announcement |
The Q4_0 figures match the footprint of a PTQ Q4_0 model. QAT does not change the size of a given format. It improves quality at that size. The new mobile schema brings further reductions.
Using that mobile schema, Google reduced Gemma 4 E2B to around 1 GB. Developers can go lower still. A text-only variant without Per-Layer Embeddings (PLE) requires less than 1 GB once the audio and vision encoders are dropped.
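The reductions implied by the table can be checked with simple arithmetic. The sketch below uses only the published figures quoted above; the dictionary layout and the helper name `reduction_vs_bf16` are illustrative, and the mobile value is the approximate 1 GB from the announcement.

```python
# Published footprints in GB, as quoted in the table above.
FOOTPRINT_GB = {
    "BF16": {"E2B": 9.6, "E4B": 15.0},
    "Q4_0 QAT": {"E2B": 3.2, "E4B": 5.0},
    "Mobile QAT": {"E2B": 1.0},  # approximate figure
}


def reduction_vs_bf16(fmt, model):
    """Return (compression ratio, percent saved) relative to BF16."""
    base = FOOTPRINT_GB["BF16"][model]
    size = FOOTPRINT_GB[fmt][model]
    return base / size, 100.0 * (1.0 - size / base)


for fmt, sizes in FOOTPRINT_GB.items():
    for model in sizes:
        ratio, saved = reduction_vs_bf16(fmt, model)
        print(f"{fmt:10s} {model}: {ratio:.1f}x smaller, {saved:.1f}% saved")
```

On these numbers Q4_0 comes out at roughly a threefold reduction for both models, and the mobile schema at a bit under tenfold for E2B.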
A breakdown of each format
BF16 is the quality baseline. E2B requires 9.6 GB and E4B requires 15 GB. It is a reference point, not a deployment target.
Q4_0 QAT is the general-purpose local format. E2B drops to 3.2 GB and E4B to 5 GB. QAT preserves more quality here than PTQ at the same size. This format fits on consumer GPUs. Earlier E2B testing also ran on a Raspberry Pi 5 at INT4.
The mobile format is a specialized schema. It brings E2B to about 1 GB. It uses static activations, channel-wise scaling, and targeted 2-bit compression.
How the mobile schema works
The Google AI team developed four techniques for mobile hardware. Static activations precompute scaling factors during training, reducing on-device computation. Channel-wise scaling matches the design of mobile accelerators. Targeted 2-bit quantization compresses only the token-generation layers. Embedding and KV-cache optimizations reduce working memory.
The core reasoning layers stay at higher precision. That saves power while cutting storage. Developers can also deploy text only and omit the audio and vision encoders. That further reduces memory for use cases that do not need multimodality.
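Two of these ideas can be illustrated in a few lines. The sketch below compares one shared scale against one scale per channel on a toy weight matrix, then fixes an activation scale ahead of time instead of computing it per input. All names, the toy values, and the use of a small calibration set as a stand-in for training-time statistics are assumptions made for illustration; Google has not published the mobile schema at this level of detail in the source.

```python
def scale_for(values, bits):
    """Symmetric scale that maps the largest magnitude onto the integer range."""
    qmax = 2 ** (bits - 1) - 1
    return (max(abs(v) for v in values) / qmax) or 1.0


def quantize_dequantize(values, scale, bits):
    """Round to a signed integer grid with the given scale, then map back."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]


def quantize_matrix(matrix, bits=4, per_channel=True):
    """Quantize each row (channel) with its own scale, or all rows with one."""
    if per_channel:
        scales = [scale_for(row, bits) for row in matrix]
    else:
        shared = scale_for([v for row in matrix for v in row], bits)
        scales = [shared] * len(matrix)
    return [quantize_dequantize(row, s, bits) for row, s in zip(matrix, scales)]


def mean_abs_error(a, b):
    pairs = [(x, y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(abs(x - y) for x, y in pairs) / len(pairs)


# Toy matrix: one channel with tiny weights, one with large weights.
weights = [[0.01, -0.02, 0.015], [4.0, -3.0, 2.0]]
err_tensor = mean_abs_error(weights, quantize_matrix(weights, per_channel=False))
err_channel = mean_abs_error(weights, quantize_matrix(weights, per_channel=True))
print(f"shared scale error: {err_tensor:.4f}, per-channel error: {err_channel:.4f}")

# Static activation scale: fixed once, so inference skips the per-input max pass.
calibration = [[0.5, -1.2, 0.8], [2.0, -0.3, 1.1]]  # hypothetical sample activations
STATIC_SCALE = scale_for([v for batch in calibration for v in batch], bits=8)


def quantize_activations_static(values):
    """Quantize with the precomputed scale; out-of-range inputs are clipped."""
    return quantize_dequantize(values, STATIC_SCALE, bits=8)
```

With a shared scale, the small-magnitude channel collapses to zero, which is why the per-channel error is lower. The trade-off of a static activation scale is visible too: inputs larger than anything seen when the scale was fixed get clipped.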
Scoring by dimension
The scores rate each format for on-device use. Memory is the only axis grounded in hard numbers. Quality reflects Google's disclosed design, not measured Gemma 4 numbers. Each score has a one-line basis.
| Dimension | BF16 | Q4_0 QAT | QAT Mobile |
|---|---|---|---|
| Memory footprint | 1: heavy, 9.6 GB E2B | 4: 3.2 GB E2B | 5: ~1 GB E2B text only |
| Quality preservation | 5: full-precision baseline | 4: QAT-preserved, close to baseline | 3: 2-bit token layers, core kept at higher precision |
| Inference speed | 2: no low-bit speedup | 4: 4-bit speedup | 5: mobile-optimized static activations |
| Ecosystem support | 4: loads widely but heavy | 5: llama.cpp, Ollama, LM Studio, vLLM, MLX | 3: LiteRT-LM, Transformers.js, edge oriented |
| Device accessibility | 1: requires a large GPU | 4: consumer GPU, Raspberry Pi 5 | 5: runs on phones |
| Total (/25) | 13 | 21 | 21 |
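
The totals row is a plain unweighted sum of the five scores per format. A short check, with the scores copied from the table and the helper names chosen for illustration:

```python
# Scores per format in table order:
# memory, quality, speed, ecosystem, device accessibility.
SCORES = {
    "BF16": [1, 5, 2, 4, 1],
    "Q4_0 QAT": [4, 4, 4, 5, 4],
    "QAT Mobile": [5, 3, 5, 3, 5],
}


def totals(scores):
    """Unweighted sum per format, out of 25."""
    return {fmt: sum(vals) for fmt, vals in scores.items()}


def leaders(scores):
    """Formats that share the highest total."""
    summed = totals(scores)
    best = max(summed.values())
    return sorted(fmt for fmt, total in summed.items() if total == best)


print(totals(SCORES))
print(leaders(SCORES))
```

Because the sum is unweighted, a reader who cares mostly about one axis, such as memory on a phone, should read that row rather than the total.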
The winner
The result is a tie by design. Q4_0 QAT and QAT mobile both score 21, but they serve different platforms. On phones, the mobile format leads. It reaches about 1 GB for E2B and targets mobile accelerators directly. On laptops and consumer GPUs, Q4_0 QAT is the default. BF16 remains a quality reference, not a local deployment choice.
Methodology and limitations
Memory figures come from Google's Gemma 4 documentation. The ~1 GB E2B figure comes from the QAT announcement. Quality is Google's stated claim. No independent quality numbers for Gemma 4 QAT were published at release. We did not run the models for this comparison. Developers should validate quality and performance on their own workloads before building.
Key Takeaways
- Q4_0 QAT cuts Gemma 4 E2B to 3.2 GB and E4B to 5 GB, from 9.6 GB and 15 GB in BF16.
- The new QAT mobile schema brings E2B to about 1 GB; text only without PLE goes under 1 GB.
- QAT changes quality at a given size, not the size itself; the mobile format drives the further memory cuts.
- Google claims higher quality than PTQ but did not publish benchmark numbers for Gemma 4 QAT at release.
- Weights are available today on Hugging Face, with support for llama.cpp, Ollama, LM Studio, vLLM, MLX, and LiteRT-LM.
Check out the model weights (Q4_0 QAT collection, Mobile QAT collection) and the Google blog post (QAT release).



