
Google DeepMind Releases Gemma 4 QAT Checkpoints: Q4_0 and Mobile Format Cut On-Device Memory

Google DeepMind has released Quantization-Aware Training (QAT) checkpoints for the Gemma 4 family. The release targets local use on edge devices and consumer GPUs. It follows the launch of Gemma 4 in April and of the 12B model two days earlier.

We compared the available Gemma 4 formats using only published numbers. The goal was simple: show how much memory each level of precision costs, then show what QAT actually changes.

What QAT actually does

Quantization shrinks a model by reducing the precision of its weights. Standard Post-Training Quantization (PTQ) compresses a finished model, which tends to lower quality. QAT instead simulates quantization during training, so the model learns to compensate for the loss of precision.
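To make "simulates quantization during training" more concrete, here is a minimal NumPy sketch of the rounding step a QAT forward pass would see. It is an illustration under stated assumptions, not Google's recipe: the symmetric per-tensor 4-bit grid and the helper name `fake_quantize` are our own choices.

```python
import numpy as np

def fake_quantize(weights: np.ndarray, bits: int = 4) -> np.ndarray:
    """Round weights to a symmetric integer grid, then map back to floats.

    Assumption: one scale for the whole tensor. The forward pass sees the
    rounding error while the stored weights stay in full precision.
    """
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit symmetric
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid dividing by zero
    q = np.clip(np.round(weights / scale), -qmax, qmax)
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
w_fq = fake_quantize(w, bits=4)

# A real QAT loop would compute the loss with w_fq and typically pass the
# gradient straight through the rounding step (straight-through estimator).
max_err = float(np.max(np.abs(w - w_fq)))
levels = int(np.unique(w_fq).size)
print(f"max rounding error: {max_err:.4f}, distinct levels: {levels}")
```

PTQ applies this kind of rounding once, after training. QAT exposes the model to it while the weights can still adapt, which is the basis of the quality claim.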

Google's AI team claims that its QAT checkpoints retain much higher quality than standard PTQ. Google did not publish Gemma 4 QAT benchmark scores in the announcement. For context, Gemma 3 QAT reported a 54% reduction in the perplexity drop for Q4_0, measured with llama.cpp's perplexity evaluation. We cite that only as a precedent from the previous generation.

The comparison task

Compare Gemma 4 E2B and E4B in all three formats. The formats are BF16, Q4_0 QAT, and the new QAT mobile schema. Rank them on memory, quality preservation, and device accessibility. Use only published figures.

Memory footprint

Format | E2B | E4B | Source
BF16 (16-bit) | 9.6 GB | 15 GB | Official Gemma 4 docs
Q4_0 (4-bit, QAT) | 3.2 GB | 5 GB | Official Gemma 4 docs
Mobile (QAT) | ~1 GB | not stated | QAT announcement

The Q4_0 figures match the footprint of a PTQ Q4_0 model. QAT does not change the size within a given format. It improves quality at that size. The new mobile schema brings the further reductions.

Using that mobile schema, Google reduced Gemma 4 E2B to around 1 GB. Developers can go lower still. A text-only configuration without Per-Layer Embeddings (PLE), and without the audio and vision encoders, needs less than 1 GB.
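The arithmetic behind those figures is easy to check. The sketch below uses only the published footprints from the table above, plus one fact about llama.cpp's Q4_0 layout (blocks of 32 weights, each block storing 4-bit values and one 16-bit scale). The dictionary layout and function name are ours.

```python
# Published footprints in GB, as listed in the memory table.
published_gb = {
    "E2B": {"BF16": 9.6, "Q4_0 QAT": 3.2},
    "E4B": {"BF16": 15.0, "Q4_0 QAT": 5.0},
}

def reduction_ratio(before_gb: float, after_gb: float) -> float:
    """How many times smaller the quantized footprint is."""
    return before_gb / after_gb

for model, sizes in published_gb.items():
    ratio = reduction_ratio(sizes["BF16"], sizes["Q4_0 QAT"])
    saved = sizes["BF16"] - sizes["Q4_0 QAT"]
    print(f"{model}: {ratio:.1f}x smaller, {saved:.1f} GB saved")

# Q4_0 block: 32 weights x 4 bits, plus one 16-bit scale per block.
Q4_0_BITS_PER_WEIGHT = (32 * 4 + 16) / 32
print(f"Q4_0 effective bits per weight: {Q4_0_BITS_PER_WEIGHT}")
print(f"ideal weight-only ratio vs 16-bit: {16 / Q4_0_BITS_PER_WEIGHT:.2f}x")
```

Both models come out at roughly 3x smaller, a little under the ideal weight-only ratio. The published figures do not break down where the difference comes from, so we do not attribute it to anything specific.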

A breakdown of each format

BF16 is the quality baseline. E2B requires 9.6 GB and E4B requires 15 GB. It is a reference point, not a deployment target.

Q4_0 QAT is the general-purpose local format. E2B drops to 3.2 GB and E4B to 5 GB. QAT preserves more quality here than PTQ at the same size. This format fits consumer GPUs. Earlier E2B testing also ran on a Raspberry Pi 5 at INT4.

The mobile format is a specialized schema. It brings E2B to about 1 GB. It uses static activations, channel-wise quantization, and targeted 2-bit compression.

How the mobile schema works

The Google AI team developed four techniques for mobile hardware. Static activations precompute scaling during training, reducing on-device overhead. Channel-wise quantization matches the design of mobile accelerators. Targeted 2-bit quantization compresses only the token-generation layers. Embedding and KV cache optimizations trim working memory.

The core reasoning layers stay at higher precision. That saves power while cutting storage. Developers can also run text only and omit the audio and vision encoders. That further reduces memory for use cases that do not need multimodality.
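The channel-wise technique described above can be illustrated with a small NumPy experiment. This is a generic sketch of per-channel versus per-tensor scaling, not Google's mobile schema: the 4-bit width, the choice of rows as channels, and all function names are assumptions made for the example.

```python
import numpy as np

def quantize_dequantize(w: np.ndarray, scale, bits: int) -> np.ndarray:
    """Symmetric round-to-grid using the given scale (scalar or per-row)."""
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale

def per_tensor_scale(w: np.ndarray, bits: int):
    """One scale shared by every weight in the tensor."""
    qmax = 2 ** (bits - 1) - 1
    return np.max(np.abs(w)) / qmax

def per_channel_scale(w: np.ndarray, bits: int):
    """One scale per row (treated here as an output channel)."""
    qmax = 2 ** (bits - 1) - 1
    return np.max(np.abs(w), axis=1, keepdims=True) / qmax

rng = np.random.default_rng(0)
# Four channels whose magnitudes differ by orders of magnitude.
w = rng.normal(size=(4, 64)) * np.array([[0.01], [0.1], [1.0], [10.0]])

bits = 4
err_tensor = float(np.mean((w - quantize_dequantize(w, per_tensor_scale(w, bits), bits)) ** 2))
err_channel = float(np.mean((w - quantize_dequantize(w, per_channel_scale(w, bits), bits)) ** 2))
print(f"per-tensor MSE:  {err_tensor:.6f}")
print(f"per-channel MSE: {err_channel:.6f}")
```

With a single shared scale, the largest channel sets the grid and the small channels mostly round to zero. Giving each channel its own scale avoids that at the cost of storing a few extra scale values.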

Dimension scores

The scores rate each format's fit for on-device use. Memory is the only axis backed by hard published figures. Quality reflects Google's disclosed design, not measured Gemma 4 numbers. Each score has a one-line basis.

Dimension | BF16 | Q4_0 QAT | QAT Mobile
Memory footprint | 1 (heavy, 9.6 GB E2B) | 4 (3.2 GB E2B) | 5 (~1 GB E2B, text only)
Quality preservation | 5 (full-precision baseline) | 4 (QAT-preserved, close to baseline) | 3 (2-bit token layers, core kept at higher precision)
Inference speed | 2 (no quantization speedup) | 4 (4-bit speedup) | 5 (mobile-optimized static activations)
Deployment breadth | 4 (widely loadable but heavy) | 5 (llama.cpp, Ollama, LM Studio, vLLM, MLX) | 3 (LiteRT-LM, Transformers.js, edge-oriented)
Device accessibility | 1 (requires a large GPU) | 4 (consumer GPU, Raspberry Pi 5) | 5 (works on phones)
Total (/25) | 13 | 21 | 21
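The totals are a plain unweighted sum of the five rows. A short check, with the scores copied from the table and dictionary keys that are our own shorthand for the five dimensions:

```python
# Scores copied from the table; key names are shorthand labels.
scores = {
    "BF16":       {"memory": 1, "quality": 5, "speed": 2, "breadth": 4, "accessibility": 1},
    "Q4_0 QAT":   {"memory": 4, "quality": 4, "speed": 4, "breadth": 5, "accessibility": 4},
    "QAT Mobile": {"memory": 5, "quality": 3, "speed": 5, "breadth": 3, "accessibility": 5},
}

# Unweighted sum per format, out of 25.
totals = {fmt: sum(dims.values()) for fmt, dims in scores.items()}

# Every format that shares the top total.
best = max(totals.values())
leaders = sorted(fmt for fmt, total in totals.items() if total == best)

print(totals)
print("leaders:", leaders)
```

Because the sum is unweighted, any reader who cares more about one axis (say, quality over footprint) can reweight the rows and break the tie for their own use case.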

The winner

The result is a tie by design. Q4_0 QAT and QAT Mobile both score 21, but for different platforms. For phones, the mobile format leads. It reaches about 1 GB on E2B and targets mobile accelerators directly. For laptops and consumer GPUs, Q4_0 QAT is the default. BF16 remains a quality reference, not a local deployment choice.

Methodology and limitations

Memory figures are from Google's Gemma 4 documentation. The ~1 GB E2B figure comes from the QAT announcement. Quality is Google's stated claim. No independent quality numbers for Gemma 4 QAT were published at release. We did not run any models for this comparison. Developers should verify quality and performance on their own workloads before building.

Key Takeaways

  • Q4_0 QAT cuts Gemma 4 E2B to 3.2 GB and E4B to 5 GB, from 9.6 GB and 15 GB in BF16.
  • The new QAT mobile schema brings E2B to about 1 GB; text only without PLE goes under 1 GB.
  • QAT changes quality at a given size, not the size itself; the mobile format drives the further memory cuts.
  • Google claims higher quality than PTQ but did not publish benchmark numbers for Gemma 4 QAT when it was released.
  • Weights are available today on Hugging Face, with support for llama.cpp, Ollama, LM Studio, vLLM, MLX, and LiteRT-LM.

Marktechpost Visual Explainer

Marktechpost · Benchmark

Gemma 4 QAT: Comparing Q4_0 and the New Mobile Format

Google DeepMind released Quantization-Aware Training checkpoints for Gemma 4. We compared the edge model formats using published numbers.

Formats compared

BF16 (16-bit) · Q4_0 QAT (4-bit) · QAT Mobile

June 5, 2026

Comparison Task

What we have lined up for you

$ compare gemma-4 --models E2B,E4B \
    --formats BF16,Q4_0-QAT,MOBILE-QAT \
    --rank memory,quality,accessibility \
    --source published-only --no-self-run

Memory from the official Gemma 4 documentation. Quality from Google's stated claim. No models were run locally.

Format 1 of 3 · Reference

BF16 (16-bit)

13 / 25

The full-precision quality baseline. E2B requires 9.6 GB and E4B requires 15 GB.

Key point: a reference point, not a phone or laptop target.

Format 2 of 3 · Laptop / GPU

Q4_0 QAT (4-bit)

21 / 25

The general-purpose local format. E2B drops to 3.2 GB and E4B to 5 GB.

Key point: QAT preserves more quality than PTQ at the same 4-bit size.

Format 3 of 3 · Mobile

QAT Mobile

21 / 25

A specialized schema for the edge. It brings E2B to about 1 GB.

Key point: 2-bit in the token-generation layers; the core reasoning layers stay at higher precision.

Leaderboard

Full scores

Dimension | BF16 | Q4_0 QAT | QAT Mobile
Memory footprint | 1 | 4 | 5
Quality preservation | 5 | 4 | 3
Inference speed | 2 | 4 | 5
Deployment breadth | 4 | 5 | 3
Device accessibility | 1 | 4 | 5
Total | 13 | 21 | 21

A tie by design: Q4_0 wins laptops and GPUs; mobile wins phones.

Key Takeaways

What developers need to know

  • Q4_0 QAT cuts E2B to 3.2 GB and E4B to 5 GB, down from 9.6 GB and 15 GB in BF16.
  • The new QAT mobile schema brings E2B to about 1 GB; text only without PLE goes under 1 GB.
  • QAT changes quality at a given size, not the size itself; the mobile format drives the further memory cuts.
  • Google claims higher quality than PTQ but has not published Gemma 4 QAT numbers.
  • Weights are available today on Hugging Face, with support for llama.cpp, Ollama, vLLM, and MLX.
