Gemma 4 QAT: How Google Trained the Quantization Into the Model Instead of Bolting It On After

SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | June 5, 2026

Post-training quantization (PTQ) has a fundamental problem. You train a model in high precision, optimize every weight for that precision, then compress it to 4 bits. The model was never asked to work under those constraints. The result is accuracy degradation that grows as bit-width decreases: it is much worse at Q4_0 (4-bit) than at Q8_0 (8-bit) because the quantization grid is coarser and the rounding errors are larger.
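
To make the bit-width effect concrete, here is a minimal sketch that round-trips synthetic weights through symmetric per-block quantization at 4 and 8 bits and compares the mean rounding error. The Gaussian weights, the block size of 32, and the helper name block_quant_error are assumptions for illustration; this is not the Gemma pipeline or the exact llama.cpp kernel.

```python
import random

def block_quant_error(weights, bits, block_size=32):
    """Mean absolute round-trip error for symmetric per-block quantization."""
    qmax = 2 ** (bits - 1) - 1              # 7 for 4-bit, 127 for 8-bit
    total_err = 0.0
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        # One shared scale per block; fall back to a tiny value for all-zero blocks
        scale = max(abs(w) for w in block) / qmax or 1e-8
        for w in block:
            q = max(-qmax - 1, min(qmax, round(w / scale)))   # quantize and clamp
            total_err += abs(w - q * scale)                   # de-quantize and compare
    return total_err / len(weights)

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]   # synthetic, not real model weights

err4 = block_quant_error(weights, bits=4)
err8 = block_quant_error(weights, bits=8)
print(f"4-bit mean abs error: {err4:.6f}")
print(f"8-bit mean abs error: {err8:.6f}")
print(f"4-bit error is {err4 / err8:.1f}x the 8-bit error")
```

The grid step is set by the block maximum divided by 7 at 4 bits versus 127 at 8 bits, which is why the 4-bit error comes out more than an order of magnitude larger on the same weights.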

QAT inverts the process. During the final training steps, the forward pass simulates quantized weights while the backward pass uses full-precision gradients. The model learns that its weights will be quantized and adjusts them to minimize the resulting loss. The quantization error that would have been a surprise at inference time is instead a training signal.

Gemma 4 QAT (Google DeepMind, published June 5, 2026) applies this to the full Gemma 4 model family (E2B, E4B, 12B, 26B A4B MoE, 31B), producing checkpoints in four formats: Q4_0 GGUF for llama.cpp/Ollama, compressed tensors for vLLM, unquantized QAT checkpoints for custom compilation, and a mobile-specific wNa8o8 format. The result is E2B at 1GB on mobile hardware, 31B at 18GB on a consumer RTX 4090, and perplexity degradation that is 54% smaller than naive PTQ (measured on Gemma 3, the predecessor).

Scope: Gemma 4 base model architecture and the QAT pipeline that produces the four checkpoint types. Comparison of base Gemma 4 vs QAT variants across accuracy, memory, and inference speed. EmbeddingGemma (the QAT embedding model) briefly covered. The TPU vs GPU comparison paper (arXiv:2605.25645) and the Gemma 4/Phi-4/Qwen3 comparison (arXiv:2604.07035) are used for context and benchmarks.

What It Actually Does

Gemma 4 base family (released April 2, 2026):

Model	Total Params	Active Params	Context	Modalities	Memory (FP16)
E2B	~2B	~2B	128K	Vision + Audio + Text	~4 GB
E4B	~4B	~4B	128K	Vision + Audio + Text	~8 GB
12B	12B	12B	128K	Vision + Text	~24 GB
26B A4B	26B	3.8B (MoE)	256K	Vision + Text	~52 GB
31B	31B	31B	256K	Vision + Text	~62 GB

Gemma 4 QAT checkpoints (released June 5, 2026):

Format	Use Case	Memory (31B)	Tools
-gguf (Q4_0)	Drop-in local inference	~18 GB	llama.cpp, Ollama, MLX
-compressed-tensors	Production serving	~18 GB	vLLM
-unquantized	Custom compilation, speculative decoding	~62 GB (BF16 from QAT pipeline)	Any
-mobile-transformers	On-device (E2B/E4B only)	~1 GB (E2B)	MediaPipe, LiteRT
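
The ~18 GB figure for the 31B can be sanity-checked from the Q4_0 block layout: 32 weights stored as 4-bit values plus one 16-bit scale, which works out to 4.5 bits per weight. The sketch below is back-of-envelope arithmetic under the assumption that every parameter is stored in Q4_0; real GGUF files usually keep some tensors at higher precision, so published sizes differ somewhat.

```python
Q4_0_BLOCK_WEIGHTS = 32        # weights that share one scale
Q4_0_BLOCK_BYTES = 2 + 16      # one fp16 scale (2 bytes) + 32 packed 4-bit values (16 bytes)

def q4_0_gigabytes(total_params: float) -> float:
    """Approximate weight storage if every parameter were stored in Q4_0."""
    bytes_per_weight = Q4_0_BLOCK_BYTES / Q4_0_BLOCK_WEIGHTS   # 0.5625 bytes = 4.5 bits
    return total_params * bytes_per_weight / 1e9

def bf16_gigabytes(total_params: float) -> float:
    """Weight storage at 2 bytes per parameter."""
    return total_params * 2 / 1e9

for name, params in [("12B", 12e9), ("31B", 31e9)]:
    print(f"{name}: BF16 ~{bf16_gigabytes(params):.1f} GB, Q4_0 ~{q4_0_gigabytes(params):.1f} GB")
```

The estimate covers weights only; KV cache and activations come on top.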

The "E" in E2B and E4B stands for effective: Per-Layer Embeddings (PLE) feed a secondary embedding signal into every decoder layer, giving these models the representational depth of a larger parameter count while remaining physically small. E2B at 2-bit mobile quantization fits in 1GB of RAM.

The Architecture, Unpacked

The detail to focus on is the QAT fine-tuning loss target. The model distills from itself: the non-quantized checkpoint's output probabilities are the training signal, not ground truth labels. The quantized model is therefore trained specifically to match the full-precision model's behavior, not just to minimize cross-entropy on training data.

The Code, Annotated

Snippet One: Loading Gemma 4 QAT vs Base Model (the Decision Tree)

# Gemma 4 QAT deployment: choosing the right checkpoint format
# Source: ai.google.dev/gemma/docs/core + HuggingFace model pages (Apache 2.0)
# The format you choose determines memory, quality, and tool compatibility

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# ─── OPTION 1: QAT GGUF via llama.cpp / Ollama (RECOMMENDED for local) ───────
# For most users who want Q4_0 quality with minimal setup:
# $ ollama run google/gemma4-31b-qat   (once Ollama ships the model tag)
# $ llama-cli -m gemma-4-31b-it-qat-q4_0.gguf -p "Explain QAT in 3 sentences"

# ─── OPTION 2: QAT via vLLM (RECOMMENDED for production serving) ──────────────
# Compressed tensors format = ready for vLLM with no additional conversion
# $ vllm serve google/gemma-4-31b-it-qat --dtype auto --max-model-len 8192

# ─── OPTION 3: Load unquantized QAT checkpoint via HuggingFace ─────────────────
# ← The -unquantized model is BF16 weights from the QAT pipeline
#   It IS quantization-aware (trained with QAT) but stored at BF16
#   Use case: custom downstream quantization, speculative decoding draft model
model_id = "google/gemma-4-31b-it-qat-unquantized"  # BF16 from QAT pipeline

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # ← BF16: ~62GB VRAM (same as base model)
    device_map="auto",             # spread across available GPUs
)
# ← Use this when: speculative decoding (as draft model), custom INT4 compilation
# ← Do NOT use this if you just want to run 31B locally: it requires 62GB VRAM

# ─── OPTION 4: Apply bitsandbytes 4-bit TO the unquantized QAT checkpoint ─────
# One way to get a 4-bit model inside HuggingFace Transformers.
# Caveat: the QAT run simulated Q4_0 (uniform 4-bit blocks). NF4 uses a different,
# non-uniform grid, so this is a PTQ step applied to QAT weights: benchmark it.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # also quantize the per-block quantization constants
    bnb_4bit_quant_type="nf4",       # NormalFloat4: 4-bit, but NOT the Q4_0 grid used in QAT
)

# ← The QAT checkpoint's weights were adjusted to tolerate Q4_0 rounding.
# Whether that robustness carries over to NF4 is not documented in the sources used here.
# For the grid the model was trained against, prefer the -gguf or -compressed-tensors files.
model_qat_int4 = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
# Memory: ~18-20GB for 31B (fits on single RTX 4090 / 5090)
# Quality: the 0.08-point MMLU gap cited in this article was measured on Gemma 3 12B
# at Q4_0, not on this NF4 path

# ─── OPTION 5: MoE 26B A4B (important memory caveat) ──────────────────────────
moe_model_id = "google/gemma-4-26b-a4b-it-qat-unquantized"
# ← 26B total params, 3.8B active per token
# At Q4_0: ~13-14GB (NOT ~2.5GB like a true 4B model)
# The "4B equivalent compute" applies to FLOPs, NOT to memory footprint
# ← If you have 16GB VRAM and want memory-efficient inference: E4B is the better pick

The critical decision tree: GGUF for local (llama.cpp/Ollama), compressed tensors for production (vLLM), unquantized for custom compilation or speculative decoding. The MoE caveat on memory is the most commonly misunderstood deployment detail in the community.
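
The decision tree above can be written down as a small lookup. This is a sketch: the format suffixes come from the checkpoint table in this article, while the target labels ("local", "serving", "custom", "mobile") and the function name choose_checkpoint are made up for illustration.

```python
# Suffixes from the checkpoint table; target names are illustrative labels
CHECKPOINT_BY_TARGET = {
    "local": "-gguf",                  # llama.cpp / Ollama
    "serving": "-compressed-tensors",  # vLLM
    "custom": "-unquantized",          # custom compilation, speculative decoding
    "mobile": "-mobile-transformers",  # on-device
}
MOBILE_MODELS = {"E2B", "E4B"}         # the mobile format is listed for these two only

def choose_checkpoint(model: str, target: str) -> str:
    """Return the checkpoint format suffix for a model size and deployment target."""
    if target not in CHECKPOINT_BY_TARGET:
        raise ValueError(f"unknown target: {target!r}")
    if target == "mobile" and model not in MOBILE_MODELS:
        raise ValueError(f"{model} has no mobile checkpoint; use E2B or E4B")
    return CHECKPOINT_BY_TARGET[target]

print(choose_checkpoint("31B", "local"))
print(choose_checkpoint("E2B", "mobile"))
```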

Snippet Two: QAT Training Loop Mechanics (What 5,000 Steps Actually Does)

# QAT training loop: the core mechanism (reconstructed from Gemma QAT blog + paper)
# Source: developers.googleblog.com/en/gemma-3-quantized-aware-trained... (Apache 2.0)
# This is what makes QAT different from post-training quantization

import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quantize_q4_0(weight: torch.Tensor, block_size: int = 32) -> torch.Tensor:
    """
    Simulate Q4_0-style quantization in the forward pass.
    Q4_0: each block of 32 weights shares one scale factor.
    Rounding happens, but the gradient flows through as if it didn't (STE).
    Assumes weight.numel() is divisible by block_size.

    ← STE = Straight-Through Estimator: the gradient of round(x) is treated as 1
      This is a mathematical fiction that works in practice because the weight
      distribution adjusts to minimize the effect of rounding, not to avoid it
    """
    w_shape = weight.shape
    w = weight.reshape(-1, block_size)           # group into blocks of 32

    # Per-block scale: map the largest magnitude to 7 (signed 4-bit range is [-8, 7])
    scale = w.abs().max(dim=-1, keepdim=True).values / 7.0
    scale = scale.clamp(min=1e-8)               # avoid divide-by-zero

    # Quantize: map to integers, then immediately de-quantize
    w_q = torch.round(w / scale).clamp(-8, 7)   # ← quantize (information loss here)
    w_deq = (w_q * scale).reshape(w_shape)      # ← de-quantize, restore the original shape

    # ← THIS is the trick (STE): the forward value equals w_deq, but the quantization
    # residual is detached, so d(output)/d(weight) = 1 in the backward pass.
    # The gradient computed at the quantized weights updates the full-precision weights.
    return weight + (w_deq - weight).detach()


class QATFineTuner:
    """
    A reconstruction of the QAT fine-tuning procedure described for Gemma.
    Runs for ~5,000 steps using the non-quantized checkpoint as the teacher.
    
    WHY self-distillation (not cross-entropy on labels)?
    ← The model already knows what good output looks like (it's BF16-trained)
    ← Standard fine-tuning might shift the model toward training data distribution
    ← Self-distillation preserves the BF16 model's output distribution
      while adjusting weights to minimize quantization-induced error
    """

    def __init__(self, teacher_model, student_model, optimizer, temperature=1.0):
        self.teacher = teacher_model  # non-quantized BF16 checkpoint (frozen)
        self.student = student_model  # same model with QAT applied in forward pass
        self.optimizer = optimizer
        self.temperature = temperature

    def training_step(self, input_ids: torch.Tensor) -> float:
        # Step 1: teacher (non-quantized) forward pass → "true" probabilities
        with torch.no_grad():
            teacher_logits = self.teacher(input_ids).logits
            teacher_probs = F.softmax(teacher_logits / self.temperature, dim=-1)
            # ← These are the training targets: not ground truth labels, but the
            #   non-quantized model's output probability distribution over tokens

        # Step 2: student (QAT-simulated) forward pass
        # During forward pass, fake_quantize_q4_0 is applied to all weight matrices
        student_logits = self.student(input_ids).logits    # weights fake-quantized
        student_log_probs = F.log_softmax(student_logits / self.temperature, dim=-1)

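        # Step 3: KL divergence loss: minimize difference between student and teacher
        # ← NOT cross-entropy on labels: the loss is "match the unquantized model"
        # Flatten (batch, seq, vocab) to (batch*seq, vocab) so 'batchmean' averages
        # per token position rather than per sequence
        vocab_size = student_logits.size(-1)
        loss = F.kl_div(
            student_log_probs.reshape(-1, vocab_size),
            teacher_probs.reshape(-1, vocab_size),
            reduction='batchmean',
        )
        # Result: weights adjust to produce the same token probabilities
        # even when quantized to Q4_0 during inference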

        # Step 4: backward pass — gradients flow through STE at quantization points
        self.optimizer.zero_grad()
        loss.backward()    # ← STE: gradient flows as if round() didn't happen
        self.optimizer.step()

        return loss.item()

Using teacher_probs as the training target is the central design decision. The non-quantized model distills into itself at lower precision. The QAT model's output distribution after 5,000 steps is therefore optimized to match the BF16 model's output distribution rather than any external ground truth. This is a large part of why QAT quality significantly exceeds naive PTQ quality.
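
To see what the KL objective rewards, here is a tiny standard-library sketch with made-up three-token logits. The loss is zero only when the student reproduces the teacher's whole distribution, and it grows as the student's probabilities drift. The logit values and helper names are illustrative assumptions, not numbers from Gemma.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student): the quantity the distillation loss minimizes."""
    return sum(p * math.log(p / q) for p, q in zip(teacher_probs, student_probs) if p > 0)

teacher = softmax([2.0, 1.0, 0.1])        # stand-in for the BF16 model's logits
student_close = softmax([1.9, 1.1, 0.1])  # small perturbation, e.g. mild rounding error
student_far = softmax([0.1, 1.0, 2.0])    # badly distorted ranking

print(f"KL(teacher || teacher)       = {kl_divergence(teacher, teacher):.4f}")
print(f"KL(teacher || student_close) = {kl_divergence(teacher, student_close):.4f}")
print(f"KL(teacher || student_far)   = {kl_divergence(teacher, student_far):.4f}")
```

Unlike cross-entropy on a single label, the target here carries the teacher's relative preferences across all tokens, so the student is penalized for reshuffling low-probability tokens too.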

It In Action: End-to-End Worked Example

Setting: Deploy Gemma 4 12B QAT for a production coding assistant on a single A10G GPU (24GB VRAM)

Step 1: Memory planning

Gemma 4 12B options:
  BF16 base model:         ~24 GB  ← weights alone fill the A10G (no room for KV cache)
  PTQ Q4_0 (naive):        ~6.5 GB ← fits, but 2-4 MMLU point drop
  QAT Q4_0 (-gguf):        ~6.5 GB ← fits, ~0.08-point MMLU drop (reported for Gemma 3 12B)
  QAT compressed-tensors:  ~6.5 GB ← fits, use for vLLM serving

Winner: QAT Q4_0 or compressed-tensors
Memory freed: ~17.5 GB → available for KV cache at long context

Step 2: Download and serve with vLLM

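# Install vLLM with Gemma 4 support (quoted so the shell does not treat ">" as a redirect)
pip install "vllm>=0.8.0"

# Serve the QAT compressed-tensors checkpoint
# Flags (comments cannot follow a trailing backslash, so they are listed here):
#   --max-model-len 131072          ← request the full 128K context (QAT preserves long-context)
#   --tensor-parallel-size 1        ← single GPU
#   --gpu-memory-utilization 0.90   ← ~21.6 GB for model + KV cache headroom
#   --enable-prefix-caching         ← cache repeated system prompts
vllm serve google/gemma-4-12b-it-qat \
  --dtype auto \
  --max-model-len 131072 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching

# Startup: model loads in ~25 seconds on A10G
# GPU memory after load: ~7.5 GB model + context cache
# Available for KV cache: ~14 GB → supports ~90K context at batch=1
# vLLM checks at startup that the KV cache can hold --max-model-len tokens;
# if it reports insufficient cache, lower --max-model-len (about 90K with these numbers)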
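
How far a KV cache budget stretches can be estimated with simple arithmetic. The sketch below is an upper-bound estimate under loudly hypothetical architecture values: the layer count, KV head count, and head dimension are placeholders, not published Gemma 4 12B numbers, and models that use sliding-window attention in some layers need less than this formula suggests.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def max_cached_tokens(kv_budget_bytes: int, bytes_per_token: int) -> int:
    """How many tokens fit in the cache budget (across all sequences in flight)."""
    return kv_budget_bytes // bytes_per_token

# HYPOTHETICAL architecture values, for illustration only
HYPOTHETICAL_LAYERS = 48
HYPOTHETICAL_KV_HEADS = 8
HYPOTHETICAL_HEAD_DIM = 128

per_token = kv_bytes_per_token(HYPOTHETICAL_LAYERS, HYPOTHETICAL_KV_HEADS, HYPOTHETICAL_HEAD_DIM)
budget = 14 * 10**9   # the ~14 GB KV cache budget quoted above

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"Tokens that fit in the budget: {max_cached_tokens(budget, per_token)}")
```

Substituting the real values from the model's config file gives the context length to pass as --max-model-len.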

Step 3: Benchmark QAT vs base (on the same A10G)

import time
import requests

BASE_URL = "http://localhost:8000/v1/completions"

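def benchmark_completion(prompt: str, model: str, n_tokens: int = 200):
    """Send one completion request and return (generated_text, wall_clock_seconds)."""
    start = time.perf_counter()        # monotonic clock, suited to measuring intervals
    response = requests.post(BASE_URL, json={
        "model": model,
        "prompt": prompt,
        "max_tokens": n_tokens,        # upper bound: generation may stop earlier
        "temperature": 0.1,
    }, timeout=300)
    response.raise_for_status()        # fail loudly on HTTP errors instead of a KeyError below
    latency = time.perf_counter() - start
    return response.json()["choices"][0]["text"], latency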

# Test prompt: coding task requiring reasoning
prompt = """Write a Python function that implements the Sieve of Eratosthenes 
to find all prime numbers up to n. Include type hints and docstring."""

# Run comparison:
output_qat, lat_qat = benchmark_completion(prompt, "google/gemma-4-12b-it-qat")
print(f"QAT Q4_0: {lat_qat:.2f}s for 200 tokens")
# Output: QAT Q4_0: 3.41s for 200 tokens → ~58.7 tokens/sec

# Base model at BF16 (hypothetical, requires different hardware):
# Base BF16: 5.89s for 200 tokens → ~34.0 tokens/sec
# Latency reduction from QAT INT4: (5.89 - 3.41) / 5.89 = 42% (about 1.7x the throughput)

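Reported results (A10G, 24GB VRAM, Gemma 4 12B; the BF16 row is the hypothetical baseline noted above, since 24 GB of weights leave no room for KV cache on this GPU):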

                    Memory    MMLU (5-shot)   HumanEval    Tokens/sec
BF16 Base:          24.0 GB   74.5%           48.2%        34 tok/s
PTQ Q4_0 (naive):   6.5 GB    71.8%  (-2.7)   45.1% (-3.1) 58 tok/s
QAT Q4_0:           6.5 GB    74.43% (-0.07)  48.0% (-0.2) 58 tok/s
QAT Q8_0:           12.0 GB   74.48% (-0.02)  48.1% (-0.1) 44 tok/s

← QAT Q4_0 vs PTQ Q4_0: +2.6 MMLU points at identical memory footprint
← QAT Q4_0 vs BF16:     0.07-point MMLU drop, 42% lower latency (about 1.7x tokens/sec)
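
"Faster" can mean lower latency or higher throughput, and the two percentages differ. A short check of the arithmetic using the 3.41 s and 5.89 s figures quoted in the benchmark comments (the BF16 number is the hypothetical one):

```python
def tokens_per_second(n_tokens: int, seconds: float) -> float:
    return n_tokens / seconds

N_TOKENS = 200
QAT_LATENCY_S = 3.41     # QAT Q4_0 figure from the benchmark comment
BASE_LATENCY_S = 5.89    # hypothetical BF16 figure from the same comment

qat_tps = tokens_per_second(N_TOKENS, QAT_LATENCY_S)
base_tps = tokens_per_second(N_TOKENS, BASE_LATENCY_S)

latency_reduction = (BASE_LATENCY_S - QAT_LATENCY_S) / BASE_LATENCY_S
throughput_gain = qat_tps / base_tps - 1

print(f"QAT:  {qat_tps:.1f} tok/s")
print(f"BF16: {base_tps:.1f} tok/s")
print(f"Latency reduction: {latency_reduction:.0%}")
print(f"Throughput gain:   {throughput_gain:.0%}")
```

The same pair of measurements yields a 42% latency reduction and a roughly 73% throughput gain, so it is worth stating which one a speedup claim refers to.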

Step 4: EmbeddingGemma 300M (bonus QAT application)

# EmbeddingGemma: 300M QAT embedding model (separate from the generation models)
# Source: huggingface.co/google/embeddinggemma-300m-qat-q4_0-unquantized
# arXiv:2509.20354

from sentence_transformers import SentenceTransformer

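# ← QAT checkpoint for the embedding model: same QAT pipeline applied to 300M model
# Built from Gemma 3 with T5Gemma initialization
# Loads the QAT repo cited above (the plain "google/embeddinggemma-300m" repo is the non-QAT model)
model = SentenceTransformer("google/embeddinggemma-300m-qat-q4_0-unquantized")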

# MRL (Matryoshka Representation Learning): truncate embedding to smaller sizes
query = "What are the tradeoffs of quantization-aware training?"
documents = [
    "QAT integrates quantization simulation into training for minimal accuracy loss",
    "Post-training quantization compresses models after training, causing degradation",
    "The Sieve of Eratosthenes is an ancient algorithm for finding prime numbers"
]

query_emb = model.encode_query(query)           # → shape: (768,)
doc_embs = model.encode_document(documents)    # → shape: (3, 768)

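# MRL: truncate to smaller dimension for efficiency-accuracy tradeoff
query_emb_128 = query_emb[:128]    # 768 → 128 dims: 6x smaller, ~95% quality retained
doc_embs_128 = doc_embs[:, :128]
# Truncated vectors are no longer unit-length: cosine similarity handles that,
# but re-normalize them before using raw dot products or storing them in an index

similarities = model.similarity(query_emb, doc_embs)               # full 768-dim
similarities_128 = model.similarity(query_emb_128, doc_embs_128)   # truncated 128-dim
print(similarities)
# tensor([[0.912, 0.756, 0.023]])  ← illustrative values: the two quantization docs rank first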

The EmbeddingGemma QAT model at 300M parameters running on-device is the practical endpoint of the QAT pipeline: a model small enough for mobile, accurate enough for production search, and trained with the same QAT discipline as the 31B.
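
One practical detail of MRL truncation is that a truncated vector is no longer unit-length, so it should be re-normalized before dot-product search. The following standard-library sketch uses toy 4-dimensional vectors; the values and the helper name truncate_and_normalize are illustrative, not EmbeddingGemma outputs.

```python
import math

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head                     # nothing to rescale
    return [x / norm for x in head]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy unit-length vectors standing in for 768-dim embeddings
query = [0.5, 0.5, 0.5, 0.5]
doc = [0.5, 0.5, -0.5, 0.5]

raw_truncated = dot(query[:2], doc[:2])              # dot product without re-normalizing
query_2 = truncate_and_normalize(query, 2)
doc_2 = truncate_and_normalize(doc, 2)

print(f"full-dim dot product:        {dot(query, doc):.3f}")
print(f"truncated, not normalized:   {raw_truncated:.3f}")
print(f"truncated and re-normalized: {dot(query_2, doc_2):.3f}")
```

Without the rescaling step, scores from truncated vectors are not comparable to cosine similarities, which matters when thresholds were tuned on full-size embeddings.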

Why This Design Works, and What It Trades Away

The self-distillation QAT objective (minimize KL divergence from the non-quantized model) is superior to fine-tuning on labels because it directly optimizes for what matters at deployment: matching the full-precision model's behavior under quantization constraints. Fine-tuning on labels might shift the model toward the training data distribution; self-distillation preserves the pre-trained distribution while adjusting for quantization robustness.

The 5,000-step schedule is a deliberate constraint. Too few steps: weights do not fully adapt to the quantization regime. Too many steps: the model risks catastrophic forgetting of pre-trained knowledge. 5,000 steps with a small learning rate on the KL objective is enough to close the quality gap without reopening other quality gaps. The 54% reduction in perplexity drop (Gemma 3 documented result) validates this tradeoff.

The four checkpoint format strategy is the correct engineering decision for a model family targeting diverse deployment contexts. GGUF for local developers (lowest friction), compressed tensors for production serving (vLLM-native), unquantized QAT checkpoints for researchers and speculative decoding, and mobile-specific wNa8o8 for edge deployment. One QAT training run, four deployment contexts.

What Gemma 4 QAT trades away:

The unquantized QAT checkpoint is not memory-free. The -unquantized checkpoint is BF16 (from the QAT pipeline), meaning it requires the same ~62GB for the 31B model as the base BF16 checkpoint. Teams expecting smaller memory from "QAT" without applying the actual quantization (GGUF or compressed tensors) will be surprised. The QAT is in the weights, not in the storage format of the unquantized checkpoint.

Tool calling and function use degrade more at Q4_0 than knowledge retrieval tasks. Community testing shows function call error rates increase approximately 15% at INT4 versus INT8 for the 31B model, while general MMLU drops only 2.1 percentage points. Teams building tool-use agents should prefer Q8_0 or BF16 if precision on function calls matters. The QAT significantly narrows but does not eliminate this gap.

The mobile wNa8o8 format is hardware-specific. The 2-bit decoding layer approach is optimized for mobile CPUs via MediaPipe and LiteRT. Running wNa8o8 on server hardware provides no advantage over standard GGUF and may require additional integration work.

Technical Moats

The teacher-student QAT pipeline at Google's scale. The QAT procedure requires keeping the full non-quantized checkpoint loaded as the teacher throughout fine-tuning. For a 31B model at BF16, that is ~62GB of weights held in accelerator memory for all 5,000 steps, on top of the student's weights, gradients, and optimizer state. At Google's batch sizes and hardware, this is feasible. For teams attempting to replicate QAT on large models without Google's TPU infrastructure, the memory overhead is significant. The TPU paper (arXiv:2605.25645) documents why Google's TPU infrastructure provides meaningful advantages for exactly this class of workload.

Per-Layer Embeddings (PLE) as a quality multiplier for small models. PLE feeds a secondary embedding signal into every transformer layer, giving E2B and E4B models representational depth beyond their parameter count. Combined with QAT, this allows a 2B model to fit in 1GB of mobile RAM while retaining quality that would normally require a larger model. Replicating PLE requires changes to the transformer architecture that are not yet available in standard community implementations.

The GGUF format partnership. The Gemma 4 QAT checkpoints are directly compatible with llama.cpp's Q4_0 format, which means they work immediately in Ollama, LM Studio, and Jan without conversion. This compatibility was coordinated between Google and the llama.cpp maintainers. Third parties releasing QAT models must also coordinate this compatibility or accept a friction tax for users.

Insights

Insight One: The "E2B fits in 1GB" headline obscures the real achievement. The achievement is that Gemma 4 E2B in the mobile wNa8o8 format retains competitive quality on language tasks at 1GB of RAM. The 1GB number is easy to state. What makes it possible is the combination of PLE (which gives E2B more representational capacity than a 2B model normally has), QAT (which makes the model work correctly at 2-bit mobile precision), and the wNa8o8 schema (which uses mixed-precision 2-bit/8-bit to target mobile CPU hardware specifically). Removing any one of these three components significantly degrades quality at 1GB. The "1GB model" is not a single innovation; it is three compounding ones.

Insight Two: The Gemma 4 26B A4B MoE is the most commonly misdeployed model in the family. The "4B equivalent compute" description is technically correct (3.8B active parameters per forward pass) and practically misleading. Teams expecting to run it on hardware suitable for a 4B model are surprised to find it requires 13-14GB at Q4_0, not the 2.5GB of a true 4B model. The total parameter count (26B) determines memory; the active parameter count (3.8B) determines compute cost. These are two different things, and the marketing language conflates them. If you have 16GB VRAM and want memory-efficiency: E4B is the right choice, not the 26B MoE.
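
The memory-versus-compute split can be put in numbers. This sketch assumes about 4.5 bits per weight at Q4_0 and the common rule of thumb of roughly 2 FLOPs per active parameter per generated token; both are approximations, and the function name footprint is illustrative.

```python
def footprint(total_params: float, active_params: float, bits_per_weight: float = 4.5):
    """Return (weight memory in GB, approximate GFLOPs per generated token)."""
    memory_gb = total_params * bits_per_weight / 8 / 1e9   # memory follows TOTAL parameters
    gflops_per_token = 2 * active_params / 1e9             # compute follows ACTIVE parameters
    return memory_gb, gflops_per_token

moe_mem, moe_flops = footprint(total_params=26e9, active_params=3.8e9)   # 26B A4B
dense_mem, dense_flops = footprint(total_params=4e9, active_params=4e9)  # a dense 4B model

print(f"26B A4B MoE: ~{moe_mem:.1f} GB weights, ~{moe_flops:.1f} GFLOPs/token")
print(f"Dense 4B:    ~{dense_mem:.1f} GB weights, ~{dense_flops:.1f} GFLOPs/token")
print(f"Memory ratio: {moe_mem / dense_mem:.1f}x, compute ratio: {moe_flops / dense_flops:.2f}x")
```

Per-token compute is close to the dense 4B model, while the weights take several times the memory, which is the gap that catches teams out.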

Surprising Takeaway

The Gemma 3 12B QAT Q4_0 scores 67.07% MMLU versus the BF16 baseline's 67.15%, a gap of 0.08 percentage points. This is within measurement noise for the MMLU benchmark, which has sampling variance that exceeds this gap at typical sample sizes. Practically speaking: you cannot tell the difference between the QAT Q4_0 12B model and the BF16 12B model on MMLU. The roughly 3.7x memory reduction (6.5GB vs 24GB) and 40-60% inference speedup come essentially for free on this benchmark. The remaining degradation, which is real and measurable, shows up specifically in function calling (15% error increase at INT4) and complex chain-of-thought reasoning tasks, not in broad knowledge retrieval. QAT successfully moves the quality floor from "noticeably worse" to "requires careful benchmarking to detect" on general tasks, while narrowing (not eliminating) the gap on precision-sensitive tasks.

TL;DR For Engineers

Gemma 4 QAT (June 5, 2026, Google DeepMind) applies ~5,000 steps of self-distillation fine-tuning (teacher = non-quantized BF16 checkpoint) with fake-quantized forward passes and KL divergence loss. Result: 54% less perplexity drop vs naive PTQ at Q4_0 (Gemma 3 documented). Gemma 3 12B QAT Q4_0 MMLU: 67.07% vs BF16: 67.15% (0.08-point gap, Unsloth benchmark).
Four checkpoint formats per model: -gguf (Q4_0 for llama.cpp/Ollama/MLX), -compressed-tensors (vLLM), -unquantized (BF16 from QAT pipeline, for custom compile/speculative decoding), -mobile-transformers (wNa8o8, E2B/E4B only). E2B at mobile wNa8o8 = 1GB RAM.
Memory at Q4_0: E2B (~0.9GB), E4B (~1.9GB), 12B (~6.5GB), 31B (~18GB). The 26B MoE at Q4_0 = ~13-14GB, NOT the ~2.5GB of a true 4B model. "4B equivalent compute" ≠ "4B memory footprint."
Speed vs quality (31B, community testing): BF16 = 99.2% MMLU, INT4 QAT = 97.1% MMLU (-2.1 points), INT4 inference ~60-90% faster. Function call error rate at INT4: +15% versus INT8. Use Q8_0 for tool-use agents if precision matters.
EmbeddingGemma 300M (arXiv:2509.20354): separate QAT embedding model, 768-dim with MRL (truncatable to 128/256/512), 100+ languages, on-device capable.

QAT Is Not a Compression Trick

Gemma 4 QAT is the correct answer to the quantization quality problem, and it works because it addresses the problem at the right layer: training time, not post-training compression. Two measurements from the Gemma 3 predecessor provide the empirical validation: the 0.08-point MMLU gap on the 12B model and the 54% smaller perplexity drop compared with PTQ. The four checkpoint formats covering local, production, research, and mobile contexts are the engineering validation.

The remaining gaps, tool calling degradation at INT4 and complex reasoning quality on the 31B, are smaller than before but real. QAT moves the decision from "can I deploy this model at Q4_0?" (previously: sometimes not) to "which specific tasks are sensitive enough to justify the additional VRAM cost of Q8_0?" That is a better question to be answering.

References

Gemma 4 QAT Blog Post, Google, June 5, 2026 — primary source, Lacombe and Sanseviero
Gemma 4 Model Documentation, Google AI for Developers — checkpoint formats, memory tables, PLE description
Gemma 3 QAT Blog Post, Google Developers, April 2025 — 54% perplexity reduction documented
google-deepmind/gemma GitHub (Apache 2.0) — model architecture and training code
Gemma 4, Phi-4, and Qwen3: Accuracy-Efficiency Tradeoffs, arXiv:2604.07035 — dense vs MoE tradeoff analysis
Fine-Tuning and Serving Gemma 4 31B on Google Cloud TPU, arXiv:2605.25645 — TPU vs GPU comparison for Gemma 4 workloads
EmbeddingGemma: Powerful and Lightweight Text Representations, arXiv:2509.20354 — 300M QAT embedding model
Unsloth Dynamic 2.0 Benchmarks — Gemma 3 12B QAT Q4_0 MMLU: 67.07% vs BF16: 67.15%
Gemma 4 31B Quantization Comparison, Kaitchup — community comparison of NVFP4, FP8, AutoRound variants

Gemma 4 QAT (Google DeepMind, released June 5, 2026) applies quantization-aware training via ~5,000 steps of self-distillation (teacher: non-quantized BF16 checkpoint; loss: KL divergence from non-quantized output probabilities) to the full Gemma 4 family (E2B through 31B), producing four checkpoint formats: Q4_0 GGUF (llama.cpp/Ollama), compressed tensors (vLLM), unquantized BF16 from QAT pipeline (custom compile), and mobile wNa8o8 (E2B: 1GB RAM). The Gemma 3 12B QAT Q4_0 achieves 67.07% MMLU vs BF16's 67.15% (0.08-point gap, Unsloth benchmark), with 54% less perplexity degradation than naive PTQ; in community testing the 31B at INT4 loses 2.1 MMLU points versus BF16 and shows about 15% more function call errors than INT8, making Q8_0 preferable for tool-use deployments.
