Gemma 4 QAT: int4 weights without the perplexity tax

What happened

Google dropped quantization-aware-trained (QAT) variants of the Gemma 4 family — 2B, 9B, and 27B — on Hugging Face and Kaggle, with int4 weights that the company claims sit within rounding distance of the bf16 originals on standard benchmarks. The pitch is straightforward: instead of training in bf16 and letting users post-train quantize (PTQ) to int4 and eat the perplexity loss, Google ran the last leg of training with quantization in the loop, so the model learned to be int4-shaped.

The 27B QAT checkpoint weighs in around 14GB on disk, down from ~54GB for bf16, which is the difference between "needs a datacenter card" and "runs on a 4090 or an M3 Max with headroom for context." The 9B drops to roughly 6GB, and the 2B fits comfortably under 2GB — small enough to ship inside a desktop app installer without anyone noticing. Google published GGUF, MLX, and standard safetensors variants on day one, which is the part that matters for anyone who's spent a weekend hand-converting checkpoints for llama.cpp.

The benchmark deltas Google reports are small: a fraction of a point on MMLU, single-digit-percent regressions on the harder reasoning suites, and near-identical scores on instruction-following evals. The interesting number isn't the absolute score — it's that the gap between bf16 and int4 is now smaller than the gap between Gemma 4 and the model one notch below it in the family.

Why it matters

Post-training quantization has been the default for two years because it's cheap: take the released bf16 weights, run a calibration pass, ship int4. The cost is paid in silent quality regressions that nobody benchmarks consistently — the Reddit threads where people say "int4 feels dumber" are real, and the GPTQ/AWQ papers quantify it at 1–5 points on hard benchmarks depending on the model and the method.

QAT pays the cost upfront: you simulate int4 rounding during the forward pass in training, let gradients flow through a straight-through estimator, and the optimizer learns weights that round well. It's more expensive to train but the artifact is the deployment artifact — there's no "original" the int4 version is a degraded copy of, because the int4 version is the original. Apple has been doing this for CoreML models for years; Meta shipped QAT Llama 3.2 variants for the on-device 1B/3B; Microsoft's Phi line uses it. Google going QAT-first for the open Gemma drops is the signal that this is now table stakes, not a research curiosity.

The community reaction on HN (310 points, ~200 comments) split predictably. The llama.cpp crowd is testing whether the GGUF variants actually beat existing Q4_K_M quants of the bf16 release on the same hardware — early reports say yes by 1–2 points on lm-eval-harness, which is the right comparison. The skeptics are pointing out that QAT-int4 is still int4, and the genuinely useful low-bit work (1.58-bit BitNet, 2-bit AQLM with 2.5GB 70B models) is where real density gains live. Both are right. QAT-int4 isn't a frontier-pushing technique; it's a quality-of-life upgrade that closes the PTQ tax. The frontier is somewhere south of 2 bits.

There's also a quieter story about distribution. Google is using QAT to make the 27B usable on hardware that Llama 3.1 70B can't touch, which lets Gemma compete on the "best model that fits" axis instead of the "best model overall" axis where it loses to Llama and Qwen. A 27B that fits in 14GB and runs at 30+ tok/s on a 4090 is a different product than a 70B that needs two A6000s. For local-first workflows — coding assistants, RAG over private docs, agent loops where you don't want roundtrips to an API — the fit-on-one-GPU bracket is the only bracket that matters.

What this means for your stack

If you're running Gemma 4 in production with a PTQ int4 conversion (AWQ, GPTQ, bitsandbytes nf4), swap to the QAT checkpoint and re-run your evals. The migration is one wget and a config change. If the QAT variant doesn't beat your PTQ variant on your domain evals, that's a real finding — file an issue, because Google is going to want to know.

If you've been holding off on local Gemma because the 27B didn't fit on your dev box and the 9B wasn't smart enough, the 27B QAT is the configuration to test. The practical workflow: pull the GGUF from Hugging Face, run it under llama.cpp or LM Studio, point your existing OpenAI-compatible client at the localhost endpoint, and measure tokens/sec and quality on your actual prompts before committing. Don't trust the MMLU numbers; trust your eval set.

If you're building a desktop app and shipping a model inside it, the 2B QAT at <2GB is small enough to bundle without a download-on-first-launch UX. That's a real shift — "the model ships with the app" was a 1B-class feature six months ago. The 2B class is now there for tasks where the 1B couldn't quite hold instruction-following.

Looking ahead

The interesting question isn't whether QAT-int4 wins — it does, marginally. The question is whether the next Gemma release skips bf16 entirely and ships int4 as the only weights, with the bf16 master kept internal. That's the logical endpoint: if QAT produces deployment-ready weights, the bf16 checkpoint is just a development artifact, and there's no reason to publish it. Meta and Google are both walking toward that, and the first vendor to ship "int4 is the model" with a straight face will reset expectations for everyone else.

Gemma 4 QAT: int4 weights without the perplexity tax

// tldr

// viewpoints

// deep dive

What happened

Why it matters

What this means for your stack

Looking ahead

// read from source

Gemma 4 QAT models: Optimizing compression for mobile and laptop efficiency

Gemma 4 QAT: int4 weights without the perplexity tax

// tldr

// viewpoints

// deep dive

What happened

Why it matters

What this means for your stack

Looking ahead

// read from source

Gemma 4 QAT models: Optimizing compression for mobile and laptop efficiency

// share this