compared to gaunernst/gemma-3-27b-it-int4-awq

#1
by azidanit - opened

Comparing gaunernst/gemma-3-27b-it-int4-awq with this quantized model, which one do you prefer and recommend?

I would recommend gaunernst/gemma-3-27b-it-qat-compressed-tensors. Let me explain the difference.

I want to emphasize again that these two checkpoints are different, not only in their embeddings but also in other layers. Google never really explained why. Accuracy-wise, the latter checkpoint (gaunernst/gemma-3-27b-it-qat-compressed-tensors) is better. Hence, there is no reason to use the former (gaunernst/gemma-3-27b-it-int4-awq), since you can't take advantage of the INT4 embedding anyway. I keep the former around since it looks like some people are using it, but you really should use the latter.

(Btw, Google's blog post is misleading: https://developers.googleblog.com/en/gemma-3-quantized-aware-trained-state-of-the-art-ai-to-consumer-gpus/. Their VRAM usage graph is based on the version with the INT4 embedding, but the actually released GGUFs don't have the INT4 embedding. It won't matter much for 27B, but it matters a lot for 1B: it's a 0.5 GB vs 1 GB difference.)
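
For intuition, here is a back-of-the-envelope sketch of where that gap comes from. The vocab and hidden sizes below are my rough assumptions about the 1B config, not official figures:

```python
# Rough estimate of the embedding table size for the 1B model.
# Assumed figures: ~262k tokenizer vocab, hidden size 1152 for the 1B model.
vocab_size = 262_144
hidden_size = 1152

embed_params = vocab_size * hidden_size          # ~302M parameters in the embedding alone

bf16_gb = embed_params * 2 / 1e9                 # 16-bit embedding: ~0.60 GB
int4_gb = embed_params * 0.5 / 1e9               # 4-bit embedding:  ~0.15 GB

print(f"bf16 embedding: {bf16_gb:.2f} GB")
print(f"int4 embedding: {int4_gb:.2f} GB")
print(f"saved:          {bf16_gb - int4_gb:.2f} GB")  # roughly the ~0.5 GB gap mentioned above
```

The embedding is a large fraction of a 1B model, which is why keeping it in 16-bit roughly doubles the footprint; for 27B the same table is a much smaller share of the total.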

Thanks for the insights!
Are you planning to do another release based on the unquantized QAT weights? Re-quantizing an already quantized model seems a bit flawed. Unfortunately, I don't have the skills or time (yet) to do it myself :(

@hfmon What do you mean? This is not re-quantizing. I just converted the GGUF to the compressed-tensors format so it can be used with vLLM/SGLang. There is no extra quantization happening here.
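
For reference, this is roughly how the compressed-tensors checkpoint would be served with vLLM. A minimal sketch, assuming a vLLM build with Gemma 3 and compressed-tensors support; the sampling settings are arbitrary:

```python
from vllm import LLM, SamplingParams

# Load the compressed-tensors checkpoint directly; vLLM reads the
# quantization config from the model repo.
llm = LLM(model="gaunernst/gemma-3-27b-it-qat-compressed-tensors")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain quantization-aware training in one sentence."], params)
print(outputs[0].outputs[0].text)
```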

Why do you need a quantized version of "the unquantized QAT weights"? That would be the QAT version, right?

OK, you're right, I was somewhat fuzzy in my language. The models were trained quantization-aware (i.e. prepared for later quantization), but are also released as unquantized (b)f16.
Then, Google already did some work and quantized them to, e.g., Q4 GGUF. This leads to losses due to the GGUF quantization. Now, the GGUF -> AWQ/CT conversion may not introduce new losses (I think I misinterpreted that part, but I'm not sure, since you're doing another round of calibration too?) (except when using Flax?), but it is still based on the GGUF.
However, one could also go straight from unquantized -> AWQ/CT, which comes with its own quantization algorithm (and calibration) that leads to a different loss than the GGUF quantization. According to some benchmarks, those might even sometimes be favorable, which is why my intuition was to prefer them.
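
For illustration, the unquantized -> AWQ path would look roughly like this with AutoAWQ. The model IDs and quant config are my assumptions, not a recipe from this repo:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Assumed ID of an unquantized QAT checkpoint; replace with the actual repo.
model_path = "google/gemma-3-27b-it-qat-q4_0-unquantized"
quant_path = "gemma-3-27b-it-qat-awq"

# Typical 4-bit AWQ settings; group size and kernel version are arbitrary choices here.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# This is the extra calibration/quantization pass I was asking about:
# AWQ runs its own calibration data through the model to choose scales.
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```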

Maybe I've got it wrong, idk.
