compared to gaunernst/gemma-3-27b-it-int4-awq

by azidanit - opened Apr 21

Discussion

azidanit

Apr 21

compared to gaunernst/gemma-3-27b-it-int4-awq with this quantized model, which one you prefer and recommend?

gaunernst

Owner Apr 21

I would recommend gaunernst/gemma-3-27b-it-qat-compressed-tensors. Let me explain the difference

gaunernst/gemma-3-27b-it-int4-awq: this was converted from Flax https://www.kaggle.com/models/google/gemma-3/flax. The Flax checkpoint has INT4 embedding, but due to AutoAWQ limitation, I can't use INT4 embedding in my converted checkpoint
gaunernst/gemma-3-27b-it-qat-compressed-tensors: this was converted from GGUF https://huggingface.co/google/gemma-3-27b-it-qat-q4_0-gguf. The embedding is not quantized (FP16 in GGUF, converted to BF16 in my checkpoint)

I want to emphasize again that these 2 checkpoints are different, not only in their embeddings, but also other layers. Google never really explains why. Accuracy-wise, the latter checkpoint is better (gaunernst/gemma-3-27b-it-qat-compressed-tensors). Hence, there is no reason to use the former one (gaunernst/gemma-3-27b-it-int4-awq), since you can't take advantage of INT4 embedding anyway. I keep the former around since it looks like some people are using it, but you really should use the latter.

(Btw, Google blogpost is misleading https://developers.googleblog.com/en/gemma-3-quantized-aware-trained-state-of-the-art-ai-to-consumer-gpus/. Their VRAM usage graph is based on the version with INT4 embedding, but the actual released GGUFs doesn't have INT4 embedding. Won't matter much for 27B, but matters a lot for 1B - It's 0.5GB vs 1GB difference)

hfmon

Apr 28

thanks for the insights!
are you planning to do another release based on the unquantized QAT weights? re-quantizing an already quantized model seems a bit flawed. unfortunately I don't have the skills or time (yet) to do it myself :(

gaunernst

Owner Apr 28

@hfmon wdym? This is not re-quantizing. I just convert the GGUF to compressed-tensors format so it can be used with vLLM/SGLang. There is no extra quantization happening here.

Why do you need a quantized version of "the unquantized QAT weights"? That would be the QAT version, right?

hfmon

Apr 28

ok you're right I was somewhat fuzzy in my language. The models where trained quantization-aware (i.e. prepared for later quantization), but come also as unquantized (b)f16.
then, google already did some work and quantized them to e.g. Q4 GGUF. this leads to losses due to the GGUF-quantization. now the GGUF->AWQ/CT conversion may not lead to new losses (I think I misinterpreted that part, but not sure, since you're doing another round of Calibration, too?) (except when using Flax?), but are based on GGUF.
However, one could also go straight from Unquantized -> AWQ/CT, which comes with its own quantization algorithm (and calibration) that leads to a different loss then GGUF-quantization. And according to some benchmarks they might even sometimes be favorable, which is why my intuition was to prefer those.

Maybe I've got it wrong, idk.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment