compared to gaunernst/gemma-3-27b-it-int4-awq
Comparing gaunernst/gemma-3-27b-it-int4-awq with this quantized model, which one do you prefer and recommend?
I would recommend gaunernst/gemma-3-27b-it-qat-compressed-tensors. Let me explain the difference.
- gaunernst/gemma-3-27b-it-int4-awq: this was converted from Flax (https://www.kaggle.com/models/google/gemma-3/flax). The Flax checkpoint has an INT4 embedding, but due to an AutoAWQ limitation, I can't use the INT4 embedding in my converted checkpoint.
- gaunernst/gemma-3-27b-it-qat-compressed-tensors: this was converted from GGUF (https://huggingface.co/google/gemma-3-27b-it-qat-q4_0-gguf). The embedding is not quantized (FP16 in GGUF, converted to BF16 in my checkpoint).
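If you want to check this yourself, here is a minimal sketch (not something shipped with the repo) that pulls the checkpoint's config.json and prints its quantization_config. The exact key names depend on the compressed-tensors version, so treat them as assumptions:

```python
# Sketch: inspect which modules the compressed-tensors checkpoint quantizes.
# Assumes the repo ships a standard config.json with a "quantization_config"
# entry; key names (e.g. an "ignore" list for unquantized modules) may vary.
import json
from huggingface_hub import hf_hub_download

config_path = hf_hub_download(
    repo_id="gaunernst/gemma-3-27b-it-qat-compressed-tensors",
    filename="config.json",
)
with open(config_path) as f:
    config = json.load(f)

# Embedding/lm_head modules kept in BF16 would show up outside the quantized
# groups (typically in an "ignore" list) rather than with a 4-bit scheme.
print(json.dumps(config.get("quantization_config", {}), indent=2))
```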
I want to emphasize again that these two checkpoints are different, not only in their embeddings but also in other layers. Google never really explains why. Accuracy-wise, the latter checkpoint (gaunernst/gemma-3-27b-it-qat-compressed-tensors) is better. Hence, there is no reason to use the former (gaunernst/gemma-3-27b-it-int4-awq), since you can't take advantage of the INT4 embedding anyway. I keep the former around since it looks like some people are using it, but you really should use the latter.
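For reference, a minimal sketch of loading the recommended checkpoint with vLLM, assuming your vLLM build is recent enough to support Gemma 3 and compressed-tensors:

```python
# Sketch: serve the compressed-tensors checkpoint with vLLM, which picks up
# the quantization config from the repo automatically.
from vllm import LLM, SamplingParams

llm = LLM(model="gaunernst/gemma-3-27b-it-qat-compressed-tensors")
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain quantization-aware training in one paragraph."], params)
print(outputs[0].outputs[0].text)
```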
(Btw, Google's blog post is misleading: https://developers.googleblog.com/en/gemma-3-quantized-aware-trained-state-of-the-art-ai-to-consumer-gpus/. Their VRAM usage graph is based on the version with the INT4 embedding, but the actually released GGUFs don't have the INT4 embedding. It won't matter much for 27B, but it matters a lot for 1B - it's a 0.5 GB vs 1 GB difference.)
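Rough back-of-envelope math on why the embedding dtype matters so much for the 1B model. These are my own approximate numbers (roughly a 262k vocab and hidden size 1152), not figures from the blog post:

```python
# Sketch: approximate embedding-table size for Gemma 3 1B at different dtypes.
# vocab_size and hidden_size are approximations, and INT4 scales/zero-points
# are ignored, so take the results as ballpark figures only.
vocab_size = 262_144
hidden_size = 1152

bf16_bytes = vocab_size * hidden_size * 2    # 2 bytes per BF16 value
int4_bytes = vocab_size * hidden_size // 2   # 0.5 bytes per INT4 value

print(f"BF16 embedding: ~{bf16_bytes / 1e9:.2f} GB")  # ~0.60 GB
print(f"INT4 embedding: ~{int4_bytes / 1e9:.2f} GB")  # ~0.15 GB
```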
thanks for the insights!
Are you planning to do another release based on the unquantized QAT weights? Re-quantizing an already quantized model seems a bit flawed. Unfortunately I don't have the skills or time (yet) to do it myself :(
OK, you're right, I was somewhat fuzzy in my language. The models were trained quantization-aware (i.e. prepared for later quantization), but are also released as unquantized (b)f16.
Then, Google already did some work and quantized them to e.g. Q4 GGUF. This leads to losses due to the GGUF quantization. Now, the GGUF -> AWQ/CT conversion may not introduce new losses (I think I misinterpreted that part, but I'm not sure, since you're doing another round of calibration too?) (except when using Flax?), but it is still based on the GGUF.
However, one could also go straight from unquantized -> AWQ/CT, which comes with its own quantization algorithm (and calibration) and leads to a different loss than the GGUF quantization. According to some benchmarks, those might even sometimes be favorable, which is why my intuition was to prefer them.
Maybe I've got it wrong, idk.
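In case it helps, this is roughly what I had in mind: an untested sketch of going straight from the unquantized (b)f16 QAT weights to AWQ with AutoAWQ. The model path is a placeholder for wherever Google's unquantized QAT release lives, and I'm not sure AutoAWQ handles Gemma 3 without patches:

```python
# Sketch: quantize the unquantized QAT weights directly with AutoAWQ instead
# of converting from the Q4 GGUF. The model path is a placeholder and the
# calibration defaults are AutoAWQ's own; none of this is validated.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "path/to/unquantized-gemma-3-27b-it-qat"  # placeholder, not a real repo id
quant_path = "gemma-3-27b-it-awq-from-bf16"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# AWQ runs its own calibration pass over its default calibration dataset here.
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```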