Other imatrix quants (IQ3_XXS)?
Hello,
Thank you for your excellent models.
Could we get an IQ3_XXS quant of the QAT model?
This should make it easier to squeeze the model into 16GB VRAM.
I am also waiting for the IQ3_XXS quant, as it is similar in size to Q2_K_L but stronger.
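For a rough sense of fit (back-of-envelope only, actual file sizes vary a bit with the embedding and output tensor types): IQ3_XXS is about 3.06 bits per weight, so 27B parameters work out to roughly 27e9 × 3.06 / 8 ≈ 10.3 GB of weights, which leaves a few GB of a 16GB card for the KV cache and compute buffers.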
Assuming you are using mainline llama.cpp, you might be able to squeeze it into 16GB VRAM by offloading just the attention tensors and KV cache to the CPU while leaving everything else on the GPU, e.g.:
-ngl 99 -ot attn=CPU -nkvo
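For concreteness, a minimal sketch of a full invocation, assuming a recent mainline llama.cpp build; the GGUF filename, context size, and prompt are placeholders:

```bash
# Minimal sketch; point -m at whichever quant you actually downloaded.
#   -ngl 99       offload all model layers to the GPU
#   -ot attn=CPU  --override-tensor: keep tensors matching "attn" in system RAM
#   -nkvo         --no-kv-offload: keep the KV cache on the CPU
./llama-cli -m ./gemma-3-27b-it-qat-IQ3_XXS.gguf \
  -ngl 99 -ot attn=CPU -nkvo \
  -c 8192 -p "Why is the sky blue?"
```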
It works with ubergarm/gemma-3-27b-it-qat-GGUF, but that quant only works with the ik_llama.cpp fork, not mainline.
I'd love to hear if you get a good command line going for mainline llama.cpp with one of bartowski's quants, as more people could benefit from that!
I'll join the request, @bartowski, for the IQ3_* quants, which are super useful for 16GB VRAM.
This QAT IQ4_XS has notably sharper fact recall than the IQ4_XS of the original Gemma 3 27B. Answers are richer in detail, with fewer made-up facts.
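If anyone wants to put rough numbers on that kind of comparison, llama.cpp ships a perplexity tool; perplexity is a blunt proxy for fact recall, but it's an easy first check. A sketch, where the model filenames are placeholders and wiki.test.raw stands in for whatever evaluation text you have on hand:

```bash
# Rough sketch: compare perplexity of two quants on the same text.
# Lower perplexity generally means less quality loss from quantization.
./llama-perplexity -m ./gemma-3-27b-it-qat-IQ4_XS.gguf -f wiki.test.raw -ngl 99
./llama-perplexity -m ./gemma-3-27b-it-IQ4_XS.gguf -f wiki.test.raw -ngl 99
```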
Fine fine I'll make the rest haha..
Didn't seem like there'd be a point, since QAT shouldn't (in theory) be better across the board, but maybe it is 🤷‍♂️
Let me know if you find any improvements; while I'd be surprised, it also wouldn't be impossible!