Other imatrix quants (IQ3_XXS)?
Hello,
Thank you for your excellent models.
Could we get an IQ3_XXS quant of the QAT model?
This should make it easier to squeeze the model into 16GB VRAM.
I am also waiting for the IQ3_XXS quant, as it is similar in size to Q2_K_L but stronger.
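For a rough sense of fit (back-of-envelope only, actual file sizes vary a bit with the embedding and output tensor types): IQ3_XXS is about 3.06 bits per weight, so 27B parameters work out to roughly 27e9 × 3.06 / 8 ≈ 10.3 GB of weights, which leaves a few GB of a 16GB card for the KV cache and compute buffers.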
Assuming you are using mainline llama.cpp, you might be able to squeeze it into 16GB VRAM by offloading just the attention tensors and KV cache to the CPU while leaving everything else on the GPU, e.g.:
-ngl 99 -ot attn=CPU -nkvo
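For concreteness, a minimal sketch of a full invocation, assuming a recent mainline llama.cpp build; the GGUF filename, context size, and prompt are placeholders:

```bash
# Minimal sketch; point -m at whichever quant you actually downloaded.
#   -ngl 99       offload all model layers to the GPU
#   -ot attn=CPU  --override-tensor: keep tensors matching "attn" in system RAM
#   -nkvo         --no-kv-offload: keep the KV cache on the CPU
./llama-cli -m ./gemma-3-27b-it-qat-IQ3_XXS.gguf \
  -ngl 99 -ot attn=CPU -nkvo \
  -c 8192 -p "Why is the sky blue?"
```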
It works with ubergarm/gemma-3-27b-it-qat-GGUF, but that quant only works with the ik_llama.cpp fork, not mainline.
I'd love to hear if you get a good command line going for mainline llama.cpp with one of bartowski's quants, as more people could benefit from that!
I'll join the request, @bartowski, for the IQ3_* quants, which are super useful for 16GB VRAM.
This QAT IQ4_XS has notably sharper fact recall than the IQ4_XS of the original Gemma 3 27B. Answers are richer in detail, with fewer made-up facts.
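If anyone wants to put rough numbers on that kind of comparison, llama.cpp ships a perplexity tool; perplexity is a blunt proxy for fact recall, but it's an easy first check. A sketch, where the model filenames are placeholders and wiki.test.raw stands in for whatever evaluation text you have on hand:

```bash
# Rough sketch: compare perplexity of two quants on the same text.
# Lower perplexity generally means less quality loss from quantization.
./llama-perplexity -m ./gemma-3-27b-it-qat-IQ4_XS.gguf -f wiki.test.raw -ngl 99
./llama-perplexity -m ./gemma-3-27b-it-IQ4_XS.gguf -f wiki.test.raw -ngl 99
```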
Fine fine I'll make the rest haha..
Didn't seem like there'd be a point, since QAT shouldn't (in theory) be better across the board, but maybe it is 🤷‍♂️
Let me know if you find any improvements; while I'd be surprised, it also wouldn't be impossible!