Add comparison with Q_K quants

#1 by AliNT99 - opened

Thank you for your awesome work!

Is it possible for you to also benchmark some Unsloth or bartowski Q_K quants to see how these IQ_K quants compare in terms of PPL/size?

And also, what is the exact command for reproducing the PPL results?

bartowski may welcome the comparison, but Unsloth doesn't seem to like comparisons (judging from their comments; maybe I misinterpreted).

So bartowski and I work fairly closely together and share information. He sticks to the mainline llama.cpp quantizations and I stick to ik_llama.cpp quantizations. Unsloth is a good bunch with interesting approaches as well, though it can be difficult to know their exact methodologies in my experience.

We're all working together to push the Pareto front and bring high quality LLMs to home rig users. My advice is to find a quant that barely fits into your hardware RAM+VRAM for the desired context length and go with that. All the quants are pretty good honestly, as shown in this benchmarking reddit post: https://www.reddit.com/r/LocalLLaMA/comments/1khwxal/the_great_quant_wars_of_2025/
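As a rough sanity check for what fits, a quant's file size is roughly parameter count × bits-per-weight / 8; a quick sketch (treating Qwen3-30B-A3B as ~30.5B total parameters, which is my assumption here):

```bash
# rough GGUF size estimate: params * BPW / 8 bytes, converted to GiB
# e.g. ~30.5B params at 4.370 BPW (the IQ4_KSS figure quoted later in this thread)
echo "30500000000 * 4.370 / 8 / (1024^3)" | bc -l   # ~15.5 GiB
```

Then compare that against whatever RAM+VRAM you have left after the KV cache at your target context length.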

fwiw I did just compare the new Intel "auto-round" quant and it doesn't look like anything special compared to ik_llama.cpp SOTA quants:

[chart: ppl-Qwen3-30B-A3B-Instruct-2507.png — PPL vs size comparison for Qwen3-30B-A3B-Instruct-2507 quants]

Thanks for another great quant!

Do you have access to multiple big systems? My single machine can no longer keep up with the releases: they come out faster than I can quantize them.

@anikifoss

Yes, I have access to a couple of large systems. One has a 24-core Threadripper Pro with 256GB RAM and 2x RTX A6000s (the older non-Pro, non-Blackwell ones with 48GB VRAM each). The other is a huge dual-socket AMD EPYC with almost 1.5TB RAM total, where I just crunch everything CPU-only haha...

Right, and omg they just released Qwen3-30B-A3B-Thinking now, so my day is shot lol, let alone if we get GLM-4.5 support in llama.cpp eventually haha

> Yes, I have access to a couple of large systems. [...]

Those are nice! I need to copy the SSD RAID setup you posted on Level1Tech to keep up with all the releases.

> Right, and omg they just released Qwen3-30B-A3B-Thinking now, so my day is shot lol [...]

Yeah, GLM-4.5 looks really promising, can't wait to start playing with it!

Just tested Unsloth's Q4_K_M and UD-Q4_K_XL against IQ3_K and IQ4_KSS; here are the results:

`-f wiki.test.raw -c 512 --seed 1337 -b 4096 -ub 1024 -ctk q8_0 -ctv q8_0`

| Quant      | Size       | BPW   | Final PPL estimate   |
|------------|------------|-------|----------------------|
| UD-Q4_K_XL | 16.470 GiB | 4.634 | 7.4523 +/- 0.05223   |
| Q4_K_M     | 17.277 GiB | 4.861 | 7.4360 +/- 0.05219   |
| IQ4_KSS    | 15.531 GiB | 4.370 | 7.4102 +/- 0.05198   |
| IQ3_K      | 14.509 GiB | 4.082 | 7.4818 +/- 0.05250   |
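For anyone wanting to reproduce these numbers, the flags above plug into the perplexity tool roughly like this (a sketch only: the binary path, model filename, and the -fa/-ngl additions are my assumptions, so adjust for your own llama.cpp / ik_llama.cpp build):

```bash
# hypothetical paths/filenames; the -f/-c/--seed/-b/-ub/-ctk/-ctv flags
# match the run above, while -fa (flash attention, typically required for
# a quantized q8_0 KV cache) and -ngl 99 (full GPU offload) are assumed
./build/bin/llama-perplexity \
    -m Qwen3-30B-A3B-Instruct-2507-IQ4_KSS.gguf \
    -f wiki.test.raw \
    -c 512 --seed 1337 \
    -b 4096 -ub 1024 \
    -ctk q8_0 -ctv q8_0 \
    -fa -ngl 99
```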

@AliNT99

Thanks! So let's give Unsloth the benefit of the doubt: if you test with an unquantized KV cache, e.g. -ctk f16 -ctv f16 like I do for my reported numbers, my IQ4_KSS still looks really good. I'm liking that sweet spot, which still allows full offload on a 24GB VRAM GPU like my 3090TI with almost 40k of q8_0 context, or more at q6_0 hah..
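i.e. something along these lines, rerunning the identical measurement with both KV-cache settings (same assumed paths and extra flags as the sketch above):

```bash
# measure PPL with f16 vs q8_0 KV cache to take cache quantization
# out of the comparison; paths and -fa/-ngl are placeholders/assumptions
for kv in f16 q8_0; do
    ./build/bin/llama-perplexity \
        -m Qwen3-30B-A3B-Instruct-2507-IQ4_KSS.gguf \
        -f wiki.test.raw \
        -c 512 --seed 1337 -b 4096 -ub 1024 \
        -ctk "$kv" -ctv "$kv" -fa -ngl 99
done
```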

Here are some more details on how I've been measuring my perplexity: https://huggingface.co/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/discussions/6#688a694db111f4c7d34226df

Interestingly, measuring PPL on my CPU-only backend seems to give slightly higher values than on CUDA, anecdotally, since your numbers are a bit smaller than mine despite the q8_0 KV cache!!
