UD versions for the Q5, Q6 and Q8 quants

#11
by nobita3921 - opened

Why does this model not have UD versions of the Q5, Q6 and Q8 quants like the Gemma-3 models? And what is the difference between Q8_0 and UD_Q8_K_XL?

Unsloth AI org

Hi there, good suggestion! There's no particular reason; we simply forgot, and it was time-consuming. We'll do it, thanks to your suggestion.

UD Q8 is better than normal Q8

By the way, it appears the context length is not set correctly.

"40960"

Should be "32768" no?

(Never mind, I did some research, and that's +8K for a typical prompt. So all good.)
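For what it's worth, the numbers do add up, assuming the extra headroom really is 8K tokens for a typical prompt as noted above:

```python
# Qwen3's native context window plus ~8K of headroom for a typical prompt.
native_ctx = 32768        # 32K native context
prompt_headroom = 8192    # ~8K reserved for a typical prompt (assumption)
print(native_ctx + prompt_headroom)  # 40960
```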

Yeah, I checked your UD Q8 and the normal Q8. In your UD Q8 you use BF16 for some weight matrices, such as the embedding, Q, K, up, down, and gate matrices, while the normal Q8_0 just uses Q8_0 for them. That's why your UD Q8 is larger but better than the normal one.
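To put rough numbers on that: Q8_0 stores blocks of 32 weights in 34 bytes (8.5 bits per weight), while BF16 takes 16 bits per weight, so upcasting even a small fraction of tensors noticeably grows the file. A back-of-the-envelope sketch (the weight counts below are illustrative placeholders, not the real Qwen3 tensor shapes):

```python
# Q8_0: 32 weights per 34-byte block => 8.5 bits/weight; BF16: 16 bits/weight.
Q8_0_BITS = 34 * 8 / 32   # 8.5
BF16_BITS = 16.0

def size_gb(n_weights: float, bits_per_weight: float) -> float:
    """File-size contribution of n_weights stored at the given precision."""
    return n_weights * bits_per_weight / 8 / 1e9

total_weights = 30e9      # ~30B total weights (illustrative)
kept_in_bf16 = 1.5e9      # weights a UD quant keeps in BF16 (illustrative)

plain_q8 = size_gb(total_weights, Q8_0_BITS)
ud_q8 = size_gb(total_weights - kept_in_bf16, Q8_0_BITS) + size_gb(kept_in_bf16, BF16_BITS)
print(f"Q8_0: {plain_q8:.1f} GB, mixed Q8_0/BF16: {ud_q8:.1f} GB")
```

With these made-up counts the mixed file comes out a couple of GB larger, which matches the observation that UD_Q8_K_XL is bigger than plain Q8_0.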

Yeah, but the accuracy gains are negligible at best, and having BF16 weights also slows down inference, as most consumer GPUs aren't designed for crunching them.

Then you should use UD Q6, since it would use Q8 / BF16 for some weights, so it would be very close to normal Q8 in quality and still smaller.

https://huggingface.co/posts/wolfram/819510719695955?image-viewer=819510719695955-BF854EB8D3AE3E1937FDE5CDB709F392C964BE24
The performance of Qwen3-30B-A3B-UD-Q4_K_XL.GGUF in this benchmark is so impressive. It is even better than DeepSeek-V3-0324 in full precision. It seems the performance of UD-Q4_K_XL is very close to normal Q8_0.

I'm wondering why MLX quants yield worse quality. I've heard people talking about this...

Unsloth AI org

We've uploaded them all now

Also with a new improved calibration dataset :)

CC: @balieiro @thinkingmachines @supernovastar @dsafdf @PonderosaSharon @indrazor @eepos @CHNtentes @Dampfinchen @nobita3921 @nhbcizelexzbmnfoke @kaupane

Great work! Thanks @shimmyshimmer.

nobita3921 changed discussion status to closed
