UD versions for the Q5, Q6 and Q8 quants

#11
by nobita3921 - opened

Why does this model not have UD versions of the Q5, Q6 and Q8 quants like the Gemma-3 models? And what is the difference between Q8_0 and UD_Q8_K_XL?

Unsloth AI org

Hi there, good suggestion! There's no particular reason; we simply forgot, and it was time-consuming. We'll do it, thanks to your suggestion.

UD Q8 is better than normal Q8

By the way, it appears the context length is not set correctly.

"40960"

Should be "32768" no?

(Never mind, I did some research, and that's +8K for a typical prompt. So all good.)
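For what it's worth, the numbers do add up, assuming the extra headroom really is 8K tokens for a typical prompt as noted above:

```python
# Qwen3's native context window plus ~8K of headroom for a typical prompt.
native_ctx = 32768        # 32K native context
prompt_headroom = 8192    # ~8K reserved for a typical prompt (assumption)
print(native_ctx + prompt_headroom)  # 40960
```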

Yeah, I checked your UD Q8 and the normal Q8. In your UD Q8 you use BF16 for some weight matrices, such as the embedding, Q, K, up, down, and gate matrices, while the normal Q8_0 just uses Q8_0 for them. That's why your UD Q8 is larger but better than the normal one.
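To put rough numbers on that: Q8_0 stores blocks of 32 weights in 34 bytes (8.5 bits per weight), while BF16 takes 16 bits per weight, so upcasting even a small fraction of tensors noticeably grows the file. A back-of-the-envelope sketch (the weight counts below are illustrative placeholders, not the real Qwen3 tensor shapes):

```python
# Q8_0: 32 weights per 34-byte block => 8.5 bits/weight; BF16: 16 bits/weight.
Q8_0_BITS = 34 * 8 / 32   # 8.5
BF16_BITS = 16.0

def size_gb(n_weights: float, bits_per_weight: float) -> float:
    """File-size contribution of n_weights stored at the given precision."""
    return n_weights * bits_per_weight / 8 / 1e9

total_weights = 30e9      # ~30B total weights (illustrative)
kept_in_bf16 = 1.5e9      # weights a UD quant keeps in BF16 (illustrative)

plain_q8 = size_gb(total_weights, Q8_0_BITS)
ud_q8 = size_gb(total_weights - kept_in_bf16, Q8_0_BITS) + size_gb(kept_in_bf16, BF16_BITS)
print(f"Q8_0: {plain_q8:.1f} GB, mixed Q8_0/BF16: {ud_q8:.1f} GB")
```

With these made-up counts the mixed file comes out a couple of GB larger, which matches the observation that UD_Q8_K_XL is bigger than plain Q8_0.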

Yeah, but the accuracy gains are negligible at best, and having BF16 weights also slows down inference, as most consumer GPUs aren't designed for crunching them.

Then you should use UD Q6, since it would use Q8 / BF16 for some weights, so it would be very close to normal Q8 in quality and still smaller.

https://huggingface.co/posts/wolfram/819510719695955?image-viewer=819510719695955-BF854EB8D3AE3E1937FDE5CDB709F392C964BE24
The performance of Qwen3-30B-A3B-UD-Q4_K_XL.GGUF in this benchmark is so impressive. It is even better than DeepSeek-V3-0324 in full precision. It seems the performance of UD-Q4_K_XL is very close to normal Q8_0.

I'm wondering why MLX quants yield worse quality. I've heard people talking about this...

Unsloth AI org

We've uploaded them all now

Also with a new improved calibration dataset :)

CC: @balieiro @thinkingmachines @supernovastar @dsafdf @PonderosaSharon @indrazor @eepos @CHNtentes @Dampfinchen @nobita3921 @nhbcizelexzbmnfoke @kaupane

Great work! Thanks @shimmyshimmer.

nobita3921 changed discussion status to closed
