UD versions for the Q5, Q6 and Q8 quants
Why doesn't this model have UD versions of the Q5, Q6 and Q8 quants like the Gemma-3 models do? And what is the difference between Q8_0 and UD_Q8_K_XL?
Hi there, good suggestion! There's no particular reason why; we simply forgot to do it, and it was time-consuming. We'll do it, thanks to your suggestion.
UD Q8 is better than normal Q8
By the way, it appears the context length is not set correctly.
It says "40960". Should be "32768", no?
(Never mind, did some research: that's the native 32768 plus 8192 for a typical prompt, which adds up to 40960. So all good.)
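If anyone else wants to double-check, the value is baked into the GGUF file's metadata and can be read with the `gguf` Python package. A minimal sketch, assuming `pip install gguf`; the file name is a placeholder, and the metadata key follows the "{arch}.context_length" pattern:

```python
# Sketch: read the context-length field from a GGUF file's metadata.
# Assumes the `gguf` pip package; the file name below is a placeholder.
from gguf import GGUFReader

reader = GGUFReader("Qwen3-30B-A3B-UD-Q8_K_XL.gguf")
for name, field in reader.fields.items():
    # The key is "{arch}.context_length"; arch depends on the model.
    if name.endswith(".context_length"):
        # GGUF stores scalar fields via index lists into `parts`.
        print(name, "=", field.parts[field.data[0]][0])
```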
Yeah, I checked your UD Q8 and normal Q8. In your UD Q8, you use BF16 for some weight matrices, such as the embedding, Q, K, up, down, and gate matrices. Meanwhile, the normal Q8_0 just uses Q8_0 for these matrices. So this is why your UD Q8 is larger but better than the normal one.
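For anyone who wants to reproduce this check, here's a minimal sketch that diffs per-tensor quantization types between the two files using the `gguf` Python package (the file names are placeholders):

```python
# Sketch: compare per-tensor quantization types between two GGUF files.
# Assumes `pip install gguf`; file names below are placeholders.
from gguf import GGUFReader

def tensor_types(path):
    reader = GGUFReader(path)
    # tensor.tensor_type is a GGMLQuantizationType enum (e.g. Q8_0, BF16)
    return {t.name: t.tensor_type.name for t in reader.tensors}

ud = tensor_types("Qwen3-30B-A3B-UD-Q8_K_XL.gguf")
plain = tensor_types("Qwen3-30B-A3B-Q8_0.gguf")

# Print only the tensors where the two quants differ.
for name in sorted(ud):
    if ud[name] != plain.get(name):
        print(f"{name}: UD={ud[name]}  Q8_0={plain.get(name)}")
```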
Yeah, but the accuracy gains are negligible at best, and bfloat16 weights also slow down inference, as most consumer GPUs aren't designed for crunching them.
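Whether BF16 is actually slow depends on the card: Ampere (compute capability 8.0) and newer GPUs have native bfloat16 support, while older consumer cards don't. A quick way to check your own GPU, as a minimal PyTorch sketch:

```python
# Sketch: check whether the local GPU has native bfloat16 support.
# Compute capability 8.0+ (Ampere and newer) supports BF16 natively.
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print("compute capability:", f"{major}.{minor}")
    print("native BF16 support:", torch.cuda.is_bf16_supported())
else:
    print("no CUDA device found")
```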
Then you should use UD Q6, since it would use Q8 / BF16 for some weights, so it would be very close to normal Q8 in quality and still smaller.
https://huggingface.co/posts/wolfram/819510719695955?image-viewer=819510719695955-BF854EB8D3AE3E1937FDE5CDB709F392C964BE24
The performance of Qwen3-30B-A3B-UD-Q4_K_XL.GGUF in this benchmark is so impressive. It is even better than DeepSeek-V3-0324 at full precision. It seems that the performance of UD-Q4_K_XL is very close to normal Q8_0.
Wondering why the MLX quants yield worse quality. I've heard people talking about this...
We've uploaded them all now!
Also with a new, improved calibration dataset :)
CC: @balieiro @thinkingmachines @supernovastar @dsafdf @PonderosaSharon @indrazor @eepos @CHNtentes @Dampfinchen @nobita3921 @nhbcizelexzbmnfoke @kaupane
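For anyone wondering how a calibration dataset feeds into the quants: llama.cpp computes an importance matrix over the calibration text and passes it to the quantizer, which then spends more precision on the weights that matter most. Below is a minimal sketch of that generic workflow driven from Python; the file names and paths are placeholders, and this is the stock llama.cpp flow, not necessarily Unsloth's exact pipeline:

```python
# Sketch of the generic llama.cpp imatrix workflow (not necessarily
# Unsloth's exact pipeline). Paths and file names are placeholders.
import subprocess

# 1) Compute an importance matrix over the calibration text.
subprocess.run([
    "./llama-imatrix",
    "-m", "model-F16.gguf",   # full-precision source model
    "-f", "calibration.txt",  # calibration dataset
    "-o", "imatrix.dat",      # output importance matrix
], check=True)

# 2) Quantize, weighting each tensor by the importance matrix.
subprocess.run([
    "./llama-quantize",
    "--imatrix", "imatrix.dat",
    "model-F16.gguf",
    "model-Q4_K_M.gguf",
    "Q4_K_M",
], check=True)
```

The choice of calibration text matters because the importance matrix is only as representative as the data it was computed on, which is presumably why an improved dataset helps.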
Great work! Thanks @shimmyshimmer.