Request! Eeek
Unsloth quants (Dynamic quants, UD for short), particularly for Qwen3 30B A3B, seem to be much better by everyone's reckoning (including mine). But I failed pretty hard at compiling the quantize.exe from their llama.cpp fork.
Imatrix doesn't run on Vulkan, but it seems like this does. No decoding involved, I think. So it's possibly accessible to more people than imatrix, and in the metrics/benches it also appears to be better?
It would be LOVELY to have stuff like the abliterated/uncensored versions of Qwen3's MoEs in all the Unsloth variants, at least. I know it's an ask, but if you have the know-how to do this more easily than I do, it would be very much appreciated!
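(For reference, on mainline llama.cpp the quantize tool builds roughly like this; the Unsloth fork may differ, and older trees name the target plain `quantize`:)
```sh
# standard CMake build of mainline llama.cpp; the Unsloth fork may differ
cmake -B build
cmake --build build --config Release --target llama-quantize
# the binary lands in build/bin/ (build/bin/Release/ with MSVC)
```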
Nothing stops you from using our imatrix for any quants you'd like to create. We upload our imatrix file, and you can use it to generate any quant mixture you like.
We carefully selected our quants and are unlikely to change them anytime soon. Every few weeks someone new comes along claiming to have better quants, showing misleading evidence, just for them to turn out equal to or worse than our quants on closer investigation.
In any case, once https://github.com/ggml-org/llama.cpp/pull/12727 gets merged, there likely won't be any real difference anymore anyway. @mradermacher In case you're wondering, this PR is also the only reason I'm currently still holding off on DeepSeek-based models.
In any case, the correct path is to get changes into whatever upstream we use (which is currently the "official" llama.cpp). This way any changes get vetted. So if you want specific quant types, make sure they are in llama.cpp, and then we will certainly consider it - we already provide almost all quant types llama.cpp supports.
@nicoboss hmm, the way that patch seems to be going is for llama.cpp to throw its hands in the air and say: just specify your favourite mix on the command line, we do nothing?
Interesting. I wasn't expecting the skepticism, but fair enough. I can't run imatrix, as I have to use Vulkan, which doesn't support it. I can make imatrix files, but I don't get any benefit from using them (I think because there's some kind of decode at inference time? Not sure). I usually use static quants. I just noticed what appeared to be a significant intelligence difference on the Qwen3 MoE between their 'dynamic' quant and the other quantizations. No worries in any case - hopefully llama.cpp will merge whatever parts of their method work, and that's useful to learn about. Thanks.
Feel free to close this :)
You don't need to run imatrix. You can just download https://huggingface.co/mradermacher/gemma-2-abliterated-Ifable-9B-untied-i1-GGUF/blob/main/imatrix.dat and use it to run llama-quantize with any quant mixture you like. You don't need any GPU for that, so Vulkan doesn't matter. Besides that, llama-imatrix works perfectly fine on Vulkan for me. If you mean that running imatrix quants doesn't work on Vulkan, you're very mistaken. I run them on my Intel Arc A770 GPUs all the time.
imatrix quants are also no different from static quants; they just use a different tensor type mix.
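For example, something like this (just a sketch; the file names are placeholders and Q4_K_M is only an example type):
```sh
# quantize a full-precision GGUF using the downloaded importance matrix
# (model-f16.gguf and the output name are placeholders)
./llama-quantize --imatrix imatrix.dat model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```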
Thanks for your reply. Back when imatrix first came out, I was told, I think by the fellow who runs the llama.cpp repo, that it didn't work with Vulkan. Or at least by someone sort of official. And it was failing on my PC. But I think it might have been that guy, whatever his name is, I forget. Clearly it does work now (I tried it), and that's handy to know!
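For anyone else trying this, the invocation looked roughly like (a sketch; the model and calibration file names are placeholders):
```sh
# compute an importance matrix from a calibration text file,
# offloading layers to the (Vulkan) GPU with -ngl
./llama-imatrix -m model-f16.gguf -f calibration.txt -o imatrix.dat -ngl 99
```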
When you're at the lower end of GPU-poor, you've got to squeeze out every bit of performance you can.