Any chance to re-do the quants for MLA?

by Panchovix - opened

Hi there, thanks for your work quantizing. I was wondering if it would be possible to re-do these quants with the latest mainline llama.cpp, as it now supports MLA, which reduces the VRAM usage of the KV cache dramatically; for example, at 16K context it drops from 80 GB to 4 GB.
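
For context, here is a rough back-of-the-envelope of where that saving comes from (a sketch assuming DeepSeek-V3's config.json values and an fp16 cache; actual llama.cpp buffer sizes will differ):

```python
# Rough KV-cache size estimate for DeepSeek-V3 at 16K context.
# Architecture values assumed from the model's config.json;
# cache precision assumed fp16 (2 bytes per element).
n_layers = 61
n_heads = 128
qk_nope_dim, qk_rope_dim, v_dim = 128, 64, 128
kv_lora_rank = 512          # MLA compressed latent size
n_ctx = 16 * 1024
bytes_per_elem = 2          # fp16

# Standard attention: full K and V per head, per layer, per token.
mha_per_token = n_layers * n_heads * ((qk_nope_dim + qk_rope_dim) + v_dim) * bytes_per_elem
# MLA: only the compressed latent plus the shared RoPE key per layer, per token.
mla_per_token = n_layers * (kv_lora_rank + qk_rope_dim) * bytes_per_elem

print(f"standard: {mha_per_token * n_ctx / 2**30:.1f} GiB")  # ~76 GiB
print(f"MLA:      {mla_per_token * n_ctx / 2**30:.2f} GiB")  # ~1.1 GiB
```

The 4 GB I quoted presumably also includes compute buffers or a higher-precision cache, but the order-of-magnitude drop matches.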

Thanks in advance!

DevQuasar org

Thanks for the suggestion.
Yes, it's possible.
Just for my education, can you please point me to the release that introduced the feature?
https://github.com/ggml-org/llama.cpp/releases

Thanks! It came from this PR, which was merged three weeks ago.

https://github.com/ggml-org/llama.cpp/pull/12801

The specific commit was https://github.com/ggml-org/llama.cpp/commit/daa422881a0ec7944771bcc8ff8de34d11f5bd3b

Most issues reported after the merge have been fixed by now.
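
If it helps, the re-do would look roughly like this (a sketch with placeholder paths, using the standard llama.cpp convert and quantize tools; as I understand the PR, the MLA tensors are created at conversion time, so existing GGUFs can't simply be patched):

```python
# Sketch of the re-quant workflow (placeholder paths). The model is
# re-converted from the original HF weights with an up-to-date llama.cpp
# checkout that includes PR #12801, then re-quantized from that.
import subprocess

hf_model_dir = "path/to/DeepSeek-V3"   # placeholder: original HF weights
base_gguf = "deepseek-v3-bf16.gguf"    # placeholder: full-precision intermediate

# 1) Convert HF -> GGUF with the updated converter (this adds the MLA tensors).
subprocess.run(["python", "convert_hf_to_gguf.py", hf_model_dir,
                "--outfile", base_gguf, "--outtype", "bf16"], check=True)

# 2) Re-quantize each target type from the fresh conversion.
for qtype in ["Q2_K", "Q3_K_M", "Q4_K_M", "Q5_K_M", "Q6_K"]:
    subprocess.run(["./llama-quantize", base_gguf,
                    f"deepseek-v3-{qtype}.gguf", qtype], check=True)
```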

DevQuasar org

@Panchovix It's in the making (will take a while)

@csabakecskemeti amazing, many thanks! I can run Q3_K_M and Q4_K_M, so I'll wait patiently but expectantly for those!

DevQuasar org

@Panchovix the updated Q2_K quant has been uploaded. Could you please double-check that it supports all the features you've mentioned? Thanks
The rest of the quants are still uploading.

DevQuasar org

Q3 and Q4 have been re-uploaded too.

Many thanks! I will download Q4_K_M and let you know how it goes!
EDIT: It works fine with MLA + FA, many thanks!
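
For anyone else wanting to test, a minimal launch sketch (placeholder filename; assuming the standard llama-cli flags):

```python
# Sketch of the test run (placeholder filename). MLA is picked up
# automatically when the GGUF contains the new tensors; -fa additionally
# enables flash attention.
import subprocess

subprocess.run(["./llama-cli",
                "-m", "deepseek-v3-Q4_K_M.gguf",  # placeholder: the re-done quant
                "-c", "16384",                    # 16K context
                "-ngl", "99",                     # offload all layers to GPU
                "-fa",                            # flash attention on
                "-p", "Write a quicksort in Python."], check=True)
```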

Just finished the Q5 and Q6 quants; by tomorrow those should be re-uploaded as well!
Any need for Q8?

Many thanks!!!

Personally I can't run Q8, but I'm not sure how many people want it either D:

@Panchovix Is the Q4_K any good?

I ran the Q2_K and found the code output worse than a UD-1s quant of the full DeepSeek-V3.5 model.

P.S. Thanks for the MLA quants @csabakecskemeti !

@gghfez I feel it is better than the 1-bit quants, but not above the Q2_K_XL quant.

DevQuasar org

Q6 will be updated by tomorrow
