Any chance to re-do the quants for MLA?
Hi there, thanks for your work quantizing. I was wondering if it would be possible to re-do these quants with the latest mainline llama.cpp, as it now supports MLA, which greatly reduces the VRAM usage of the KV cache: for example, at 16K context it drops from about 80GB to 4GB.
Thanks in advance!
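For a rough sense of where those numbers come from, here is a back-of-the-envelope sketch (the model dimensions and fp16 cache are assumptions on my part, roughly DeepSeek-R1-style, so treat the figures as illustrative rather than measured):

```python
# Back-of-the-envelope KV-cache size, assuming DeepSeek-R1-style dimensions
# (61 layers, 128 heads, 128+64 key dims, 128 value dim, 512 latent rank)
# and an fp16 cache. Illustrative only.
n_ctx        = 16384
n_layers     = 61
n_heads      = 128
k_head_dim   = 128 + 64   # "nope" part + decoupled RoPE part
v_head_dim   = 128
kv_lora_rank = 512        # MLA compressed latent size per token per layer
rope_dim     = 64         # per-token RoPE key, shared across heads
bytes_per_el = 2          # fp16

# Old behaviour: full per-head K and V are materialized and cached.
full_kv = n_ctx * n_layers * n_heads * (k_head_dim + v_head_dim) * bytes_per_el

# MLA: only the compressed latent + RoPE key are cached per token per layer.
mla_kv = n_ctx * n_layers * (kv_lora_rank + rope_dim) * bytes_per_el

print(f"full KV cache: {full_kv / 1024**3:6.1f} GiB")  # ~76 GiB
print(f"MLA  KV cache: {mla_kv  / 1024**3:6.1f} GiB")  # ~1.1 GiB
```

The exact savings you see in practice will vary with cache precision and runtime overheads, but the order-of-magnitude drop is the point.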
Thanks for the suggestion.
Yes it's possible.
Just for my education, can you please point me to the release that introduced the feature?
https://github.com/ggml-org/llama.cpp/releases
Thanks! It was from this PR, which got merged 3 weeks ago.
And specific commit was https://github.com/ggml-org/llama.cpp/commit/daa422881a0ec7944771bcc8ff8de34d11f5bd3b
Most issues mentioned after the merge have been fixed by now.
@csabakecskemeti amazing, many thanks! I can run Q3_K_M and Q4_K_M, so will wait patiently but expectantly for those!
@Panchovix
The updated Q2_K quant has been uploaded. Could you please double check whether it supports all the features you've mentioned? Thanks
The rest of the quants are still uploading.
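For anyone wanting to reproduce this, the flow is roughly: re-convert the original weights with a recent llama.cpp checkout (so the MLA-specific tensors end up in the GGUF), then quantize as usual. A minimal sketch, where the paths, model location, and output filenames are placeholders, not the exact commands I ran:

```python
# Sketch of the re-quant flow with a recent llama.cpp checkout.
# Paths, model location and output filenames are placeholders.
import subprocess

LLAMA_CPP = "/path/to/llama.cpp"          # checkout that includes the MLA commit
HF_MODEL  = "/path/to/original-hf-model"  # original (unquantized) weights

# 1) Re-convert to GGUF so the MLA-related tensors are written out.
subprocess.run(
    ["python", f"{LLAMA_CPP}/convert_hf_to_gguf.py", HF_MODEL,
     "--outfile", "model-f16.gguf", "--outtype", "f16"],
    check=True,
)

# 2) Quantize the fresh GGUF to the desired size.
subprocess.run(
    [f"{LLAMA_CPP}/build/bin/llama-quantize",
     "model-f16.gguf", "model-Q4_K_M.gguf", "Q4_K_M"],
    check=True,
)
```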
Q3 and Q4 have been re-uploaded too.
Many thanks! I will download Q4_K_M and let you know how it goes!
EDIT: It works fine with MLA + FA, many thanks!
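In case it helps anyone else, roughly how I'm launching it (model path, context size, and GPU layer count are just examples; as far as I understand, recent builds pick up MLA automatically from the re-converted GGUF, and `-fa` enables flash attention):

```python
# Sketch of launching llama-server on the re-quantized GGUF.
# Model path, context size and GPU layer count are placeholders.
import subprocess

subprocess.run([
    "llama-server",
    "-m", "model-Q4_K_M.gguf",   # MLA-enabled quant from the re-conversion
    "-c", "16384",               # 16K context
    "-fa",                       # flash attention
    "-ngl", "99",                # offload as many layers as your VRAM allows
], check=True)
```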
Just finished the Q5 and Q6 quants; by tomorrow those should be re-uploaded as well!
Any need for q8?
Many thanks!!!
Personally I can't run Q8, but I'm not sure how many people want it either D:
@Panchovix Is the Q4_K any good?
I ran the Q2_K and found the code output worse than a UD-1s quant of the full DeepSeek-V3.5 model.
P.S. Thanks for the MLA quants @csabakecskemeti !
Q6 will be updated by tomorrow