Any chance to re-do the quants for MLA?

by Panchovix - opened May 6

May 6

Hi there, thanks for your work quantizing. I was wondering if it was possible to re-do these quants with latest mainline llamacpp, as it uses MLA and it reduces the VRAM usage by a lot of the KV Cache, for example for 16K going from 80GB VRAM to 4GB.

Thanks in advance!

csabakecskemeti

DevQuasar org May 7

Thanks for the suggestion.
Yes it's possible.
Just for my education, can you please point me which release introduced the feature?
https://github.com/ggml-org/llama.cpp/releases

Panchovix

May 7

Thanks! It was from this PR, that got merged 3 weeks ago.

https://github.com/ggml-org/llama.cpp/pull/12801

Panchovix

May 7

And specific commit was https://github.com/ggml-org/llama.cpp/commit/daa422881a0ec7944771bcc8ff8de34d11f5bd3b

Most issues mentioned after the nerge are fixed by now.

csabakecskemeti

DevQuasar org May 11

@Panchovix It's in the making (will take a while)

Panchovix

May 11

@csabakecskemeti amazing, many thanks! I can run Q3_K_M and Q4_K_M, so will wait patiently but expectantly for those!

csabakecskemeti

DevQuasar org May 12

@Panchovix updated Q2_K quant has uploaded. Could you please double check if this supports all feature you've mentioned? Thanks
The rest of the quants are under upload.