What are folks' opinions on Q4_K_M quants? Are they viable?

by Permahuman

Question is in the discussion title. I really want the whole model to fit on a 3090 without offloading to RAM, to get those legendary token generation speeds. I have heard that quality may drop below Q6 or Q5 quants. Has anyone tried a Q4_K_M quant yet? I have a really bad rural internet connection, so I would really appreciate some feedback.
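For what it's worth, here is my rough back-of-envelope check (a sketch only, assuming a ~30B parameter model and roughly 4.85 effective bits per weight for Q4_K_M; the real numbers vary by model and quant mix):

```python
# Rough VRAM estimate for a Q4_K_M quant on a 24 GB card.
# The bits-per-weight and overhead figures are assumptions, not measurements.

params_b = 30.5          # model size in billions of parameters (assumed)
bpw = 4.85               # approximate effective bits per weight for Q4_K_M (assumed)
kv_and_overhead_gb = 3.0 # KV cache + compute buffers at a modest context (assumed)

weights_gb = params_b * 1e9 * bpw / 8 / 1e9
total_gb = weights_gb + kv_and_overhead_gb

print(f"weights ~{weights_gb:.1f} GB, total ~{total_gb:.1f} GB vs 24 GB VRAM")
# weights ~18.5 GB, total ~21.5 GB -> looks like it should fit without RAM offload
```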

I have some KLD statistics on various ~4 bpw quants for this model. For mainline llama.cpp/ollama/lmstudio/koboldcpp etc., you can't go wrong with bartowski's Q4_K_S or Q4_K_M quants; both seem like solid performers for the 3090 24 GB VRAM club (of which I'm a member).

https://www.reddit.com/r/LocalLLaMA/comments/1kcp34g/ubergarmqwen330ba3bgguf_1600_toksec_pp_105_toksec/
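For anyone curious what those KLD numbers measure: roughly speaking, it's the KL divergence between the full-precision model's next-token distribution and the quantized model's, averaged over a test corpus (lower means the quant tracks the original more closely). Here is a toy sketch of that computation, with made-up logits standing in for the real per-token outputs; this is just to illustrate the statistic, not the actual tooling:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_kld(base_logits, quant_logits):
    """Mean KL(base || quant) per token; lower means the quant stays closer to the original."""
    p = softmax(base_logits)
    q = softmax(quant_logits)
    kld_per_token = (p * (np.log(p) - np.log(q))).sum(axis=-1)
    return kld_per_token.mean()

# Toy data: 4 tokens over a 5-word vocab; real runs use per-token logits over the full vocab.
rng = np.random.default_rng(0)
base = rng.normal(size=(4, 5))
quant = base + rng.normal(scale=0.1, size=(4, 5))  # quantization modeled as small logit noise (assumed)
print(f"mean KLD: {mean_kld(base, quant):.4f}")
```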

If you're adventurous, there are some improvements happening with the new iqN_k quant types, which don't exist for mainline llama.cpp flavors. You can also run ik_llama.cpp with the bartowski quants you already have downloaded for some speedups (no waiting on your internet connection).

Have fun!
