What are folks' opinions on Q4_K_M quants? Are they viable?
Question is in the title. I really want the whole model to fit on a 3090 without offloading to RAM, to get those legendary token generation speeds. I've heard that quality can drop off below Q5 or Q6. Has anyone tried a Q4_K_M quant yet? I'm on a really bad rural internet connection, so I'd really appreciate some feedback before I commit to the download.
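For a rough size check: Q4_K_M comes out around 4.8 bits per weight, so for example a 32B model's weights would be about 32e9 × 4.85 / 8 ≈ 19 GB, leaving some headroom for KV cache on a 24 GB card. Here's roughly how I'd grab and run one if it's viable (a sketch; the repo and file names below are placeholders for whatever model you're after):

```bash
# Resumable download of just the Q4_K_M file. huggingface-cli resumes
# interrupted downloads, which helps on a flaky rural connection.
huggingface-cli download bartowski/SomeModel-GGUF \
  --include "*Q4_K_M*" --local-dir ./models

# Run fully on the 3090: -ngl 99 offloads all layers to the GPU,
# -fa enables flash attention to trim KV-cache overhead.
llama-server -m ./models/SomeModel-Q4_K_M.gguf -ngl 99 -fa -c 8192
```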
I have some KLD statistics on various ~4 bpw quants for this model, and for mainline llama.cpp/ollama/lmstudio/koboldcpp etc. you can't go wrong with bartowski's Q4_K_S or Q4_K_M: both seem like solid performers for the 3090 24GB VRAM club (like myself).
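If you want to reproduce that kind of KLD comparison yourself, mainline llama.cpp's perplexity tool can generate it. A sketch (corpus and file names are placeholders; you need a full-precision GGUF as the reference):

```bash
# 1) Save reference logits from the full-precision model over a test corpus:
llama-perplexity -m model-f16.gguf -f wiki.test.raw \
  --kl-divergence-base logits-f16.bin

# 2) Score the quant against those saved logits; this reports KLD stats
#    (mean, max, percentiles) plus top-token agreement:
llama-perplexity -m model-Q4_K_M.gguf -f wiki.test.raw \
  --kl-divergence-base logits-f16.bin --kl-divergence
```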
If you're adventurous, there are some improvements happening with the new iqN_k quants, which don't exist for mainline llama.cpp flavors. You can also run ik_llama.cpp with the bartowski quants you've already downloaded for some speedups (no new download needed); a rough build-and-run sketch is below.
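A minimal sketch, assuming a CUDA build; ik_llama.cpp keeps mainline's CLI, so the same flags work on your existing GGUF (the model file name is a placeholder):

```bash
# Build ik_llama.cpp with CUDA support, same cmake flow as mainline:
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Point it at a quant you already have; no re-download needed:
./build/bin/llama-server -m ../models/SomeModel-Q4_K_M.gguf -ngl 99 -fa
```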
Have fun!