What are folks' opinions on Q4_K_M quants? Are they viable?

by Permahuman

Question is in the discussion title. I really want the whole model to fit on a 3090 without offloading to RAM, to get those legendary token generation speeds. I have heard that quality may drop below Q6 or Q5 quants. Has anyone tried a Q4_K_M quant yet? I have a really bad rural internet connection, so I would really appreciate some feedback.
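For what it's worth, here is my rough back-of-envelope check (a sketch only, assuming a ~30B parameter model and roughly 4.85 effective bits per weight for Q4_K_M; the real numbers vary by model and quant mix):

```python
# Rough VRAM estimate for a Q4_K_M quant on a 24 GB card.
# The bits-per-weight and overhead figures are assumptions, not measurements.

params_b = 30.5          # model size in billions of parameters (assumed)
bpw = 4.85               # approximate effective bits per weight for Q4_K_M (assumed)
kv_and_overhead_gb = 3.0 # KV cache + compute buffers at a modest context (assumed)

weights_gb = params_b * 1e9 * bpw / 8 / 1e9
total_gb = weights_gb + kv_and_overhead_gb

print(f"weights ~{weights_gb:.1f} GB, total ~{total_gb:.1f} GB vs 24 GB VRAM")
# weights ~18.5 GB, total ~21.5 GB -> looks like it should fit without RAM offload
```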

I have some KLD statistics on various ~4 bpw quants for this model. For mainline llama.cpp/ollama/lmstudio/koboldcpp etc., you can't go wrong with bartowski's Q4_K_S or Q4_K_M quants; both seem like solid performers for the 3090 24 GB VRAM club (of which I'm a member).

https://www.reddit.com/r/LocalLLaMA/comments/1kcp34g/ubergarmqwen330ba3bgguf_1600_toksec_pp_105_toksec/
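For anyone curious what those KLD numbers measure: roughly speaking, it's the KL divergence between the full-precision model's next-token distribution and the quantized model's, averaged over a test corpus (lower means the quant tracks the original more closely). Here is a toy sketch of that computation, with made-up logits standing in for the real per-token outputs; this is just to illustrate the statistic, not the actual tooling:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_kld(base_logits, quant_logits):
    """Mean KL(base || quant) per token; lower means the quant stays closer to the original."""
    p = softmax(base_logits)
    q = softmax(quant_logits)
    kld_per_token = (p * (np.log(p) - np.log(q))).sum(axis=-1)
    return kld_per_token.mean()

# Toy data: 4 tokens over a 5-word vocab; real runs use per-token logits over the full vocab.
rng = np.random.default_rng(0)
base = rng.normal(size=(4, 5))
quant = base + rng.normal(scale=0.1, size=(4, 5))  # quantization modeled as small logit noise (assumed)
print(f"mean KLD: {mean_kld(base, quant):.4f}")
```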

If you're adventurous, there are some improvements happening with the new iqN_k quant types, which don't exist for mainline llama.cpp flavors. You can also run ik_llama.cpp with the bartowski quants you already have downloaded for some speedups (no waiting on your internet connection).

Have fun!
