IQ2_KS
Thanks for doing these, I'm looking forward to trying this model!
Are you doing an IQ2_KS for this one?
(I'm using your IQ2_KS for the previous release with 256GB RAM + 6x24GB VRAM)
Is IQ2_ks good enough for you in terms of quality?
I'll raise my hand for IQ2_KS
as well. :-)
Is IQ2_ks good enough for you in terms of quality?
I don't know yet for this one, but for K2, yes. Specifically Ubergarm's IQ2_ks is the only way I can run it locally without it being obviously lobotomized.
That quant/model is able to find logic issues in my fairly bespoke coding projects that Opus 4.1 misses and it's my favorite model for creative writing.
I just tried out the unsloth IQ2_XXS, regenerating the last response in my K2 chats, and it's a lot worse: it misses bugs K2 found, is inattentive for creative writing, etc. It also uses more memory / I have to place more tensors on CPU.
Hopefully an IQ2_KS will be as great as the K2 one.
Dealing with some hardware stuff, but got the imatrix uploaded, I'll prioritize cooking the IQ2_KS first and then do some other sizes.
Thanks and appreciate the feedback!
Also heads up @Thireus - the new imatrix is up as you saw already, but while using it now I notice it is missing importance weights for the first ffn_(gate|down|up) dense layer (blk 0 only on Kimi-K2) as well as the shared expert ffn_(gate|down|up)_shexp. I'll be leaving those all full q8_0 for this round given that, and probably leave the attn all q8_0 as well given it is a small percentage of overall weights more or less and the original seemed quite sensitive to quantization there.
example messages during quantizing:
====== llama_model_quantize_internal: did not find weights for blk.0.ffn_gate.weight
...
====== llama_model_quantize_internal: did not find weights for blk.56.ffn_up_shexp.weight
Seems to have everything it needs for the routed exps which are the most important given we're quantizing those the most.
Also I was unable to run imatrix with --layer-importance as it gave this error:
llama_kv_cache_init: CPU KV buffer size = 34.31 MiB
llama_new_context_with_model: KV self size = 34.31 MiB, c^KV (f16): 34.31 MiB, kv^T: not used
llama_new_context_with_model: CPU output buffer size = 0.63 MiB
llama_new_context_with_model: CPU compute buffer size = 334.00 MiB
llama_new_context_with_model: graph nodes = 3340
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 192 (n_threads_batch = 384) / 768 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 551.937 ms
compute_imatrix: computing over 826 chunks with batch_size 512
================= Adjusted mainline llama.cpp MLA tensors to ik_llama.cpp
======================================= HAVE_FANCY_SIMD is defined
Oops, inconsistent ffn vs last_input size
This Oops may be related to the missing importance weights above, but I didn't have time to try to debug further.
FWIW I used the triton-cpu method to cast the fp8 safetensors to bf16. Then I used mainline llama.cpp convert_hf_to_gguf.py, and then switched over to ik_llama.cpp for quantizing the pure q8_0, computing the imatrix from it, then quantizing the rest from the bf16 gguf.
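Roughly, that pipeline looks something like the commands below. This is only a sketch: the paths, calibration text file, and output filenames are placeholders, and flag spellings can drift between llama.cpp and ik_llama.cpp versions, so double-check against --help on your build.
# mainline llama.cpp: convert the bf16 safetensors to a bf16 GGUF
python convert_hf_to_gguf.py /models/Kimi-K2-Instruct-0905-bf16 --outtype bf16 --outfile /models/Kimi-K2-Instruct-0905-BF16.gguf
# ik_llama.cpp: make a pure q8_0 to run the imatrix on, then compute the imatrix
./build/bin/llama-quantize --pure /models/Kimi-K2-Instruct-0905-BF16.gguf /models/Kimi-K2-Instruct-0905-Q8_0.gguf Q8_0
./build/bin/llama-imatrix -m /models/Kimi-K2-Instruct-0905-Q8_0.gguf -f calibration.txt -o imatrix.dat
# finally, quantize the release mixes from the bf16 GGUF with that imatrix
./build/bin/llama-quantize --imatrix imatrix.dat /models/Kimi-K2-Instruct-0905-BF16.gguf /models/Kimi-K2-Instruct-0905-IQ2_KS.gguf IQ2_KS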
Okie folks, first one is uploaded: IQ2_KS 289.820 GiB (2.425 BPW) !!!
It is a bit heavy for VRAM given all the attn/first dense layer/shared expert is full Q8_0 but will give best quality despite smaller routed exps. I'll have a few sizes up available by later today if all goes well.
Cheers!
Okie folks, first one is uploaded: IQ2_KS 289.820 GiB (2.425 BPW) !!!
It is a bit heavy for VRAM given all the attn/first dense layer/shared expert is full Q8_0 but will give best quality despite smaller routed exps. I'll have a few sizes up available by later today if all goes well.
Cheers!
Awesome! Thank you! Do you know if --jinja (tool calling) will work for this? Is there a sample startup command you could kindly share?
I recall there were some differences in which of the first few layers get offloaded to GPU with the -ot parameter between Qwen and DS. Does this one follow DS or Qwen in terms of offloading the first few layers?
Awesome! Thank you! Do you know if --jinja (tool calling) will work for this? Is there a sample startup command you could kindly share?
It should work, and you can probably also pass in the official Kimi jinja file (or any edited or updated custom one you find or have) if you want: https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905/blob/main/chat_template.jinja
Read this PR for details on which endpoints support what kind of completions and tool calling: https://github.com/ikawrakow/ik_llama.cpp/pull/723 and be mindful of whether you're using /v1/completions or /completions etc. as they seem to have different behavior, I suppose. Still need to figure out the exact details myself, sorry no easy examples for you.
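If it helps, here is a rough sketch of what that could look like, assuming the ik_llama.cpp server accepts the mainline-style --jinja and --chat-template-file flags and exposes the OpenAI-compatible /v1/chat/completions endpoint (model path, template path, and port here are placeholders):
./build/bin/llama-server -m Kimi-K2-Instruct-0905-IQ2_KS-00001-of-00007.gguf \
  --jinja --chat-template-file chat_template.jinja \
  --host 0.0.0.0 --port 8080
# then hit the OpenAI-compatible chat endpoint
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello"}]}'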
I recall there were some differences in which of the first few layers get offloaded to GPU with the -ot parameter between Qwen and DS. Does this one follow DS or Qwen in terms of offloading the first few layers?
Yes this is the basic idea so far:
- DeepSeek has 3 dense layers and 1 shared expert
- Kimi-K2 has 1 dense layer and 1 shared expert
- Qwen has no dense layers and no shared expert
So tl;dr, for Kimi-K2 start with blk 1 for the -ot ... e.g. refer to the model card quick start examples for a two-GPU example to get you started dialing it in for your rig.
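As an illustrative single-GPU fragment (the layer range is just an example, not a recommendation): blk.0, the dense layer, rides along with -ngl, while the routed experts from blk.1 onward get placed explicitly:
-ngl 99 \
-ot "blk\.(1|2|3|4)\.ffn_.*=CUDA0" \
-ot exps=CPU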
Thank you so much @ubergarm !
I was remotely able to queue the download on my AI rig while on a plane returning from a work trip. It should be ready by the time I land. Can't wait to go and play with it once I reach home!
It is only 350 gigs, currently I have 512GB of RAM and 5x5090s. I will try to offload as much as I can and see how it goes.
My fav model has been Qwen3-235B thinking with 256K that perfectly fits on my GPUs in IQ4_XS format up until now, especially because of the speed and quality that I get from it. But the difference between 1T and 235B is huge. Let me see if I can sacrifice speed for quality :)
currently I have 512GB of RAM and 5x5090s
Oh interesting, you kept the 5090s and sold the 6000 Pros then? I am uploading smol-IQ4_KSS now and cooking one size bigger, smol-IQ5_KS, to release later today so you can try whatever you like. The perplexity graph will be filled in slowly, and the baseline Q8_0 perplexity will take some time to calculate, probably running overnight.
Definitely keep us posted on your speed vs accuracy experience playing with the new models and if you get a good command for tool calling going (and possibly draft model as well hah)
I indeed did. That was a very careful decision after a lot of thinking. With PCIe 5 and an Intel Xeon Sapphire Rapids I was not worried about memory bandwidth. And for the price of one 6000 Pro, I was able to squeeze in 5x 5090s. And yes, I got a nice deal on my 5090s and a good price selling my 6000 Pros for this to work :)
I will try some other models too! Disk space is slowly becoming an issue now though! lol, two 4TB SSDs and one 2TB SSD are almost full. And it is so heartbreaking to delete old models!
@ubergarm
Hi John
Great work as always!
It is a bit heavy for VRAM given all the attn/first dense layer/shared expert is full Q8_0 but will give best quality despite smaller routed exps.
For my education, why are you giving it full Q8_0 instead of e.g. Q6_K on a IQ2_KS? Is it something specific with Kimi K2, or is it something you generally do?
I got it running and I am getting pretty decent speeds on the IQ2_KS: 12-14 tk/s. Below is the server startup command I used.
CUDA_VISIBLE_DEVICES="0,1,2,3,4" ./build/bin/llama-server \
--model /media/mukul/t7/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/IQ2_KS/Kimi-K2-Instruct-0905-IQ2_KS-00001-of-00007.gguf \
--alias ubergarm/IQ2_KS/Kimi-K2-Instruct-0905 \
--ctx-size 65536 \
-ctk q8_0 \
-fa -fmoe \
-mla 3 \
-b 4096 -ub 4096 \
-ngl 99 \
-ot "blk\.([1-3])\.ffn=CUDA0" \
-ot "blk\.([4-6])\.ffn=CUDA1" \
-ot "blk\.([7-9])\.ffn=CUDA2" \
-ot "blk\.(1[0-3])\.ffn=CUDA3" \
-ot "blk\.(1[4-6])\.ffn=CUDA4" \
-ot exps=CPU \
--parallel 1 \
--threads 56 \
--threads-batch 64 \
--host 0.0.0.0 \
--port 10002
Below is the VRAM consumption:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.76.05 Driver Version: 580.76.05 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 5090 Off | 00000000:16:00.0 Off | N/A |
| 33% 54C P1 84W / 400W | 26502MiB / 32607MiB | 4% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 5090 Off | 00000000:40:00.0 Off | N/A |
| 0% 43C P1 83W / 400W | 25502MiB / 32607MiB | 4% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA GeForce RTX 5090 Off | 00000000:6A:00.0 On | N/A |
| 0% 44C P1 96W / 400W | 26538MiB / 32607MiB | 5% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA GeForce RTX 5090 Off | 00000000:94:00.0 Off | N/A |
| 30% 47C P1 80W / 400W | 30532MiB / 32607MiB | 7% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA GeForce RTX 5090 Off | 00000000:BF:00.0 Off | N/A |
| 0% 48C P1 87W / 400W | 26444MiB / 32607MiB | 5% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
Working in Roo really well. 2x RTX 6000 setup = ~25t/s
Working in Roo really well. 2x RTX 6000 setup = ~25t/s
@tacos4me That is quite impressive! What is the startup command you used, and what is the rest of your machine configuration, please?
EPYC 9115 + 12x64GB-5600
numactl -N 0 -m 0
./build/bin/llama-server
--model "$model"
--alias ubergarm/Kimi-K2-Instruct-0905
--ctx-size 98304
-ctk q8_0
-fa -fmoe
-mla 3
-ngl 99
-ot "blk.(1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16).ffn_.=CUDA0"
-ot "blk.(19|20|21|22|23|24|25|26|27|28|29|30|31|32|33|34).ffn_.=CUDA1"
-ot exps=CPU
--parallel 1
--threads 24
--threads-batch 48
--numa numactl
--host 0.0.0.0
--port 8080
Hmm, interesting that you are able to offload 34 layers on 196 GB of VRAM and I am only able to offload 16 layers on 160 GB of VRAM, what magic did you do there! Please advise!
EDIT: I noticed that you did not offload layers 17-18, so technically you offloaded 32 layers. But that is still quite a lot!
Try building ik_llama with these?
-DGGML_CUDA=ON
-DGGML_BLAS=OFF
-DCMAKE_CUDA_ARCHITECTURES="120"
-DGGML_MAX_CONTEXTS=2048
-DGGML_AVX=ON
-DGGML_SCHED_MAX_COPIES=1
-DGGML_AVX2=ON
Probably -DGGML_SCHED_MAX_COPIES=1 is the one that matters most.
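Putting those flags together, a full configure/build looks something like this (CUDA architecture 120 assumes Blackwell cards like the 5090s, so adjust it for your GPUs):
cmake -B build -DGGML_CUDA=ON -DGGML_BLAS=OFF -DCMAKE_CUDA_ARCHITECTURES="120" \
  -DGGML_MAX_CONTEXTS=2048 -DGGML_AVX=ON -DGGML_AVX2=ON -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j $(nproc)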
IQ2_KS was a drop in replacement for me, thanks for the quick quantization!
3x3090, 1x4090, 256gb DDR5 Intel Sapphire Rapids QYFS
16t/s, standard // 21t/s with -ser 4,1!
also, for kimi, I've found I get better PP without the "-b 4096 -ub 4096" flags, which is strange because those usually boost my PP by a ton on other models... kimi has always had rather slow PP for me, right now it's around 75t/s
Time to put this model through its paces -
Yay, glad to see it is running for folks! Having a network issue right now, once I can get into the remote rig again will try to finish up a few more sizes for now then consider any "specialty" quants.
For my education, why are you giving it full Q8_0 instead of e.g. Q6_K on a IQ2_KS? Is it something specific with Kimi K2, or is it something you generally do?
Good question. It's all trade-offs. Here are a few things going through my mind when making these decisions:
- The routed experts are the bulk of the model size and the most important to quantize to reduce overall size. The attn/first dense layer/shexp are relatively small in terms of overall model size, so it's okay to leave them at the larger q8_0.
- Kimi-K2 is about the biggest open weights model, so people who have enough RAM+VRAM to use it probably will have at least 24GB. This is not always true, but given this using full Q8_0 for attn/first dense layer/shexp is fine.
- There will be some penalty to TG given for the most part TG is memory bandwidth limited and keeping attn/first dense layer/shexp at full q8_0 does increase the overall size of the active parameters during TG. But given they are offloaded onto GPU VRAM anyway it probably won't bottleneck as bad as the system RAM with slower memory bandwidth.
- For whatever reason I've yet to go back and look into the imatrix issue and why there doesn't seem to be data for the first dense layer and shexp. Because of this in this specific case I'm loathe to quantize those tensors given they won't have imatrix data. This is an unusual circumstance just in this one case and not a general issue.
- Assuming I did have imatrix data for those, I typically like to go iq5_ks for those, or sometimes iq6_k etc.
- As for attention tensors, generally we're looking at (q|k|v|o), and (k|v) are very small so there's not a ton of benefit from shrinking those anyway. The usual advice is to keep (k|v) one step up from (q|o), so I could consider going with something like iq6_k for attn_output, especially as it is fairly large relative to the others. This is an MLA quant though, so it's not quite as simple as a dense model with "normal" (q|k|v|o) attention tensors. So it's easier to keep them q8_0 for full quality and minimize perplexity without a huge TG speed trade-off in many cases.
- In the past on the earlier Kimi-K2 I had quantized some of these tensors more, but then compared and felt like this model is kind of sensitive as it has only one first dense layer and a single shared expert and decided to go in more on the quality side of the trade-off vs the speed side.
I forget who it was who had only 12 or 16GB VRAM but a bunch of RAM, but for a system like that with a single 3060TI 16GB or 5080TI etc they might benefit from some quantizing here just to fit more kv-cache. So if there is a request for a "speed mix" or similar I could try to quantize those tensors a little bit to go towards the speed side of the trade-off at the cost of quality.
Finally, Bartowski, mradermacher, and unsloth generally use more "standard" blends built into llama.cpp, so they cover a lot of those quantization recipes already. Given I only work with --custom-q style quantization and recipes, I try to specialize in max quality for a given memory footprint.
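For a feel of what a --custom-q style recipe can look like, here is a hedged sketch; the regexes and type choices below are only an illustration of the idea, not the actual recipe used for this release:
./build/bin/llama-quantize --imatrix imatrix.dat \
  --custom-q "attn=q8_0,blk\.0\.ffn_(gate|down|up)=q8_0,ffn_(gate|down|up)_shexp=q8_0,ffn_(gate|down|up)_exps=iq2_ks" \
  Kimi-K2-Instruct-0905-BF16.gguf Kimi-K2-Instruct-0905-IQ2_KS.gguf IQ2_KS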
Probably -DGGML_SCHED_MAX_COPIES=1
Interestingly ik made that the default just this past week: https://github.com/ikawrakow/ik_llama.cpp/pull/751
Definitely a must for MLA models running on multi-GPU configurations to avoid OOMing with huge CUDA buffers. It's fine to leave it explicit though!
IQ2_KS was a drop in replacement for me, thanks for the quick quantization!
3x3090, 1x4090, 256gb DDR5 Intel Sapphire Rapids QYFS
16t/s, standard // 21t/s with -ser 4,1!
also, for kimi, I've found I get better PP without the "-b 4096 -ub 4096" flags, which is strange because those usually boost my PP by a ton on other models... kimi has always had rather slow PP for me, right now it's around 75t/s
Time to put this model through its paces -
Lol now this is bothering me!
Why am I unable to get such speeds :)
I have QYFS myself. And 5x5090
512GB 4800 MHz DDR5.
Would you please be able to give me your build and start-up command?
I'll be honest, I haven't rebuilt my ik_llama in weeks; here's my startup command... maybe drop the "-b 4096 -ub 4096", lower your ctx to 32k, or keep 64k ctx and quantize it to q4_0, and shove more layers onto your GPUs. You're only getting 4 more layers than me... also, are all of your mobo RAM slots full? 8-channel mobo?
/home/phone/Documents/ik_llama.cpp/build/bin/llama-server
--model /home/phone/Downloads/LocalModels/Kimi-K2-Instruct-0905-IQ2_KS-00001-of-00007.gguf
--alias ubergarm/Kimi-K2-Instruct-0905-IQ2_KS
--ctx-size 32768
-ctk q8_0
-mla 3 -fa -fmoe
-ngl 99
-ot "blk.(1|2|3).ffn_.=CUDA0"
-ot "blk.(4|5|6).ffn_.=CUDA1"
-ot "blk.(7|8|9).ffn_.=CUDA2"
-ot "blk.(10|11|12).ffn_.=CUDA3"
-ot exps=CPU
--parallel 1
--threads 48
--threads-batch 56
--host 0.0.0.0
--port 8081
--no-mmap
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.76.05 Driver Version: 580.76.05 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 Off | 00000000:16:00.0 On | Off |
| 0% 36C P8 21W / 550W | 21571MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 Off | 00000000:40:00.0 Off | N/A |
| 0% 51C P8 19W / 350W | 21044MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA GeForce RTX 3090 Off | 00000000:6A:00.0 Off | N/A |
| 0% 48C P8 8W / 350W | 21044MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA GeForce RTX 3090 Off | 00000000:94:00.0 Off | N/A |
| 40% 37C P8 25W / 350W | 21598MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
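For what it's worth, adapting that advice to a 5x 32GB 5090 setup might look roughly like the fragment below; the layer counts per GPU are only a guess and would need tuning against actual VRAM headroom:
--ctx-size 65536 \
-ctk q4_0 \
-ot "blk\.(1|2|3|4)\.ffn_.*=CUDA0" \
-ot "blk\.(5|6|7|8)\.ffn_.*=CUDA1" \
-ot "blk\.(9|10|11|12)\.ffn_.*=CUDA2" \
-ot "blk\.(13|14|15|16)\.ffn_.*=CUDA3" \
-ot "blk\.(17|18|19|20)\.ffn_.*=CUDA4" \
-ot exps=CPU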
Given I only work with --custom-q style quantization and recipes I try to specialize in max quality for a given memory footprint.
That is why I asked you. I'm using this for translations, so for me quality is also more of a concern than speed.
Here are a few things going through my mind when making these decisions ...
felt like this model is kind of sensitive as it has only one first dense layer and a single shared expert and decided to go in more on the quality side of the trade-off vs the speed side
This makes so much sense. Thank you!
EPYC 9115 + 12x64GB-5600
@tacos4me I've been thinking of getting an Epyc 9124 (+motherboard, + ram) to replace my painfully slow 2 channel memory motherboard.
I'm a newb when it comes to server-grade hardware. Would you or anyone else have some tips?
Do you have a single CPU or dual? What kind of motherboard do you have?
Thanks!
EPYC 9115 + 12x64GB-5600
@tacos4me I've been thinking of getting an Epyc 9124 (+motherboard, + ram) to replace my painfully slow 2 channel memory motherboard.
I'm a newb when it comes to server-grade hardware. Would you or anyone else have some tips?
Thanks!
Take a look at this channel and this video :) no self promotion I promise! ;)
https://www.youtube.com/watch?v=Xui3_bA26LE&list=PLteHam9e1Fecmd4hNAm7fOEPa4Su0YSIL&index=8
Take a look at this channel and this video :) no self promotion I promise! ;)
Thanks, will check it out!
EPYC 9115 + 12x64GB-5600
@tacos4me I've been thinking of getting an Epyc 9124 (+motherboard, + ram) to replace my painfully slow 2 channel memory motherboard.
I'm a newb when it comes to server-grade hardware. Would you or anyone else have some tips?
Do you have a single CPU or dual? What kind of motherboard do you have?
Thanks!
The config of mine isn't even the greatest for straight CPU inference. I have builds on both a GENOAD8UD-2T/X550 and an MBD-H13SSL-N. You want to populate as many memory channels as you can, with the quickest memory you can. In a nutshell.
Be mindful of the CCD count on the CPUs you're looking at. If you're going to all the trouble of a dual-socket system, you really want to be using a chip with 8+ CCDs for best results, I believe. Which drives cost up even further. A 9175F/9455 perhaps, but you're talking 2-3k per CPU if you want all the memory bandwidth.
Thanks mate, this is awesome. The model has more world knowledge than the previous one (it knows who some random people I know IRL are), and this quant is better than the UD-2xxs.
I agree with the guy who said it's a "drop in replacement" as well. That's another thing I like about your quants: they're about the same size within each family of models, e.g. I can swap out the different DeepSeeks without messing around with the -ot lines.
Also, I reckon they're training a reasoning model soon, because when I gave it a complex bug in a neural codec trainer, it started adding "but wait, ..." and back-tracking in its reply.
It also knows how to follow and close a reasoning chain in /completion if I prefix the reply with "Okay, the user", without being told to do so.
Nice! Yeah, your table tells the story of how much bigger the routed experts are than everything else in these big MoEs! It is a bit trickier to calculate the quantized size of the "active experts", which does affect TG (token generation) speed.
Yeah, as @tacos4me says, memory bandwidth is essentially the name of the game for getting more tok/sec generation speed with these big MoEs. What does your current 2-channel system get with mlc (Intel Memory Latency Checker) or similar (e.g. AIDA64 on Windows)? You might be able to squeeze some more out of it by tuning RAM without spending cash (though spending time). About the most you'll get with a 2-channel AM5 "gamer rig" system is likely 90GB/s with 2 dimms, and if you're lucky maybe 75GB/s with 4x dimms, I'm guessing.
Once you go to larger EPYC or Xeon systems, though, you'll have to consider NUMA stuff, and memory not local to the CCDs incurs a penalty - especially so for cross-socket dual-CPU systems. It's not a well-solved problem with llama.cpp flavored inference engines. sglang had a recent paper about making this somewhat better for specific quantization dtypes, e.g. int8 on newer Sapphire Rapids AMX Intel Xeon systems.
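If you want to measure that yourself, something along these lines works on Linux (mlc here is Intel's Memory Latency Checker binary, and option names may vary a bit by version):
# rough peak memory bandwidth
sudo ./mlc --max_bandwidth
# NUMA topology: node count and which CPUs/memory belong to each node
numactl --hardware
lscpu | grep -i numa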
Wendell of level1techs (who is sponsoring the hardware I'm using to cook these quants) has a recent video about an interesting combination of single socket EPYC 9575f and 12-ch RAM https://youtu.be/bOxAdRfTpJg?t=68 given that CPU has two GMI links per chiplet so it might be able to get the most memory bandwidth in a single NUMA node (NPS1). Though I have heard some reports from folks on ai beavers discord they were getting faster on NPS4 but I'm not sure how/why exactly or the details.
Anyway, have fun building a rig! The price jump from AM5-class gamer rigs to EPYC server rigs is pretty big though, and getting 12x sticks of DDR5-6000+ is pricey!
Thank you @ubergarm , @tacos4me and @mtcl for all your replies.
I've got plenty to research and learn about from what you've mentioned. The number of CCDs, dual CPUs, and GMI links are all things I wasn't aware of.
If I were to just look at the spec sheets, I might have made an expensive mistake!
Given my budget probably doesn't extend far enough to reach a 9005 (Turin) with 8 CCDs, I might look for a 9004 (Genoa) such as a 9554...
FWIW, I did some perplexity runs with reduced experts.
Baseline: Final estimate: PPL = 3.2488 +/- 0.01723
ser 7,1 : Final estimate: PPL = 3.2455 +/- 0.01695
ser 6,1 : Final estimate: PPL = 3.2829 +/- 0.01694
ser 5,1 : Final estimate: PPL = 3.4187 +/- 0.01753
Running at -ser 7,1 improves TG on my rig by ~20%. YMMV!
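For anyone who wants to reproduce this, the runs look roughly like the following; the test file and extra flags shown are just an example setup, not an exact record of my runs:
# baseline, all experts active
./build/bin/llama-perplexity -m Kimi-K2-Instruct-0905-IQ2_KS-00001-of-00007.gguf -f wiki.test.raw -fa -fmoe -mla 3
# smart expert reduction, e.g. -ser 7,1
./build/bin/llama-perplexity -m Kimi-K2-Instruct-0905-IQ2_KS-00001-of-00007.gguf -f wiki.test.raw -fa -fmoe -mla 3 -ser 7,1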