IQ2_KS
Thanks for doing these, I'm looking forward to trying this model!
Are you doing an IQ2_KS for this one?
(I'm using your IQ2_KS for the previous release with 256GB RAM + 6x24GB VRAM)
Is IQ2_ks good enough for you in terms of quality?
I'll raise my hand for IQ2_KS
as well. :-)
Is IQ2_ks good enough for you in terms of quality?
I don't know yet for this one, but for K2, yes. Specifically Ubergarm's IQ2_ks is the only way I can run it locally without it being obviously lobotomized.
That quant/model is able to find logic issues in my fairly bespoke coding projects that Opus 4.1 misses and it's my favorite model for creative writing.
I just tried out the unsloth IQ2_XXS, regenerating the last response in my K2 chats, and it's a lot worse: it misses bugs K2 found, is inattentive for creative writing, etc. It also uses more memory / I have to place more tensors on CPU.
Hopefully an IQ2_KS will be as great as the K2 one.
Dealing with some hardware stuff, but got the imatrix uploaded, I'll prioritize cooking the IQ2_KS first and then do some other sizes.
Thanks and appreciate the feedback!
Also heads up @Thireus - the new imatrix is up as you saw already, but while using it now I notice it is missing importance weights for the first ffn_(gate|down|up) dense layer (blk 0 only on Kimi-K2) as well as the shared expert ffn_(gate|down|up)_shexp. I'll be leaving those all full q8_0 for this round given that, and probably leave the attn all q8_0 as well given it is a small percentage of overall weights more or less and the original seemed quite sensitive to quantization there.
example messages during quantizing:
====== llama_model_quantize_internal: did not find weights for blk.0.ffn_gate.weight
...
====== llama_model_quantize_internal: did not find weights for blk.56.ffn_up_shexp.weight
Seems to have everything it needs for the routed exps which are the most important given we're quantizing those the most.
Also I was unable to run imatrix with --layer-importance as it gave this error:
llama_kv_cache_init: CPU KV buffer size = 34.31 MiB
llama_new_context_with_model: KV self size = 34.31 MiB, c^KV (f16): 34.31 MiB, kv^T: not used
llama_new_context_with_model: CPU output buffer size = 0.63 MiB
llama_new_context_with_model: CPU compute buffer size = 334.00 MiB
llama_new_context_with_model: graph nodes = 3340
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 192 (n_threads_batch = 384) / 768 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 551.937 ms
compute_imatrix: computing over 826 chunks with batch_size 512
================= Adjusted mainline llama.cpp MLA tensors to ik_llama.cpp
======================================= HAVE_FANCY_SIMD is defined
Oops, inconsistent ffn vs last_input size
This Oops may be related to the missing importance weights above, but I didn't have time to try to debug further.
FWIW I used the triton-cpu method to cast the fp8 safetensors to bf16. Then I used mainline llama.cpp convert_hf_to_gguf.py, and then switched over to ik_llama.cpp for quantizing the pure q8_0, computing the imatrix from it, then quantizing the rest from the bf16 gguf.
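Roughly, that pipeline looks something like the commands below. This is only a sketch: the paths, calibration text file, and output filenames are placeholders, and flag spellings can drift between llama.cpp and ik_llama.cpp versions, so double-check against --help on your build.
# mainline llama.cpp: convert the bf16 safetensors to a bf16 GGUF
python convert_hf_to_gguf.py /models/Kimi-K2-Instruct-0905-bf16 --outtype bf16 --outfile /models/Kimi-K2-Instruct-0905-BF16.gguf
# ik_llama.cpp: make a pure q8_0 to run the imatrix on, then compute the imatrix
./build/bin/llama-quantize --pure /models/Kimi-K2-Instruct-0905-BF16.gguf /models/Kimi-K2-Instruct-0905-Q8_0.gguf Q8_0
./build/bin/llama-imatrix -m /models/Kimi-K2-Instruct-0905-Q8_0.gguf -f calibration.txt -o imatrix.dat
# finally, quantize the release mixes from the bf16 GGUF with that imatrix
./build/bin/llama-quantize --imatrix imatrix.dat /models/Kimi-K2-Instruct-0905-BF16.gguf /models/Kimi-K2-Instruct-0905-IQ2_KS.gguf IQ2_KS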
Okie folks, first one is uploaded: IQ2_KS 289.820 GiB (2.425 BPW) !!!
It is a bit heavy for VRAM given all the attn/first dense layer/shared expert is full Q8_0 but will give best quality despite smaller routed exps. I'll have a few sizes up available by later today if all goes well.
Cheers!
Okie folks, first one is uploaded: IQ2_KS 289.820 GiB (2.425 BPW) !!!
It is a bit heavy for VRAM given all the attn/first dense layer/shared expert is full Q8_0 but will give best quality despite smaller routed exps. I'll have a few sizes up available by later today if all goes well.
Cheers!
Awesome! Thank you! Do you know if --jinja (tool calling) will work for this? Is there a sample startup command you could kindly share?
I recall there were some differences in which of the first few layers get offloaded to GPU with the -ot parameter between Qwen and DS. Does this one follow DS or Qwen in terms of offloading the first few layers?
Awesome! Thank you! Do you know if --jinja (tool calling) will work for this? Is there a sample startup command you could kindly share?
It should work, and you can probably also pass in the official Kimi jinja file (or any edited or updated custom one you find or have) if you want: https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905/blob/main/chat_template.jinja
Read this PR for details on which endpoints support what kind of completions and tool calling: https://github.com/ikawrakow/ik_llama.cpp/pull/723 and be mindful of whether you're using /v1/completions or /completions etc. as they seem to have different behavior, I suppose. Still need to figure out the exact details myself, sorry no easy examples for you.
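If it helps, here is a rough sketch of what that could look like, assuming the ik_llama.cpp server accepts the mainline-style --jinja and --chat-template-file flags and exposes the OpenAI-compatible /v1/chat/completions endpoint (model path, template path, and port here are placeholders):
./build/bin/llama-server -m Kimi-K2-Instruct-0905-IQ2_KS-00001-of-00007.gguf \
  --jinja --chat-template-file chat_template.jinja \
  --host 0.0.0.0 --port 8080
# then hit the OpenAI-compatible chat endpoint
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello"}]}'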
I recall there were some differences in which of the first few layers get offloaded to GPU with the -ot parameter between Qwen and DS. Does this one follow DS or Qwen in terms of offloading the first few layers?
Yes this is the basic idea so far:
- DeepSeek has 3 dense layers and 1 shared expert
- Kimi-K2 has 1 dense layer and 1 shared expert
- Qwen has no dense layers and no shared expert
So tl;dr, for Kimi-K2 start with blk 1 for the -ot ... e.g. refer to the model card quick start examples for a two-GPU example to get you started dialing it in for your rig.
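As an illustrative single-GPU fragment (the layer range is just an example, not a recommendation): blk.0, the dense layer, rides along with -ngl, while the routed experts from blk.1 onward get placed explicitly:
-ngl 99 \
-ot "blk\.(1|2|3|4)\.ffn_.*=CUDA0" \
-ot exps=CPU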
Thank you so much @ubergarm !
I was remotely able to queue the download on my AI rig while on a plane returning from a work trip. It should be ready by the time I land. Can't wait to go and play with it once I reach home!
It is only 350 gigs, currently I have 512GB of RAM and 5x5090s. I will try to offload as much as I can and see how it goes.
My fav model has been Qwen3-235B thinking with 256K that perfectly fits on my GPUs in IQ4_XS format up until now, especially because of the speed and quality that I get from it. But the difference between 1T and 235B is huge. Let me see if I can sacrifice speed for quality :)
currently I have 512GB of RAM and 5x5090s
Oh interesting, you kept the 5090s and sold the 6000 Pros then? I am uploading smol-IQ4_KSS now and cooking one size bigger, smol-IQ5_KS, to release later today so you can try whatever you like. The perplexity graph will be filled in slowly, and the baseline Q8_0 perplexity will take some time to calculate, probably running overnight.
Definitely keep us posted on your speed vs accuracy experience playing with the new models and if you get a good command for tool calling going (and possibly draft model as well hah)
I indeed did. That was a very careful decision after a lot of thinking. With PCIe 5 and an Intel Xeon Sapphire Rapids I was not worried about memory bandwidth. And for the price of one 6000 Pro, I was able to squeeze in 5x 5090s. And yes, I got a nice deal on my 5090s and a good price selling my 6000 Pros for this to work :)
I will try some other models too! Disk space is slowly becoming an issue now though! lol, two 4TB SSDs and one 2TB SSD are almost full. And it is so heartbreaking to delete old models!
@ubergarm
Hi John
Great work as always!
It is a bit heavy for VRAM given all the attn/first dense layer/shared expert is full Q8_0 but will give best quality despite smaller routed exps.
For my education, why are you giving it full Q8_0 instead of e.g. Q6_K on a IQ2_KS? Is it something specific with Kimi K2, or is it something you generally do?
I got it running and I am getting pretty decent speeds on the IQ2_KS: 12-14 tk/s. Below is the server startup command I used.
CUDA_VISIBLE_DEVICES="0,1,2,3,4" ./build/bin/llama-server \
--model /media/mukul/t7/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/IQ2_KS/Kimi-K2-Instruct-0905-IQ2_KS-00001-of-00007.gguf \
--alias ubergarm/IQ2_KS/Kimi-K2-Instruct-0905 \
--ctx-size 65536 \
-ctk q8_0 \
-fa -fmoe \
-mla 3 \
-b 4096 -ub 4096 \
-ngl 99 \
-ot "blk\.([1-3])\.ffn=CUDA0" \
-ot "blk\.([4-6])\.ffn=CUDA1" \
-ot "blk\.([7-9])\.ffn=CUDA2" \
-ot "blk\.(1[0-3])\.ffn=CUDA3" \
-ot "blk\.(1[4-6])\.ffn=CUDA4" \
-ot exps=CPU \
--parallel 1 \
--threads 56 \
--threads-batch 64 \
--host 0.0.0.0 \
--port 10002
Below is the VRAM consumption:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.76.05 Driver Version: 580.76.05 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 5090 Off | 00000000:16:00.0 Off | N/A |
| 33% 54C P1 84W / 400W | 26502MiB / 32607MiB | 4% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 5090 Off | 00000000:40:00.0 Off | N/A |
| 0% 43C P1 83W / 400W | 25502MiB / 32607MiB | 4% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA GeForce RTX 5090 Off | 00000000:6A:00.0 On | N/A |
| 0% 44C P1 96W / 400W | 26538MiB / 32607MiB | 5% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA GeForce RTX 5090 Off | 00000000:94:00.0 Off | N/A |
| 30% 47C P1 80W / 400W | 30532MiB / 32607MiB | 7% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA GeForce RTX 5090 Off | 00000000:BF:00.0 Off | N/A |
| 0% 48C P1 87W / 400W | 26444MiB / 32607MiB | 5% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
Working in Roo really well. 2x RTX 6000 setup = ~25t/s
Working in Roo really well. 2x RTX 6000 setup = ~25t/s
@tacos4me That is quite impressive! What is the startup command you used, and what is the rest of your machine configuration, please?
EPYC 9115 + 12x64GB-5600
numactl -N 0 -m 0
./build/bin/llama-server
--model "$model"
--alias ubergarm/Kimi-K2-Instruct-0905
--ctx-size 98304
-ctk q8_0
-fa -fmoe
-mla 3
-ngl 99
-ot "blk.(1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16).ffn_.=CUDA0"
-ot "blk.(19|20|21|22|23|24|25|26|27|28|29|30|31|32|33|34).ffn_.=CUDA1"
-ot exps=CPU
--parallel 1
--threads 24
--threads-batch 48
--numa numactl
--host 0.0.0.0
--port 8080
Hmm, interesting that you are able to offload 34 layers on 196 GB of VRAM and I am only able to offload 16 layers on 160 GB of VRAM, what magic did you do there! Please advise!
EDIT: I noticed that you did not offload layers 17-18, so technically you offloaded 32 layers. But that is still quite a lot!
Try building ik_llama with these?
-DGGML_CUDA=ON
-DGGML_BLAS=OFF
-DCMAKE_CUDA_ARCHITECTURES="120"
-DGGML_MAX_CONTEXTS=2048
-DGGML_AVX=ON
-DGGML_SCHED_MAX_COPIES=1
-DGGML_AVX2=ON
Probably -DGGML_SCHED_MAX_COPIES=1 is the one that matters most.
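Putting those flags together, a full configure/build looks something like this (CUDA architecture 120 assumes Blackwell cards like the 5090s, so adjust it for your GPUs):
cmake -B build -DGGML_CUDA=ON -DGGML_BLAS=OFF -DCMAKE_CUDA_ARCHITECTURES="120" \
  -DGGML_MAX_CONTEXTS=2048 -DGGML_AVX=ON -DGGML_AVX2=ON -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j $(nproc)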
IQ2_KS was a drop in replacement for me, thanks for the quick quantization!
3x3090, 1x4090, 256gb DDR5 Intel Sapphire Rapids QYFS
16t/s, standard // 21t/s with -ser 4,1!
also, for kimi, I've found I get better PP without the "-b 4096 -ub 4096" flags, which is strange because those usually boost my PP by a ton on other models... kimi has always had rather slow PP for me, right now it's around 75t/s
Time to put this model through its paces -
Yay, glad to see it is running for folks! Having a network issue right now, once I can get into the remote rig again will try to finish up a few more sizes for now then consider any "specialty" quants.
For my education, why are you giving it full Q8_0 instead of e.g. Q6_K on a IQ2_KS? Is it something specific with Kimi K2, or is it something you generally do?
Good question. It's all trade-offs. Here are a few things going through my mind when making these decisions:
- The routed experts are the bulk of the model size and the most important to quantize to reduce overall size. The attn/first dense layer/shexp are relatively small in terms of overall model size, so it's okay to leave them at the larger q8_0.
- Kimi-K2 is about the biggest open weights model, so people who have enough RAM+VRAM to use it probably will have at least 24GB. This is not always true, but given this using full Q8_0 for attn/first dense layer/shexp is fine.
- There will be some penalty to TG given for the most part TG is memory bandwidth limited and keeping attn/first dense layer/shexp at full q8_0 does increase the overall size of the active parameters during TG. But given they are offloaded onto GPU VRAM anyway it probably won't bottleneck as bad as the system RAM with slower memory bandwidth.
- For whatever reason I've yet to go back and look into the imatrix issue and why there doesn't seem to be data for the first dense layer and shexp. Because of this in this specific case I'm loathe to quantize those tensors given they won't have imatrix data. This is an unusual circumstance just in this one case and not a general issue.
- Assuming I did have imatrix data for those, I typically like to go iq5_ks for those, or sometimes iq6_k etc.
- As for attention tensors, generally we're looking at (q|k|v|o), and (k|v) are very small so there's not a ton of benefit from shrinking those anyway. The usual advice is to keep (k|v) one step up from (q|o), so I could consider going with something like iq6_k for attn_output, especially as it is fairly large relative to the others. This is an MLA quant though, so it's not quite as simple as a dense model with "normal" (q|k|v|o) attention tensors. So it's easier to keep them q8_0 for full quality and minimize perplexity without a huge TG speed trade-off in many cases.
- In the past on the earlier Kimi-K2 I had quantized some of these tensors more, but then compared and felt like this model is kind of sensitive as it has only one first dense layer and a single shared expert and decided to go in more on the quality side of the trade-off vs the speed side.
I forget who it was who had only 12 or 16GB VRAM but a bunch of RAM, but for a system like that with a single 3060TI 16GB or 5080TI etc they might benefit from some quantizing here just to fit more kv-cache. So if there is a request for a "speed mix" or similar I could try to quantize those tensors a little bit to go towards the speed side of the trade-off at the cost of quality.
Finally, Bartowski, mradermacher, and unsloth generally use more "standard" blends built into llama.cpp, so they cover a lot of those quantization recipes already. Given I only work with --custom-q style quantization and recipes, I try to specialize in max quality for a given memory footprint.
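For a feel of what a --custom-q style recipe can look like, here is a hedged sketch; the regexes and type choices below are only an illustration of the idea, not the actual recipe used for this release:
./build/bin/llama-quantize --imatrix imatrix.dat \
  --custom-q "attn=q8_0,blk\.0\.ffn_(gate|down|up)=q8_0,ffn_(gate|down|up)_shexp=q8_0,ffn_(gate|down|up)_exps=iq2_ks" \
  Kimi-K2-Instruct-0905-BF16.gguf Kimi-K2-Instruct-0905-IQ2_KS.gguf IQ2_KS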
Probably -DGGML_SCHED_MAX_COPIES=1
Interestingly ik made that the default just this past week: https://github.com/ikawrakow/ik_llama.cpp/pull/751
Definitely a must for MLA models running on multi-GPU configurations to avoid OOMing with huge CUDA buffers. It's fine to leave it explicit though!
IQ2_KS was a drop in replacement for me, thanks for the quick quantization!
3x3090, 1x4090, 256gb DDR5 Intel Sapphire Rapids QYFS
16t/s, standard // 21t/s with -ser 4,1!
also, for kimi, I've found I get better PP without the "-b 4096 -ub 4096" flags, which is strange because those usually boost my PP by a ton on other models... kimi has always had rather slow PP for me, right now it's around 75t/s
Time to put this model through its paces -
Lol now this is bothering me!
Why am I unable to get such speeds :)
I have QYFS myself. And 5x5090
512GB 4800 MHz DDR5.
Would you please be able to give me your build and start-up command?
I'll be honest, I haven't rebuilt my ik_llama in weeks; here's my startup command... maybe drop the "-b 4096 -ub 4096", lower your ctx to 32k, or keep 64k ctx and quantize it to q4_0, and shove more layers onto your GPUs. You're only getting 4 more layers than me... also, are all of your mobo RAM slots full? 8-channel mobo?
/home/phone/Documents/ik_llama.cpp/build/bin/llama-server
--model /home/phone/Downloads/LocalModels/Kimi-K2-Instruct-0905-IQ2_KS-00001-of-00007.gguf
--alias ubergarm/Kimi-K2-Instruct-0905-IQ2_KS
--ctx-size 32768
-ctk q8_0
-mla 3 -fa -fmoe
-ngl 99
-ot "blk.(1|2|3).ffn_.=CUDA0"
-ot "blk.(4|5|6).ffn_.=CUDA1"
-ot "blk.(7|8|9).ffn_.=CUDA2"
-ot "blk.(10|11|12).ffn_.=CUDA3"
-ot exps=CPU
--parallel 1
--threads 48
--threads-batch 56
--host 0.0.0.0
--port 8081
--no-mmap
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.76.05 Driver Version: 580.76.05 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 Off | 00000000:16:00.0 On | Off |
| 0% 36C P8 21W / 550W | 21571MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 Off | 00000000:40:00.0 Off | N/A |
| 0% 51C P8 19W / 350W | 21044MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA GeForce RTX 3090 Off | 00000000:6A:00.0 Off | N/A |
| 0% 48C P8 8W / 350W | 21044MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA GeForce RTX 3090 Off | 00000000:94:00.0 Off | N/A |
| 40% 37C P8 25W / 350W | 21598MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
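For what it's worth, adapting that advice to a 5x 32GB 5090 setup might look roughly like the fragment below; the layer counts per GPU are only a guess and would need tuning against actual VRAM headroom:
--ctx-size 65536 \
-ctk q4_0 \
-ot "blk\.(1|2|3|4)\.ffn_.*=CUDA0" \
-ot "blk\.(5|6|7|8)\.ffn_.*=CUDA1" \
-ot "blk\.(9|10|11|12)\.ffn_.*=CUDA2" \
-ot "blk\.(13|14|15|16)\.ffn_.*=CUDA3" \
-ot "blk\.(17|18|19|20)\.ffn_.*=CUDA4" \
-ot exps=CPU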
Given I only work with --custom-q style quantization and recipes I try to specialize in max quality for a given memory footprint.
That is why I asked you. I'm using this for translations, so for me quality is also more of a concern than speed.
Here are a few things going through my mind when making these decisions ...
felt like this model is kind of sensitive as it has only one first dense layer and a single shared expert and decided to go in more on the quality side of the trade-off vs the speed side
This makes so much sense. Thank you!
EPYC 9115 + 12x64GB-5600
@tacos4me I've been thinking of getting an Epyc 9124 (+motherboard, + ram) to replace my painfully slow 2 channel memory motherboard.
I'm a newb when it comes to server-grade hardware. Would you or anyone else have some tips?
Do you have a single CPU or dual? What kind of motherboard do you have?
Thanks!
EPYC 9115 + 12x64GB-5600
@tacos4me I've been thinking of getting an Epyc 9124 (+motherboard, + ram) to replace my painfully slow 2 channel memory motherboard.
I'm a newb when it comes to server-grade hardware. Would you or anyone else have some tips?
Thanks!
Take a look at this channel and this video :) no self promotion I promise! ;)
https://www.youtube.com/watch?v=Xui3_bA26LE&list=PLteHam9e1Fecmd4hNAm7fOEPa4Su0YSIL&index=8
Take a look at this channel and this video :) no self promotion I promise! ;)
Thanks, will check it out!
EPYC 9115 + 12x64GB-5600
@tacos4me I've been thinking of getting an Epyc 9124 (+motherboard, + ram) to replace my painfully slow 2 channel memory motherboard.
I'm a newb when it comes to server-grade hardware. Would you or anyone else have some tips?
Do you have a single CPU or dual? What kind of motherboard do you have?
Thanks!
The config of mine isn't even the greatest for straight CPU inference. I have builds on both a GENOAD8UD-2T/X550 and an MBD-H13SSL-N. You want to populate as many memory channels as you can, with the quickest memory you can. In a nutshell.
Be mindful of the CCD count on the CPUs you're looking at. If you're going to all the trouble of a dual-socket system, you really want to be using a chip with 8+ CCDs for best results, I believe. Which drives cost up even further. A 9175F/9455 perhaps, but you're talking 2-3k per CPU if you want all the memory bandwidth.
Thanks mate, this is awesome. The model has more world knowledge than the previous one (it knows who some random people I know IRL are), and this quant is better than the UD-2xxs.
I agree with the guy who said it's a "drop in replacement" as well. That's another thing I like about your quants: they're about the same size within each family of models, e.g. I can swap out the different DeepSeeks without messing around with the -ot lines.
Also, I reckon they're training a reasoning model soon, because when I gave it a complex bug in a neural codec trainer, it started adding "but wait, ..." and back-tracking in its reply.
It also knows how to follow and close a reasoning chain in /completion if I prefix the reply with "Okay, the user", without being told to do so.
Nice! Yeah, your table tells the story of how much bigger the routed experts are than everything else in these big MoEs! It is a bit trickier to calculate the quantized size of the "active experts", which does affect TG (token generation) speed.
Yeah, as @tacos4me says, memory bandwidth is essentially the name of the game for getting more tok/sec generation speed with these big MoEs. What does your current 2-channel system get with mlc (Intel Memory Latency Checker) or similar (e.g. AIDA64 on Windows)? You might be able to squeeze some more out of it by tuning RAM without spending cash (though spending time). About the most you'll get with a 2-channel AM5 "gamer rig" system is likely 90GB/s with 2 dimms, and if you're lucky maybe 75GB/s with 4x dimms, I'm guessing.
Once you go to larger EPYC or Xeon systems, though, you'll have to consider NUMA stuff, and memory not local to the CCDs incurs a penalty - especially so for cross-socket dual-CPU systems. It's not a well-solved problem with llama.cpp flavored inference engines. sglang had a recent paper about making this somewhat better for specific quantization dtypes, e.g. int8 on newer Sapphire Rapids AMX Intel Xeon systems.
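If you want to measure that yourself, something along these lines works on Linux (mlc here is Intel's Memory Latency Checker binary, and option names may vary a bit by version):
# rough peak memory bandwidth
sudo ./mlc --max_bandwidth
# NUMA topology: node count and which CPUs/memory belong to each node
numactl --hardware
lscpu | grep -i numa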
Wendell of level1techs (who is sponsoring the hardware I'm using to cook these quants) has a recent video about an interesting combination of single socket EPYC 9575f and 12-ch RAM https://youtu.be/bOxAdRfTpJg?t=68 given that CPU has two GMI links per chiplet so it might be able to get the most memory bandwidth in a single NUMA node (NPS1). Though I have heard some reports from folks on ai beavers discord they were getting faster on NPS4 but I'm not sure how/why exactly or the details.
Anyway, have fun building a rig! The price jump from AM5-class gamer rigs to EPYC server rigs is pretty big though, and getting 12x sticks of DDR5-6000+ is pricey!
Thank you @ubergarm , @tacos4me and @mtcl for all your replies.
I've got plenty to research and learn about from what you've mentioned. The number of CCDs, dual CPUs, and GMI links are all things I wasn't aware of.
If I were to just look at the spec sheets, I might have made an expensive mistake!
Given my budget probably doesn't extend far enough to reach a 9005 (Turin) with 8 CCDs, I might look for a 9004 (Genoa) such as a 9554...
FWIW, I did some perplexity runs with reduced experts.
Baseline: Final estimate: PPL = 3.2488 +/- 0.01723
ser 7,1 : Final estimate: PPL = 3.2455 +/- 0.01695
ser 6,1 : Final estimate: PPL = 3.2829 +/- 0.01694
ser 5,1 : Final estimate: PPL = 3.4187 +/- 0.01753
Running at -ser 7,1 improves TG on my rig by ~20%. YMMV!
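For anyone who wants to reproduce this, the runs look roughly like the following; the test file and extra flags shown are just an example setup, not an exact record of my runs:
# baseline, all experts active
./build/bin/llama-perplexity -m Kimi-K2-Instruct-0905-IQ2_KS-00001-of-00007.gguf -f wiki.test.raw -fa -fmoe -mla 3
# smart expert reduction, e.g. -ser 7,1
./build/bin/llama-perplexity -m Kimi-K2-Instruct-0905-IQ2_KS-00001-of-00007.gguf -f wiki.test.raw -fa -fmoe -mla 3 -ser 7,1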