IQ3_KS metrics on mixed CUDA + CPU, pretty good model!

by Panchovix - opened

Amazing work as always! The model and quantization are both very good.

Running on:

Fedora 41
Ryzen 7 7800X3D
192GB RAM
RTX 5090x2 (X8/X8 PCIe 5.0)
RTX 4090x2 (X4/X4 PCIe 4.0)
RTX 3090x2 (X4/X4 PCIe 4.0)
RTX A6000 (X4 PCIe 4.0)

Running with

./llama-server -m '/models_llm/DeepSeek-TNG-R1T2-Chimera-IQ3_KS-00001-of-00007.gguf' -c 16384 --no-mmap -ngl 999 \
-ot "blk.(0|1|2|3|4|5|6|7).ffn.=CUDA0" \
-ot "blk.(8|9|10|11).ffn.=CUDA1" \
-ot "blk.(12|13|14|15).ffn.=CUDA2" \
-ot "blk.(16|17|18|19|20).ffn.=CUDA3" \
-ot "blk.(21|22|23).ffn.=CUDA4" \
-ot "blk.(24|25|26).ffn.=CUDA5" \
-ot "blk.(27|28|29|30|31|32|33).ffn.=CUDA6" \
-ot "ffn.*=CPU" \
-fa -mg 0 -ub 2048 -mla 1

It handles 32k and 64k ctx without issues, but to save time I ran this test at 16K only.

Got

llm_load_tensors:        CPU buffer size = 129603.59 MiB
llm_load_tensors:  CUDA_Host buffer size =   607.58 MiB
llm_load_tensors:      CUDA0 buffer size = 25932.96 MiB
llm_load_tensors:      CUDA1 buffer size = 20097.95 MiB
llm_load_tensors:      CUDA2 buffer size = 20097.95 MiB
llm_load_tensors:      CUDA3 buffer size = 25154.49 MiB
llm_load_tensors:      CUDA4 buffer size = 15426.02 MiB
llm_load_tensors:      CUDA5 buffer size = 15297.82 MiB
llm_load_tensors:      CUDA6 buffer size = 35999.45 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 16384
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 2048
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn   = 1
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe  = 0
llama_new_context_with_model: ser        = -1, 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init:      CUDA0 KV buffer size =   180.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   126.00 MiB
llama_kv_cache_init:      CUDA2 KV buffer size =   126.00 MiB
llama_kv_cache_init:      CUDA3 KV buffer size =   162.00 MiB
llama_kv_cache_init:      CUDA4 KV buffer size =   144.00 MiB
llama_kv_cache_init:      CUDA5 KV buffer size =   126.00 MiB
llama_kv_cache_init:      CUDA6 KV buffer size =   234.00 MiB
llama_new_context_with_model: KV self size  = 1098.00 MiB, c^KV (f16): 1098.00 MiB, kv^T: not used
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
llama_new_context_with_model:      CUDA0 compute buffer size =  2735.01 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =  1524.01 MiB
llama_new_context_with_model:      CUDA2 compute buffer size =  1476.01 MiB
llama_new_context_with_model:      CUDA3 compute buffer size =  1476.01 MiB
llama_new_context_with_model:      CUDA4 compute buffer size =  1476.01 MiB
llama_new_context_with_model:      CUDA5 compute buffer size =  1476.01 MiB
llama_new_context_with_model:      CUDA6 compute buffer size =  1476.02 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   184.02 MiB
llama_new_context_with_model: graph nodes  = 3542
llama_new_context_with_model: graph splits = 367

main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, n_gpu_layers = 999, n_threads = 8, n_threads_batch = 8

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  2048 |    512 |      0 |   10.472 |   195.56 |   61.475 |     8.33 |
|  2048 |    512 |   2048 |   10.906 |   187.79 |   61.989 |     8.26 |
|  2048 |    512 |   4096 |   11.650 |   175.80 |   62.586 |     8.18 |
|  2048 |    512 |   6144 |   12.380 |   165.43 |   63.525 |     8.06 |
|  2048 |    512 |   8192 |   13.099 |   156.35 |   64.193 |     7.98 |
|  2048 |    512 |  10240 |   13.842 |   147.95 |   64.489 |     7.94 |
|  2048 |    512 |  12288 |   14.564 |   140.62 |   65.574 |     7.81 |
|  2048 |    512 |  14336 |   15.314 |   133.73 |   65.829 |     7.78 |

The model itself is pretty good, and a good amount faster than other IQ3 quants.

@Panchovix

Great feedback! I got back to you over on level1techs forum and DM'd you there too! Thanks!

A few thoughts and questions to help optimize your command while I'm here. Assuming the following mapping:

  • CUDA0 RTX 5090 (X8 PCIe 5.0)
  • CUDA1 RTX 5090 (X8 PCIe 5.0)
  • CUDA2 RTX 4090 (X4 PCIe 4.0)
  • CUDA3 RTX 4090 (X4 PCIe 4.0)
  • CUDA4 RTX 3090 (X4 PCIe 4.0)
  • CUDA5 RTX 3090 (X4 PCIe 4.0)
  • CUDA6 RTX A6000 (X4 PCIe 4.0)
./llama-server \
-m '/models_llm/DeepSeek-TNG-R1T2-Chimera-IQ3_KS-00001-of-00007.gguf' \
-c 16384 \ # <--- i usually run 40k for basic 1shot/2shot coding tasks etc
--no-mmap \ # <--- you can also try -rtr (for non _r4 quants), but recently non _r4 quants are faster with large batch sizes i think
-ngl 999 \
-ot "blk.(0|1|2|3|4|5|6|7).ffn.=CUDA0" \ # <--- 0-2 are *different* ffn layers than exps, also shared exps are different so this is conflating things
-ot "blk.(8|9|10|11).ffn.=CUDA1" \
-ot "blk.(12|13|14|15).ffn.=CUDA2" \
-ot "blk.(16|17|18|19|20).ffn.=CUDA3" \
-ot "blk.(21|22|23).ffn.=CUDA4" \
-ot "blk.(24|25|26).ffn.=CUDA5" \
-ot "blk.(27|28|29|30|31|32|33).ffn.=CUDA6" \
-ot "ffn.*=CPU" \ # <--- again you want the `ffn_(down|gate|up)_exps` on CPU but *not* the other ffn stuff like ffn_(down|gate|up)_shexp etc
-fa \
-mg 0 \ 
-ub 2048 \
-mla 1 # <--- i always use 3 these days, 1 is the oldest.. 2 used to be for CUDA only but since 3 works now I go with it
# also no threads? i think u have 8 physical cores

I'll provide a new one to try below in a minute

Maybe give this a try. The goal is to put all of the attn, the first 3 dense ffn layers, and the single shared expert together on the single fastest GPU, then put only routed experts on the other GPUs, and finally any remaining exps on CPU/RAM. I won't be able to guess the sizes accurately enough to prevent OOM, but maybe this will be enough to guide you.

Also, do you know how to keep kv-cache on a specific device? I wonder if that would help, assuming it is even possible.

./llama-server \
-m '/models_llm/DeepSeek-TNG-R1T2-Chimera-IQ3_KS-00001-of-00007.gguf' \
-c 16384 \
--no-mmap \
-ngl 999 \
-ot "blk\.(0|1|2)\.ffn.*=CUDA0" \ 
-ot "attn.*=CUDA0" \
-ot shexp=CUDA0 \
-ot "blk\.(3|4|5|6)\.ffn.*=CUDA0" \
-ot "blk\.(7|8|9|10)\.ffn.*=CUDA1" \
-ot "blk\.(11|12|13|14)\.ffn.*=CUDA2" \
-ot "blk\.(15|16|17|18|19)\.ffn.*=CUDA3" \
-ot "blk\.(20|21|22)\.ffn.*=CUDA4" \
-ot "blk\.(23|24|25)\.ffn.*=CUDA5" \
-ot "blk\.(26|27|28|29|30|31|32)\.ffn.*=CUDA6" \
-ot exps=CPU \
-fa -fmoe -mla 3 -amb 256 \
-mg 0 \ 
-ub 2048 \
--threads 8

I realized you were not passing -fmoe, which is fused MoE and gives some speed-ups on all MoE models. Also, -amb 256 is about as small as you want to go and will usually free up a little more VRAM.

You might also be able to increase to -ub 4096 -b 4096 for even more PP, but then you probably can't offload quite as many layers.

Also, here is my compile command for DeepSeek, forcing BF16 since MLA needs that to prevent NaNs:

cmake -B build -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=1
cmake --build build --config Release -j $(nproc)

Okay, you've probably tried some of this stuff before, and I don't have an array of GPUs like yours to test with myself, but I'm curious to see whether putting all of attn/shexp/the first 3 dense layers onto a single GPU helps at all, so that the kv-cache might not have to weave across PCIe as much.

I will test those! For some info, my build command is

cmake -B build \
    -DGGML_CUDA=ON \
    -DGGML_CUDA_FA_ALL_QUANTS=ON \
    -DGGML_BLAS=OFF \
    -DCMAKE_CUDA_ARCHITECTURES="86;89;120" \
    -DGGML_IQK_FA_ALL_QUANTS=1 \
    -DGGML_SCHED_MAX_COPIES=1 \
    -DGGML_CUDA_IQK_FORCE_BF16=1

I use -mla 1 because with -mla 3 the compute buffers are quite a bit bigger, especially on CUDA0.

I have 8 physical cores yes (7800X3D)

I stopped passing -fmoe as it killed TG performance when doing things like

-ot "blk.37.ffn_(norm|gate_inp|gate_shexp|down_shexp|up_shexp).weight=CUDA1" \
-ot "blk.37.ffn_gate_exps.weight=CUDA1" \
-ot "blk.37.ffn_(down_exps|up_exps).weight=CUDA2" \

I mentioned it a bit on here https://github.com/ikawrakow/ik_llama.cpp/issues/521

My GPU order is almost like that

CUDA0 RTX 5090 (X8 PCIe 5.0)
CUDA1 RTX 4090 (X4 PCIe 4.0)
CUDA2 RTX 4090 (X4 PCIe 4.0)
CUDA3 RTX 5090 (X8 PCIe 5.0)
CUDA4 RTX 3090 (X4 PCIe 4.0)
CUDA5 RTX 3090 (X4 PCIe 4.0)
CUDA6 RTX A6000 (X4 PCIe 4.0)

Okay, testing with the second command, modified a little, the loading looks like this; it's on pastebin because it is more than 65536 chars haha.

https://pastebin.com/7Kx7jPhW

Seems -amb 256 reduced a lot of the buffer sizes, but I had to adjust for CUDA0.

I noticed the buffers don't increase much when actually generating, so I still have to tinker a bit.

With a quick llama-server test (because I forgot to use llama-sweep-bench) I got about 6.6 t/s.

I use -mla 1 because with -mla 3 the compute buffers are quite a bit bigger, especially on CUDA0.

I believe -amb 256 is there to limit the compute buffers from exploding in size when doing -mla 3. Here is a reference that I believe is related and will get you to the PR.

I stopped passing -fmoe as it killed TG performance when doing things like

Right, -fmoe expects ffn_gate_exps and ffn_up_exps to be on the same device. It is unusual to split those tensors across multiple devices as you were showing in your example. There's no need to over-complicate things doing that; I'm not sure where this idea comes from, but yes, I see you discussing it on 521. It's funny you decided to avoid -fmoe and keep splitting tensors like that; I would argue the opposite: -fmoe is an important feature of ik_llama.cpp imo, and I would strongly suggest using it and avoiding splitting the gate/up tensors.

I don't have a reference, but I vaguely recall ik questioning people who were quantizing ffn_(gate|up)_exps at different quantization levels, since those two generally go together. Here is where -fmoe is defined as fusing those two tensors.
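To make that concrete, here is a hedged sketch of the kind of override that keeps -fmoe happy, reusing block 37 and CUDA1 from the example above (block and device are arbitrary; the point is only that a block's exps stay together):

-ot "blk.37.ffn_(gate_exps|up_exps|down_exps).weight=CUDA1" \ # <--- gate/up (and here down too) live on one device, so the fused MoE path can be used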

Okay hope that makes sense!

With a quick llama-server test (because I forgot to use llama-sweep-bench) I got about 6.6 t/s.

Hrmm, okay, so maybe it's not worth fussing with attn/shexp/the first 3 dense ffn layers on a single CUDA device. I don't do it that way, but I only have 2x of the same GPU model, so I wasn't sure.

Here is an example command that I use. The important thing is to take advantage of -fa -mla 3 -fmoe -amb 256 (or 512, whatever) and not split the exps in a funny way. Start at block 3 like I do to avoid messing with those first three dense layers, and just let those plus attn and shexp split naturally without overriding. Here is a simple example taken from how I'm running on 2x RTX A6000s; I adjust the layers depending on which size model I'm using.

...
    -fa -mla 3 -fmoe -amb 512 \
    --ctx-size 32768 \
    -ctk q8_0 \
    -ngl 99 \
    -ot "blk\.(3|4|5|6|7|8)\.ffn_.*=CUDA0" \
    -ot "blk\.(9|10|11|12|13)\.ffn_.*=CUDA1" \
    -ot exps=CPU \
    --parallel 1 \
    --threads 24 \
    --host 127.0.0.1 \
    --port 8080

Anyway, I'll check in tomorrow to see how you're coming along. I think you can squeeze some more out of it, especially by increasing -ub 4096 -b 4096 for more PP. I'll keep playing too and see what more I can learn haha...

After adjusting some layers, this is how I got it to load:

./llama-sweep-bench \
-m '/run/media/pancho/DE1652041651DDD9/HuggingFaceModelDownloader/Storage/GGUFs/DeepSeek-TNG-R1T2-Chimera-IQ3_KS-merged.gguf' \
-c 16384 \
--no-mmap \
-ngl 999 \
-ot "blk\.(0|1|2)\.ffn.*=CUDA0" \
-ot "attn.*=CUDA0" \
-ot "shexp=CUDA0" \
-ot "blk\.(3|4|5)\.ffn.*=CUDA0" \
-ot "blk\.(6|7|8|9)\.ffn.*=CUDA1" \
-ot "blk\.(10|11|12|13)\.ffn.*=CUDA2" \
-ot "blk\.(14|15|16|17|18)\.ffn.*=CUDA3" \
-ot "blk\.(19|20|21|22)\.ffn.*=CUDA4" \
-ot "blk\.(23|24|25|26)\.ffn.*=CUDA5" \
-ot "blk\.(27|28|29|30|31|32|33|34)\.ffn.*=CUDA6" \
-ot "exps=CPU" \
-fa -fmoe -mla 3 -amb 256 \
-mg 0 \
-ub 2048 \
--threads 8

Increasing ub to 4096 makes the compute buffers absolutely huge, but with -amb 256 it may be viable. It is mostly a matter of playing with CUDA0 to make sure it doesn't OOM.

I did it that way to split one layer across 2 GPUs, so I can almost max out the VRAM on each GPU instead of, for example, having 3 GB left unused.

And got some interesting results. Layers look like:

https://pastebin.com/ntJi3Prd

Did a small bench and got

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  2048 |    512 |      0 |    9.231 |   221.86 |   56.531 |     9.06 |
|  2048 |    512 |   2048 |    9.223 |   222.06 |   75.684 |     6.76 |
|  2048 |    512 |   4096 |    9.572 |   213.95 |   93.136 |     5.50 |

So PP is higher but TG is slower. I will try running the command from my OG post, but adapted a bit to -mla 3, -fmoe and -amb!

Okay, now running with

./llama-sweep-bench -m '/run/media/pancho/DE1652041651DDD9/HuggingFaceModelDownloader/Storage/GGUFs/DeepSeek-TNG-R1T2-Chimera-IQ3_KS-merged.gguf' -c 16384 --no-mmap -ngl 999 -ot "blk.(0|1|2|3|4|5|6|7).ffn.=CUDA0" -ot "blk.(8|9|10|11).ffn.=CUDA1" -ot "blk.(12|13|14|15).ffn.=CUDA2" -ot "blk.(16|17|18|19|20).ffn.=CUDA3" -ot "blk.(21|22|23).ffn.=CUDA4" -ot "blk.(24|25|26).ffn.=CUDA5" -ot "blk.(27|28|29|30|31|32|33).ffn.=CUDA6" -ot "ffn.*=CPU" -fa -mg 0 -ub 2048 -mla 3 -fmoe -amb 256 --threads 8

got

main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, n_gpu_layers = 999, n_threads = 8, n_threads_batch = 8

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  2048 |    512 |      0 |    9.479 |   216.05 |   66.115 |     7.74 |
|  2048 |    512 |   2048 |    9.576 |   213.88 |   61.981 |     8.26 |
|  2048 |    512 |   4096 |   10.109 |   202.59 |   61.915 |     8.27 |
|  2048 |    512 |   6144 |   10.605 |   193.11 |   63.156 |     8.11 |
|  2048 |    512 |   8192 |   11.227 |   182.41 |   63.618 |     8.05 |
|  2048 |    512 |  10240 |   11.745 |   174.37 |   64.337 |     7.96 |
|  2048 |    512 |  12288 |   12.321 |   166.22 |   64.624 |     7.92 |
|  2048 |    512 |  14336 |   12.798 |   160.02 |   65.049 |     7.87 |

I like that PP upgrade! Now I think I can add 1 more layer to each 3090 and 1 to the A6000 thanks to -amb 256, so I'm gonna see how it goes.

Okay, and running with

./llama-sweep-bench -m '/run/media/pancho/DE1652041651DDD9/HuggingFaceModelDownloader/Storage/GGUFs/DeepSeek-TNG-R1T2-Chimera-IQ3_KS-merged.gguf' -c 16384 --no-mmap -ngl 999 -ot "blk.(0|1|2|3|4|5|6|7).ffn.=CUDA0" -ot "blk.(8|9|10|11).ffn.=CUDA1" -ot "blk.(12|13|14|15).ffn.=CUDA2" -ot "blk.(16|17|18|19|20).ffn.=CUDA3" -ot "blk.(21|22|23|24).ffn.=CUDA4" -ot "blk.(25|26|27|28).ffn.=CUDA5" -ot "blk.(29|30|31|32|33|34|35|36).ffn.=CUDA6" -ot "ffn.*=CPU" -fa -mg 0 -ub 2048 -mla 3 -fmoe -amb 256 --threads 8

Got

main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, n_gpu_layers = 999, n_threads = 8, n_threads_batch = 8

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  2048 |    512 |      0 |    9.214 |   222.27 |   56.959 |     8.99 |
|  2048 |    512 |   2048 |    8.956 |   228.66 |   57.355 |     8.93 |
|  2048 |    512 |   4096 |    9.488 |   215.86 |   57.770 |     8.86 |
|  2048 |    512 |   6144 |    9.977 |   205.27 |   58.641 |     8.73 |

That's quite an improvement! I don't think I can increase -ub and -b, as the 24GB GPUs have just 1GB or 800MB left. I could drop 1 layer from each GPU except the A6000, but TG t/s might take a hit. For testing I may do it anyways.

-mla 3 + -amb 256 is pure magic. Now I have to redo all the load commands I was using, adapted to that... lol.

EDIT: Okay, maybe with -amb 256 I can increase ub/b more than I expected. With 1 layer less on each GPU and ub/b 4096, I get

main: n_kv_max = 16384, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 999, n_threads = 8, n_threads_batch = 8

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  4096 |   1024 |      0 |   14.073 |   291.04 |  134.280 |     7.63 |
|  4096 |   1024 |   4096 |   15.358 |   266.71 |  136.395 |     7.51 |
|  4096 |   1024 |   8192 |   17.210 |   238.00 |  139.936 |     7.32 |

EDIT2: Using the first command with 4096 ub/b I get this, but then an OOM

main: n_kv_max = 16384, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 999, n_threads = 8, n_threads_batch = 8

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  4096 |   1024 |      0 |   12.640 |   324.06 |  117.220 |     8.74 |
|  4096 |   1024 |   4096 |   14.045 |   291.64 |  119.807 |     8.55 |

Hey mate, would you happen to know why this quant, DeepSeek-TNG-R1T2-Chimera-GGUF/IQ1_S, gets
PP=56.92 tokens per second, TG=14.68 tokens per second

whereas these two, DeepSeek-R1-0528-IQ1_S_R4 and DeepSeek-V3-0324-IQ1_S_R4, get

PP=210 tokens per second, TG=16 tokens per second?

I was trying to keep up with all these new formats / CLI parameters, but there's a lot for me to parse now -_-!

This is the exact command I've been using for the past few weeks, and I just dropped in this model in place of the other two I'd been switching between:

./llama-server -m /models/gguf/DeepSeek-TNG-R1T2-Chimera-IQ1_S/DeepSeek-TNG-R1T2-Chimera-IQ1_S-00001-of-00003.gguf -c 16384 --no-mmap --no-warmup -ngl 999 -mla 3 --host 0.0.0.0 --port 8080 -fa -mg 0 --ubatch-size 2048 -fmoe \
-ot blk\.(1|2|3|4|5|6|32|36|40)\.ffn.*=CUDA0 \
-ot blk\.(7|8|9|10|11|12|28|35)\.ffn.*=CUDA1 \
-ot blk\.(13|14|15|16|17|29|34|39)\.ffn.*=CUDA2 \
-ot blk\.(18|19|20|21|22|30|32|38|41)\.ffn.*=CUDA3 \
-ot blk\.(23|24|25|26|27|31|33|37)\.ffn.*=CUDA4 \
-ot ffn.*=CPU -ctk q8_0 -ctv q8_0

All GPUs are RTX 3090s and the CPU is a Threadripper 7960X 24-core.

I'm guessing I need to wait for an R4 quant or read up on how to create one myself?

@gghfez

You guys with 5 GPUs have the biggest commands haha, I love it!

I'm guessing I need to wait for an R4 quant or read up on how to create one myself?

Right, I'm planning to release an IQ1_S_R4 soon. Normally you can simply use -rtr if you have enough RAM to hold the entire model (as it disables mmap()), or usually you can llama-quantize --repack input.gguf; HOWEVER, the IQ1_S and IQ1_S_R4 are exceptions to that rule...

Also, if you read my comments here, the IQ1_S may actually run faster than the _R4 if you use a high enough ub, e.g. -ub 4096 -b 4096...

So you kinda have to play around...
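For reference, the two repack routes look roughly like this. This is only a sketch based on the flags mentioned in this thread, so check --help on your build for the exact syntax, and remember IQ1_S/IQ1_M are the exceptions noted above:

./llama-server -m model.gguf -rtr ... # <--- repack at load time; needs enough RAM for the whole model since it disables mmap()
./llama-quantize --repack input.gguf ... # <--- offline repack to an _R4 file once, then load that normally (exact arguments may differ on your build)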

I'll make a few minor modifications to your command too, including leaving the first three dense layers blk.[0-2].ffn_(gate|up|down) not overridden, and only overriding the exps rather than all the ffn tensors, which would include the shexp that shouldn't be on CPU. So somewhat similar to what I mentioned above (also remember to compile as shown above too):

./llama-server \
-m /models/gguf/DeepSeek-TNG-R1T2-Chimera-IQ1_S/DeepSeek-TNG-R1T2-Chimera-IQ1_S-00001-of-00003.gguf \
--no-mmap \
--no-warmup \
-c 16384 \
-ctk q8_0 -ctv q8_0 \
--fa -fmoe -mla 3  -amb 512 \
-mg 0 \
--ubatch-size 2048 \
-ngl 999 \
-ot blk\.(3|4|5|6|32|36|40|41|42)\.ffn.*=CUDA0 \
-ot blk\.(7|8|9|10|11|12|28|35)\.ffn.*=CUDA1 \
-ot blk\.(13|14|15|16|17|29|34|39)\.ffn.*=CUDA2 \
-ot blk\.(18|19|20|21|22|30|32|38|41)\.ffn.*=CUDA3 \
-ot blk\.(23|24|25|26|27|31|33|37)\.ffn.*=CUDA4 \
-ot exps=CPU \
--host 0.0.0.0 \
--port 8080

Then consider -ub 4096 -b 4096 to get the full benefits of non _r4, which would probably require you to adjust all your layer offloads. It's a PITA, I'm sure, but try to keep the blocks sequential to avoid extra passing of data between cards (I think it works that way, but I'd need to use that PCIe bus profiler tool to confirm).

Also take advantage of -amb 512 or 256 if you want for a little extra VRAM with very minor speed overhead.

@Panchovix

EDIT2: Using the first command with 4096 ub/b I get this, but then an OOM

Nice, definitely eking out some noticeable improvements.

Let's do one more pass, more like your original command but with a few more changes like I just suggested to gghfez: be specific about using exactly -ot exps=CPU, as the shexp layers are pretty surely getting caught by your existing wildcard (you don't want those on CPU since every token always uses them). I left those first 3 dense layers alone in yours, as otherwise you'd probably have to re-balance the layers again haha...
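To spell out the distinction, here are the tensor families those regexes are matching (names taken from the commands and logs earlier in this thread):

blk.N.ffn_(gate|up|down)_exps.weight -> routed experts, the big tensors you spill to CPU when VRAM runs out
blk.N.ffn_(gate|up|down)_shexp.weight -> shared expert, used by every token, keep it on GPU
blk.(0|1|2).ffn_(gate|up|down).weight -> the first three dense ffn layers, also best kept on GPU

So -ot exps=CPU only catches the routed experts, while -ot "ffn.*=CPU" also drags the shexp (and any un-overridden dense ffn) onto CPU.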

./llama-sweep-bench \
-m '/run/media/pancho/DE1652041651DDD9/HuggingFaceModelDownloader/Storage/GGUFs/DeepSeek-TNG-R1T2-Chimera-IQ3_KS-merged.gguf' \
-c 16384 \
--no-mmap \
-ngl 999 \
-ot "blk.(0|1|2|3|4|5|6|7).ffn.=CUDA0" \
-ot "blk.(8|9|10|11).ffn.=CUDA1" \
-ot "blk.(12|13|14|15).ffn.=CUDA2" \
-ot "blk.(16|17|18|19|20).ffn.=CUDA3" \
-ot "blk.(21|22|23|24).ffn.=CUDA4" \
-ot "blk.(25|26|27|28).ffn.=CUDA5" \
-ot "blk.(29|30|31|32|33|34|35|36).ffn.=CUDA6" \
-ot exps=CPU \
-fa -mg 0 -mla 3 -fmoe -amb 256 \
-ub 4096 -b 4096 \
--threads 8

This would probably require taking a little more off to avoid OOM at 4096, or you could try -ub 2048 -b 2048 or maybe even 3072. I've also noticed that when increasing -ub to higher values it will start up okay but might OOM later, so leave a little extra room.

Okay, enjoy tweaking your rigs and cranking out the last bit of speed!

@ubergarm many thanks! I keep tweaking now for 32K ctx haha.

But I was wondering, have you compared q8 vs fp16 for the kv-cache on any of these DeepSeek models? Because, for example, with that command I could use 32k with just q8.

At fp16 I'm losing some layers completely, hmmm.

HOWEVER the IQ1_S and IQ1_S_R4 are exceptions to that rule...

And IQ1_M/IQ1_M_R4

@gghfez

The answer to why IQ1_S_R4 and IQ1_S behave so differently is that IQ1_S_R4 is far more than just an _R4 version of IQ1_S.

Read up on full details here: https://github.com/ikawrakow/ik_llama.cpp/pull/185

@ubergarm Thank you! Your changes improved the prompt processing (I also had to take layer 42 off CUDA0 to make it fit):

4329 tokens (    4.56 ms per token,   219.32 tokens per second)
1401 runs   (   68.50 ms per token,    14.60 tokens per second)

You guys with 5 GPUs have the biggest commands haha, I love it!

Yeah, it was painful when I was using RPC and had to wait 3 minutes after every tweak for the CUDA OOM, after sending 20GB over the 2.5Gbit link each time -_-!
Thank you for taking the time to adjust mine!

The answer to why IQ1_S_R4 and IQ1_S behave so differently is that IQ1_S_R4 is far more than just an _R4 version of IQ1_S.

Thanks for that, I do need to take the time to catch up on all this again.

Weirdly, I squeezed the IQ2_K in at 8k context earlier and found it runs about as fast as the IQ1_S, so I think I'll upgrade to 192GB DDR5 and run that with fewer layers on the GPUs.

have you compared q8 vs fp16 for the kv-cache on any of these DeepSeek models?

I haven't compared these specific quants, but a few months ago fp16 was almost 2x faster than q8. However, recently I saw someone on reddit using q8 to squeeze more layers into VRAM, so I tried it as well, and both prompt processing and textgen improved.

@ubergarm okay I admit it, using -ot "ffn.*=CPU" was dumb on my part, and -ot exps=CPU is way faster

Running for 32K, with 1 layer less on each GPU except the A6000 (so 6 layers on GPU) I get this

main: n_kv_max = 32768, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 999, n_threads = 8, n_threads_batch = 8

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  4096 |   1024 |      0 |   12.125 |   337.82 |  122.400 |     8.37 |
|  4096 |   1024 |   4096 |   12.540 |   326.62 |  124.413 |     8.23 |

And with your suggestion at ub/b 3072, I get


main: n_kv_max = 16384, n_batch = 3072, n_ubatch = 3072, flash_attn = 1, n_gpu_layers = 999, n_threads = 8, n_threads_batch = 8

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  3072 |    768 |      0 |    9.836 |   312.32 |   79.159 |     9.70 |
|  3072 |    768 |   3072 |   10.078 |   304.83 |   80.262 |     9.57 |
|  3072 |    768 |   6144 |   10.673 |   287.83 |   82.363 |     9.32 |

This is way better! So well, welp, time to redo everything

Hello, and thank you for this quant. I managed to fit it with 72k context and -ub 3840, and I get 180 t/s PP and 6.8 t/s TG.

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  3840 |    960 |      0 |   21.191 |   181.21 |  141.379 |     6.79 |
|  3840 |    960 |   3840 |   22.249 |   172.59 |  144.163 |     6.66 |

This model is about 5GB bigger than the UD_Q3_XL, which gets around 7.15 t/s; the difference is explained by having 5GB less in slow RAM. I will be using this one now, as I hope the extra 5GB makes it just a little bit smarter, and it is right at the limit of my system memory :) 256GB 3200MT/s DDR4 and 2x3090 + an A4500 20GB.

This is the command I am using:
CUDA_VISIBLE_DEVICES="0,1,2" \
./build/bin/llama-sweep-bench \
--model /media/ciprian/ssd/models/chimera-R1T2/DeepSeek-TNG-R1T2-Chimera-IQ3_KS-00001-of-00007.gguf \
--alias DeepSeek-Chimera-Uber-R1T2-Q3 \
--ctx-size 72960 \
-ctk q8_0 \
-mla 3 -fa \
-amb 256 \
-fmoe \
--temp 0.6 \
--top-p 0.95 \
--n-gpu-layers 63 \
-ot "blk.[3-4].ffn_up_exps=CUDA0,blk.[3-4].ffn_gate_exps=CUDA0,blk.[3-4].ffn_down_exps=CUDA0" \
-ot "blk.1[0-3].ffn_up_exps=CUDA1,blk.1[0-3].ffn_gate_exps=CUDA1" \
-ot "blk.1[4-5].ffn_up_exps=CUDA2,blk.1[4-5].ffn_gate_exps=CUDA2,blk.1[4].ffn_down_exps=CUDA2" \
--override-tensor exps=CPU \
--parallel 1 \
--threads 16 \
--threads-batch 16 \
--host 0.0.0.0 --port 5002 \
--ubatch-size 3840 --batch-size 3840 --no-mmap

-ot exps=CPU is way faster
Yeah this is a game changer. I'm using the R4 R1 quant again with -ub 3072

2830 tokens (    3.17 ms per token,   315.55 tokens per second)
745 runs   (   60.49 ms per token,    16.53 tokens per second)

And for just 'hi' it's nearly 18 t/s, I thought I'd loaded the wrong model initially lol

294 runs   (   56.36 ms per token,    17.74 tokens per second)

I've ordered more RAM to run the smarter quants thanks to this! The Q2 is more sonnet-like with solving problems.

@Panchovix

But I was wondering, have you compared q8 vs fp16 for the kv-cache on any of these DeepSeek models? Because, for example, with that command I could use 32k with just q8.

For DeepSeek I pretty much always use -ctk q8_0 to fit double the MLA kv-cache in the same amount of VRAM as fp16, with minimal perplexity loss.
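As a back-of-the-envelope check against the log at the top of this thread (ggml's Q8_0 packs 32 values into 34 bytes, i.e. roughly 8.5 bits per value vs 16 bits for fp16; per-device rounding will shift the numbers a bit):

fp16 MLA cache @ 16k ctx: about 1098 MiB (from the llama_kv_cache_init lines above)
q8_0 MLA cache @ 16k ctx: about 1098 * 8.5/16 = ~583 MiB
q8_0 MLA cache @ 32k ctx: about 2 * 583 = ~1166 MiB, i.e. roughly the same VRAM as fp16 at 16k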

I measured the difference using my DeepSeek-R1-0528-Q8_0.gguf with fp16 vs q8_0 kv-cache:

  • -ctk fp16 3.2119 +/- 0.01697
  • -ctk q8_0 3.2130 +/- 0.01698

It is basically within noise in terms of the effect on quality.

The only time I use fp16 unquantized kv-cache is with small dense models, e.g. Qwen-14B, when fully offloaded, as fp16 can actually be faster than q8_0 on CUDA despite the larger bit depth. But in general q8_0 is almost always the way to go for hybrid CPU+GPU inferencing imo.
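If you want to reproduce that kind of comparison on your own quants, here is a minimal sketch using the llama-perplexity tool from the same build; the wiki.test.raw file and the trailing offload flags are assumptions, so substitute your own test text and reuse whatever -ot/-ngl layout you normally run with:

./llama-perplexity -m DeepSeek-R1-0528-Q8_0.gguf -f wiki.test.raw -fa -mla 3 -fmoe -ctk fp16 ...
./llama-perplexity -m DeepSeek-R1-0528-Q8_0.gguf -f wiki.test.raw -fa -mla 3 -fmoe -ctk q8_0 ...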

This is way better! So well, welp, time to redo everything

Great, over 300 tok/sec PP sounds more like what I'd expect given ik's recent improvements! -ub 4096 -b 4096 can be pushed even larger, but I haven't tried that much personally; I find 4096 to be a good balance considering both PP and TG.

@ciprianv

Now I will be using this one as i hope the extra 5GB could make it just a little bit smarter

In general, the quants I make with ik's new quantization types, e.g. IQ3_KS, are better quality (lower perplexity / KLD), aka "smarter", than mainline quants of the same size. So even if my model were 5GB smaller it might still be "smarter". So yeah, glad you are enjoying these ik quants and that they fit your system!

Let's take a look at your command now:

CUDA_VISIBLE_DEVICES="0,1,2" \
./build/bin/llama-sweep-bench \
--model /media/ciprian/ssd/models/chimera-R1T2/DeepSeek-TNG-R1T2-Chimera-IQ3_KS-00001-of-00007.gguf \
--alias DeepSeek-Chimera-Uber-R1T2-Q3 \
--ctx-size 72960 \
-ctk q8_0 \
-mla 3 -fa \
-amb 256 \
-fmoe \
--temp 0.6 \
--top-p 0.95 \
--n-gpu-layers 63 \
-ot "blk.[3-4].ffn_up_exps=CUDA0,blk.[3-4].ffn_gate_exps=CUDA0,blk.[3-4].ffn_down_exps=CUDA0" \
-ot "blk.1[0-3].ffn_up_exps=CUDA1,blk.1[0-3].ffn_gate_exps=CUDA1" \
-ot "blk.1[4-5].ffn_up_exps=CUDA2,blk.1[4-5].ffn_gate_exps=CUDA2,blk.1[4].ffn_down_exps=CUDA2" \
--override-tensor exps=CPU \
--parallel 1 \
--threads 16 \
--threads-batch 16 \
--host 0.0.0.0 --port 5002 \
--ubatch-size 3840 --batch-size 3840 \
--no-mmap

Looks pretty good, except your regexes seem more confusing than necessary. I'm not sure why people are fussing with individual tensors; maybe trying to cram the VRAM as full as possible or something? At least you didn't split up (up|gate), so you can take full advantage of -fmoe; that is good.

The only thing I can think of possibly changing off the top of my head would be something like:

-ngl 99 \
-ot "blk\.(3|4)\..*exps=CUDA0" \
-ot "blk\.(5|6|7)\..*exps=CUDA1" \
-ot "blk\.(8|9|10)\..*exps=CUDA2" \
-ot exps=CPU \
-mg 0 \
-ub 4096 -b 4096 \

I'm not 100% sure what sizes of ub/b are optimal, but sometimes keeping it a power of 2 might be faster; I'd have to check. Maybe @tdh111 would know, as they are quite knowledgeable.

Thank you for the feedback, and if you have the time, please do an IQ3_KS version for V3/R1 as well. Also, can you tell me, please, what -mg 0 does and why it is useful? Thank you again!

P.S. I am using up+gate and optionally down when I don't have space for all 3, and I dislike having about 3GB of free memory on a GPU; this way I use most of my VRAM :)

@ubergarm amazing! Then I will use -ctk q8_0 for 64K, that diff is minuscule!

And indeed, I got higher speeds.

On the Q2 variant for some small tests (with mla 1), with b/ub 5120 I got

main: n_kv_max = 32768, n_batch = 5120, n_ubatch = 5120, flash_attn = 1, n_gpu_layers = 999, n_threads = 8, n_threads_batch = 8

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  5120 |   1280 |      0 |   12.481 |   410.21 |  104.088 |    12.30 |
|  5120 |   1280 |   5120 |   14.630 |   349.98 |  109.724 |    11.67 |
|  5120 |   1280 |  10240 |   17.167 |   298.25 |  112.938 |    11.33 |
|  5120 |   1280 |  15360 |   20.008 |   255.90 |  119.037 |    10.75 |
|  5120 |   1280 |  20480 |   22.444 |   228.12 |  122.706 |    10.43 |

On the Q3 one (with mla 3) and b/ub 6144

main: n_kv_max = 32768, n_batch = 6144, n_ubatch = 6144, flash_attn = 1, n_gpu_layers = 999, n_threads = 8, n_threads_batch = 8

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  6144 |   1536 |      0 |   15.406 |   398.81 |  174.929 |     8.78 |
|  6144 |   1536 |   6144 |   18.289 |   335.94 |  180.393 |     8.51 |
|  6144 |   1536 |  12288 |   22.229 |   276.39 |  186.113 |     8.25 |
|  6144 |   1536 |  18432 |   24.533 |   250.44 |  191.037 |     8.04 |
|  6144 |   1536 |  24576 |   28.122 |   218.48 |  196.268 |     7.83 |

And on q3, mla 3, b/ub 8192 (can't believe it works)

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  8192 |   2048 |      0 |   20.147 |   406.61 |  232.476 |     8.81 |
|  8192 |   2048 |   8192 |   26.009 |   314.97 |  242.648 |     8.44 |
|  8192 |   2048 |  16384 |   32.628 |   251.07 |  253.309 |     8.09 |
|  8192 |   2048 |  24576 |   39.010 |   210.00 |  264.415 |     7.75 |

So here it seems to be hitting a ceiling with scaling.

TG is heavily bound by my single CCD + RAM, as it maxes out at just 60-64 GB/s, but PP keeps getting wildly higher up to some specific batch/ubatch size, which may be a mix of a CPU-bound issue and a PCIe issue, as it saturates PCIe 5.0 X8 at 27-28 GiB/s.

BTW, for reference, on normal/mainline llama.cpp, where I need to use fewer layers (no -mla 3 and no -amb), I get about half of this max PP and about 75% of the TG. So ik_llama.cpp is noticeably faster for my setup.

Also +1 to @ciprianv's request for an IQ3_KS variant of V3 0324 and R1 0528! That way I could delete Q3_K_XL, which is bigger, but as you say, the quality of your quants can match or surpass it at a smaller size.

P.S. I am using up+gate and optionally down when I don't have space for all 3, and I dislike having about 3GB of free memory on a GPU; this way I use most of my VRAM :)

Yeah, the impression I get is that people are splitting tensors to try to max out VRAM allocation. I'd suggest that if you have some extra VRAM, just increase the context a bit or increase the batch sizes a little. But if it is working for your system then by all means go for it!

Yeah the new IQ3_KS seems to be a really great size for DeepSeek 671B models. I'll consider back-filling the other models with it as well if I get some time after clearing up more disk space haha...

Thanks y'all for the wild ride hah!

UPDATE:
Okay, currently cooking DeepSeek-R1-0528-IQ3_KS.gguf keep an eye out over on https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF

Just to let you guys know, I did some benchmarks with ik_llama.cpp on my setup (192GB RAM + 208GB VRAM) for DeepSeek V3/R1/Chimera at Q2_K_XL, IQ3_XXS, IQ3_KS, Q3_K_XL and IQ4_XS, and posted them on reddit if you want to take a look!

https://www.reddit.com/r/LocalLLaMA/comments/1lwnj5x/performance_benchmarks_on_deepseek/

The performance of ik_llama.cpp for these kinds of setups is really impressive!

And @ubergarm, I gave you a lot of credit but I forgot your reddit username, sorry; you can take a look and comment there! Many thanks, I'll wait eagerly for that R1-0528 IQ3_KS and maybe a V3-0324 with the same treatment, you are the GOAT.

@Panchovix @ciprianv

Good news: I'm currently uploading my latest recipe, ubergarm/DeepSeek-R1-0528-GGUF-IQ3_KS, with Final estimate: PPL = 3.2983 +/- 0.01759. It weighs in at 281.463 GiB (3.598 BPW), so hopefully about the right size. Better perplexity than similarly sized Q3_K_XL quants.

(192GB RAM + 208GB VRAM)

Am I going crazy, or is your VRAM count gradually increasing between these posts? lol

That way I could delete Q3_K_XL,

FYI - if you like that specific quant, just be aware before you delete it: I've noticed Unsloth seem to randomly re-quantize and overwrite their quants, so you might not be able to re-download that specific one if they've done this.

@ubergarm amazing, many thanks! I know I ask a lot, but if you do it for V3 0324 it would be really appreciated. I still really like that model.

@gghfez for some time I hadn't been testing new models, but I did repair a 3090 and the A6000; that's why I have more GPUs now. But I can't add any more without changing my platform haha.

And ah I see, interesting. Maybe they recalibrate it or something?

V3 0324 it would be really appreciated. I still really like that model.

Hah we shall see, if i wait a week they might release V3 0724 😛

@Panchovix Ah, you're the guy who soldered his GPU, I recall seeing that comment somewhere lol
I've got the same platform limit but for DDR5, only 4 slots on my TRX50 (because I bought it just before fucking Deepseek-R1 dropped and suddenly CPU/system memory was relevant :)
And yeah, they adjust their imatrix, or adjust the chat template, or some user says "hey I've got this hardware, could you do this size instead?" and they seem to actually tweak it slightly.

Hah we shall see, if i wait a week they might release V3 0724
Building these quants requires a GPU and the bf16 gguf right? (I can't just rent a CPU instance with 768GB of RAM I assume)

Thanks for doing these quants by the way, especially since you don't have 5 GPUs to run them yourself!

Building these quants requires a GPU and the bf16 gguf right? (I can't just rent a CPU instance with 768GB of RAM I assume)

Surprisingly, a CPU instance with 768GB RAM is exactly what you want for making these quants haha... If you are quantizing with vllm-compressor or exllamav3, those projects do use a GPU and require enough VRAM to fit at least the largest tensors. However, ikawrakow has implemented the llama-quantize code to work CPU-only, and it is even quite efficient with RAM.

The biggest challenge is inferencing with the largest model to get the imatrix dat file. I run the big 666GiB Q8_0 CPU-only and let it crank away for 4 hours or so to get that file. Q8_0 is fine, as the original model is natively fp8. For smaller models I do try to use the bf16, yes.

Feel free to use the imatrix dat files I provide to quant your own models on a home gaming rig or whatever, as that only takes ~32GB of RAM or less, a few TB of disk space, and some patience!
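If you want to try it, here is a minimal sketch assuming the standard llama-imatrix / llama-quantize interfaces and hypothetical file names (you can skip the first step entirely by downloading one of my published imatrix .dat files instead):

./llama-imatrix -m DeepSeek-V3-0324-Q8_0.gguf -f calibration.txt -o imatrix.dat # <--- the expensive part: inference the big model over a calibration text
./llama-quantize --imatrix imatrix.dat DeepSeek-V3-0324-BF16.gguf DeepSeek-V3-0324-IQ3_KS.gguf IQ3_KS # <--- the cheap part: quantize the bf16 GGUF down to the target type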

Thanks for doing these quants by the way, especially since you don't have 5 GPUs to run them yourself!

Glad y'all having fun running and benchmarking with them!

@ubergarm do you have a guide on how I might replicate it for DeepSeek V3? I guess I would need the FP16, right?

Now I also wonder, is the IQ3_KS quant here compatible with normal llama.cpp? I haven't even tested it lol.

@Panchovix

do you have a guide how maybe I would replicate it on DeepSeek V3? I guess I would need the FP16 right?

Yeah, I have a basic quant cooker's guide here on ik's discussions. It doesn't cover the evshiron + triton-cpu native fp8 safetensors to bf16 GGUF step, however. For ik_llama.cpp you will want a bf16 that was done that way (fairydreaming's original MLA implementation style, which keeps all the attn_k_b / attn_v_b / attn_kv_b tensor business correct). You could then use my imatrix to skip that step, as it is the most difficult in terms of hardware required.

I believe @Thireus https://huggingface.co/Thireus/models has gone through this process and is currently working on releasing information and a new project on github. Not sure if there is a V3-0324 bf16 already available that was done this way which you could start with.

I've already moved my V3-0324 bf16 to a network storage, and it would take some time to pull it back over to cook the quant you want. If you really are after it, DM me again and we'll work it out ;p

is the IQ3_KS quant here compatible with normal llamacpp?

No, mainline does not support it, though @Nexesenex https://github.com/Nexesenex/croco.cpp/ might support it for some architectures.

Feel free to use the recipes for my IQ3_KS as a starting point if you decide to give it a go! (I also have a little more info on that fp8 -> bf16 buried in my old guide, holler at me if u need the reference).


While I'm here, I just cooked a brand new IQ2_KT, 171.146 GiB (2.188 BPW), which is uploading currently. Perplexity is pretty good at Final estimate: PPL = 3.8887 +/- 0.02191. You could fully offload it across all your GPUs and likely get much better speeds than touching the CPU for hybrid setups. The upload should complete in a couple of hours; the README already has info and recipes.

Sent you a DM on the level1techs forums! Sorry I insist so much on V3 haha, but that model is just the best open-source one without reasoning.

And pretty interesting, I will try that one! In theory, with the buffers, it may not fit, but maybe with less ctx. Multi-GPU has some compute buffer overhead I think, and it uses quite a bit more.

@ubergarm okay, did the quick test on the IQ2_KT and got this

main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 1024, flash_attn = 1, n_gpu_layers = 999, n_threads = 8, n_threads_batch = 8

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  1024 |    256 |      0 |    2.005 |   510.77 |    8.588 |    29.81 |
|  1024 |    256 |   1024 |    1.970 |   519.78 |    8.736 |    29.30 |
|  1024 |    256 |   2048 |    2.138 |   478.86 |    8.845 |    28.94 |
|  1024 |    256 |   3072 |    2.289 |   447.34 |    9.114 |    28.09 |
|  1024 |    256 |   4096 |    2.490 |   411.23 |    9.248 |    27.68 |
|  1024 |    256 |   5120 |    2.660 |   384.95 |    9.445 |    27.10 |
|  1024 |    256 |   6144 |    2.832 |   361.63 |    9.669 |    26.48 |
|  1024 |    256 |   7168 |    2.990 |   342.44 |    9.761 |    26.23 |
|  1024 |    256 |   8192 |    3.250 |   315.04 |   10.047 |    25.48 |
|  1024 |    256 |   9216 |    3.421 |   299.31 |   10.129 |    25.27 |
|  1024 |    256 |  10240 |    3.593 |   284.96 |   10.222 |    25.04 |
|  1024 |    256 |  11264 |    3.752 |   272.90 |   10.536 |    24.30 |
|  1024 |    256 |  12288 |    3.923 |   261.02 |   10.635 |    24.07 |
|  1024 |    256 |  13312 |    4.094 |   250.15 |   10.841 |    23.61 |
|  1024 |    256 |  14336 |    4.273 |   239.62 |   10.954 |    23.37 |
|  1024 |    256 |  15360 |    4.456 |   229.81 |   10.991 |    23.29 |

Each GPU is almost at the VRAM limit, bottlenecked by the 3090s, so I have about 8GB left over that hurts lol. Can't quite increase -ub.
