6 tokens/second on an Epyc 7532 with 512 GB RAM, 7x RTX 3090, and a 65,000-token context, using ik_llama and this LLM: thanks!!!
Hello,
CPU: Epyc 7532
RAM: 512 GB DDR4 2933 MHz
GPU: 7x RTX 3090 (about 2 GB of VRAM free per GPU after load)
Max context: 65,000 tokens
./build/bin/llama-server --alias anikifoss/DeepSeek-R1-0528-DQ4_K_R4 --model /pathofmodels/anikifoss/DeepSeek-R1-0528-DQ4_K_R4/DeepSeek-R1-0528-DQ4_K_R4-00001-of-00010.gguf --temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1 --repeat-penalty 1.0 --ctx-size 65000 -ctk q8_0 -mla 3 -fa -amb 768 -b 1536 -ub 1536 -fmoe --n-gpu-layers 99 --override-tensor exps=CPU,attn_kv_b=CPU --parallel 1 --threads 60 --host 0.0.0.0 --port 8090 -n -1 --no-mmap --timeout 0 --cont-batching --ignore-eos
Hi @martossien, glad you got it working, but you should be able to get higher tokens per second with 7 GPUs!
The original command line is intended for a single GPU, so that all the attention tensors sit on the GPU and all the MoE tensors sit on the CPU. However, with multiple GPUs you want to allocate the attention and MoE tensors across the GPUs in a way that avoids splitting individual layers across GPUs.
See the discussion here on how to allocate layers to GPUs.
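For illustration only (the layer ranges below are made up, and how many expert layers actually fit on each card depends on the quant and on how much VRAM is left after the attention tensors and KV cache, so treat this as a sketch rather than a recipe), the idea is to pin specific layers' expert tensors to specific CUDA devices with extra --override-tensor / -ot patterns, keeping the catch-all exps=CPU last:

./build/bin/llama-server --model /pathofmodels/anikifoss/DeepSeek-R1-0528-DQ4_K_R4/DeepSeek-R1-0528-DQ4_K_R4-00001-of-00010.gguf \
  --ctx-size 65000 -ctk q8_0 -mla 3 -fa -amb 768 -b 1536 -ub 1536 -fmoe --n-gpu-layers 99 \
  -ot "blk\.(3|4)\.ffn_.*_exps=CUDA0" \
  -ot "blk\.(5|6)\.ffn_.*_exps=CUDA1" \
  -ot "blk\.(7|8)\.ffn_.*_exps=CUDA2" \
  -ot "blk\.(9|10)\.ffn_.*_exps=CUDA3" \
  -ot "blk\.(11|12)\.ffn_.*_exps=CUDA4" \
  -ot "blk\.(13|14)\.ffn_.*_exps=CUDA5" \
  -ot "blk\.(15|16)\.ffn_.*_exps=CUDA6" \
  -ot "exps=CPU,attn_kv_b=CPU" \
  --threads 60 --host 0.0.0.0 --port 8090

The sampling flags and the rest of your original command stay the same (omitted here for brevity). I'm assuming -ot is accepted as the short form of --override-tensor and that when several patterns match a tensor the first one listed wins, so any expert layer not claimed by a per-GPU pattern falls through to exps=CPU.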
FWIW, on one socket (a VM with 45 of the 48 cores) of a dual 7R32 box plus one RTX 4090, I get:
time numactl --cpubind=0 --membind=0 --physcpubind=$(seq --sep=, 0 2 89) -- ../ik_llama.cpp/build/bin/llama-sweep-bench --numa numactl -t 45 --threads-batch 45 --model /media/b/data/models/DeepSeek-R1-0528/DeepSeek-R1-0528-DQ4_K_R4-00001-of-00010.gguf --temp 0.6 --top-p 0.95 --ctx-size 41000 -ctk q8_0 -mla 2 -fa -amb 512 -b 1024 -ub 1024 -fmoe --n-gpu-layers 99 --override-tensor exps=CPU,attn_kv_b=CPU
main: n_kv_max = 41216, n_batch = 1024, n_ubatch = 1024, flash_attn = 1, n_gpu_layers = 99, n_threads = 45, n_threads_batch = 45
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
1024 | 256 | 0 | 12.687 | 80.71 | 55.248 | 4.63 |
1024 | 256 | 1024 | 12.597 | 81.29 | 55.708 | 4.60 |
1024 | 256 | 2048 | 12.740 | 80.38 | 55.497 | 4.61 |
1024 | 256 | 3072 | 12.830 | 79.81 | 55.698 | 4.60 |
1024 | 256 | 4096 | 12.957 | 79.03 | 55.894 | 4.58 |
1024 | 256 | 5120 | 13.240 | 77.34 | 56.627 | 4.52 |
1024 | 256 | 6144 | 13.344 | 76.74 | 57.557 | 4.45 |
1024 | 256 | 7168 | 13.462 | 76.06 | 57.923 | 4.42 |
1024 | 256 | 8192 | 13.609 | 75.24 | 58.129 | 4.40 |
1024 | 256 | 9216 | 13.713 | 74.68 | 58.473 | 4.38 |
1024 | 256 | 10240 | 13.848 | 73.95 | 57.839 | 4.43 |
1024 | 256 | 11264 | 13.993 | 73.18 | 58.573 | 4.37 |
1024 | 256 | 12288 | 14.118 | 72.53 | 59.009 | 4.34 |
1024 | 256 | 13312 | 14.262 | 71.80 | 59.064 | 4.33 |
1024 | 256 | 14336 | 14.451 | 70.86 | 59.469 | 4.30 |
1024 | 256 | 15360 | 14.535 | 70.45 | 59.907 | 4.27 |
1024 | 256 | 16384 | 14.692 | 69.70 | 60.068 | 4.26 |
1024 | 256 | 17408 | 14.859 | 68.91 | 60.391 | 4.24 |
1024 | 256 | 18432 | 14.907 | 68.69 | 60.677 | 4.22 |
1024 | 256 | 19456 | 14.924 | 68.61 | 60.668 | 4.22 |
1024 | 256 | 20480 | 15.028 | 68.14 | 60.785 | 4.21 |
1024 | 256 | 21504 | 15.281 | 67.01 | 61.706 | 4.15 |
1024 | 256 | 22528 | 15.399 | 66.50 | 62.088 | 4.12 |
1024 | 256 | 23552 | 15.586 | 65.70 | 61.825 | 4.14 |
1024 | 256 | 24576 | 15.806 | 64.79 | 62.109 | 4.12 |
1024 | 256 | 25600 | 15.871 | 64.52 | 62.338 | 4.11 |
1024 | 256 | 26624 | 16.017 | 63.93 | 62.855 | 4.07 |
1024 | 256 | 27648 | 16.108 | 63.57 | 63.015 | 4.06 |
1024 | 256 | 28672 | 16.299 | 62.83 | 63.289 | 4.04 |
1024 | 256 | 29696 | 16.413 | 62.39 | 63.697 | 4.02 |
1024 | 256 | 30720 | 16.440 | 62.29 | 63.623 | 4.02 |
1024 | 256 | 31744 | 16.486 | 62.11 | 63.536 | 4.03 |
1024 | 256 | 32768 | 16.685 | 61.37 | 64.087 | 3.99 |
1024 | 256 | 33792 | 16.825 | 60.86 | 64.818 | 3.95 |
1024 | 256 | 34816 | 16.911 | 60.55 | 65.375 | 3.92 |
1024 | 256 | 35840 | 17.107 | 59.86 | 65.864 | 3.89 |
1024 | 256 | 36864 | 17.370 | 58.95 | 65.537 | 3.91 |
1024 | 256 | 37888 | 17.332 | 59.08 | 65.763 | 3.89 |
1024 | 256 | 38912 | 17.440 | 58.72 | 65.743 | 3.89 |
1024 | 256 | 39936 | 17.624 | 58.10 | 66.308 | 3.86 |
failed to decode the batch, n_batch = 1024, ret = 1
main: llama_decode() failed
(The failure at the end is just the sweep hitting the context limit: the next 1024-token batch would have exceeded n_kv_max = 41216.)
I'm pleasantly surprised by the PP (prompt processing) speed.
It's not yet clear to me what the difference is between -mla 2 and -mla 3, but here are the results with -mla 3:
time numactl --cpubind=0 --membind=0 --physcpubind=$(seq --sep=, 0 2 89) -- ../ik_llama.cpp/build/bin/llama-sweep-bench --numa numactl -t 45 --threads-batch 45 --model /media/b/data/models/DeepSeek-R1-0528/DeepSeek-R1-0528-DQ4_K_R4-00001-of-00010.gguf --temp 0.6 --top-p 0.95 --ctx-size 41000 -ctk q8_0 -mla 3 -fa -amb 512 -b 1024 -ub 1024 -fmoe --n-gpu-layers 99 --override-tensor exps=CPU,attn_kv_b=CPU
main: n_kv_max = 41216, n_batch = 1024, n_ubatch = 1024, flash_attn = 1, n_gpu_layers = 99, n_threads = 45, n_threads_batch = 45
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
1024 | 256 | 0 | 12.932 | 79.18 | 54.989 | 4.66 |
1024 | 256 | 1024 | 12.830 | 79.81 | 55.405 | 4.62 |
1024 | 256 | 2048 | 12.949 | 79.08 | 55.432 | 4.62 |
1024 | 256 | 3072 | 13.022 | 78.64 | 55.594 | 4.60 |
1024 | 256 | 4096 | 13.175 | 77.72 | 55.765 | 4.59 |
1024 | 256 | 5120 | 13.317 | 76.89 | 55.770 | 4.59 |
1024 | 256 | 6144 | 13.333 | 76.80 | 56.263 | 4.55 |
1024 | 256 | 7168 | 13.512 | 75.78 | 56.474 | 4.53 |
1024 | 256 | 8192 | 13.656 | 74.98 | 56.511 | 4.53 |
1024 | 256 | 9216 | 13.623 | 75.17 | 56.176 | 4.56 |
1024 | 256 | 10240 | 13.788 | 74.27 | 55.602 | 4.60 |
1024 | 256 | 11264 | 13.896 | 73.69 | 55.503 | 4.61 |
1024 | 256 | 12288 | 14.029 | 72.99 | 56.174 | 4.56 |
1024 | 256 | 13312 | 14.171 | 72.26 | 56.187 | 4.56 |
1024 | 256 | 14336 | 14.280 | 71.71 | 56.060 | 4.57 |
1024 | 256 | 15360 | 14.415 | 71.04 | 56.257 | 4.55 |
1024 | 256 | 16384 | 14.561 | 70.32 | 56.589 | 4.52 |
1024 | 256 | 17408 | 14.792 | 69.23 | 57.105 | 4.48 |
1024 | 256 | 18432 | 14.793 | 69.22 | 57.226 | 4.47 |
1024 | 256 | 19456 | 15.018 | 68.18 | 57.241 | 4.47 |
1024 | 256 | 20480 | 15.050 | 68.04 | 57.522 | 4.45 |
1024 | 256 | 21504 | 15.182 | 67.45 | 56.987 | 4.49 |
1024 | 256 | 22528 | 15.543 | 65.88 | 57.081 | 4.48 |
1024 | 256 | 23552 | 15.533 | 65.92 | 57.276 | 4.47 |
1024 | 256 | 24576 | 15.621 | 65.55 | 57.346 | 4.46 |
1024 | 256 | 25600 | 15.715 | 65.16 | 57.880 | 4.42 |
1024 | 256 | 26624 | 15.831 | 64.68 | 57.485 | 4.45 |
1024 | 256 | 27648 | 15.970 | 64.12 | 57.202 | 4.48 |
1024 | 256 | 28672 | 16.075 | 63.70 | 57.361 | 4.46 |
1024 | 256 | 29696 | 16.318 | 62.75 | 57.496 | 4.45 |
1024 | 256 | 30720 | 16.343 | 62.66 | 56.953 | 4.49 |
1024 | 256 | 31744 | 16.373 | 62.54 | 57.104 | 4.48 |
1024 | 256 | 32768 | 16.536 | 61.93 | 57.158 | 4.48 |
1024 | 256 | 33792 | 16.772 | 61.05 | 57.354 | 4.46 |
1024 | 256 | 34816 | 16.780 | 61.03 | 57.566 | 4.45 |
1024 | 256 | 35840 | 17.125 | 59.79 | 57.686 | 4.44 |
1024 | 256 | 36864 | 17.150 | 59.71 | 58.706 | 4.36 |
1024 | 256 | 37888 | 17.330 | 59.09 | 58.761 | 4.36 |
1024 | 256 | 38912 | 17.496 | 58.53 | 58.443 | 4.38 |
1024 | 256 | 39936 | 17.736 | 57.74 | 58.224 | 4.40 |
failed to decode the batch, n_batch = 1024, ret = 1
main: llama_decode() failed
These are good numbers, but I have a very strong feeling that you should be able to make it go faster by assigning layers to specific CUDA devices directly.
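If you want to measure the effect before settling on a layout, llama-sweep-bench (which produced the tables above) takes essentially the same model and offload flags as llama-server, so you could run it once with the current exps=CPU placement and once with per-GPU expert overrides like the sketch earlier in the thread, and compare the S_PP t/s and S_TG t/s columns. A rough example using the flags from your command (paths and thread count are yours, adjust as needed):

./build/bin/llama-sweep-bench --model /pathofmodels/anikifoss/DeepSeek-R1-0528-DQ4_K_R4/DeepSeek-R1-0528-DQ4_K_R4-00001-of-00010.gguf \
  --ctx-size 65000 -ctk q8_0 -mla 3 -fa -amb 768 -b 1536 -ub 1536 -fmoe --n-gpu-layers 99 \
  --override-tensor exps=CPU,attn_kv_b=CPU --threads 60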