6 tokens/second with an Epyc 7532, 512 GB RAM, 7× RTX 3090, and a 65000-token context using ik_llama and this LLM: Thanks!!!

#2
by martossien - opened

Hello,
CPU: Epyc 7532
RAM: 512 GB DDR4-2933
GPU: 7× RTX 3090 (about 2 GB VRAM free per GPU after load)
Max context: 65000 tokens

./build/bin/llama-server --alias anikifoss/DeepSeek-R1-0528-DQ4_K_R4 --model /pathofmodels/anikifoss/DeepSeek-R1-0528-DQ4_K_R4/DeepSeek-R1-0528-DQ4_K_R4-00001-of-00010.gguf --temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1 --repeat-penalty 1.0 --ctx-size 65000 -ctk q8_0 -mla 3 -fa -amb 768 -b 1536 -ub 1536 -fmoe --n-gpu-layers 99 --override-tensor exps=CPU,attn_kv_b=CPU --parallel 1 --threads 60 --host 0.0.0.0 --port 8090 -n -1 --no-mmap --timeout 0 --cont-batching --ignore-eos


Hi @martossien, glad you got it working, but you should be able to get higher tokens per second with 7 GPUs!

The original command line is intended for a single GPU, so that all the attention tensors are on the GPU and all the MoE tensors are on the CPU. With multiple GPUs, however, you want to allocate the attention and MoE tensors across the GPUs in a way that avoids splitting individual layers across devices.

See the discussion here on how to allocate layers to GPUs.
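To make the idea concrete, here is a small sketch of how you might generate `--override-tensor` (`-ot`) arguments that pin contiguous layer ranges of MoE expert tensors to specific CUDA devices, with everything unmatched falling back to the CPU. The tensor-name pattern `blk.N.ffn_*_exps` follows common GGUF naming, but the exact names, the layer range, and how many layers fit in your free VRAM are assumptions you need to verify against your own model:

```python
# Hypothetical sketch: build -ot regexes that assign equal-sized layer
# ranges of expert tensors to CUDA0..CUDAn-1, so no single layer's
# experts are split between devices. Layer range and GPU count are
# placeholders; adjust them to your model and available VRAM.
def override_tensor_args(first_layer: int, last_layer: int, n_gpus: int) -> list:
    layers = list(range(first_layer, last_layer + 1))
    per_gpu = (len(layers) + n_gpus - 1) // n_gpus  # ceiling division
    args = []
    for gpu in range(n_gpus):
        chunk = layers[gpu * per_gpu:(gpu + 1) * per_gpu]
        if not chunk:
            break
        pattern = "|".join(str(l) for l in chunk)
        args.append(f"blk\\.({pattern})\\.ffn_.*_exps=CUDA{gpu}")
    args.append("exps=CPU")  # any expert tensor not matched above stays on CPU
    return args

# Example: spread the experts of layers 3..16 across 7 GPUs, rest on CPU.
for a in override_tensor_args(3, 16, 7):
    print("-ot", a)
```

The key point is that each regex covers whole layers, so a given layer's expert tensors always land on one device.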

FWIW, on one socket (a VM with 45 of 48 cores) of a dual 7R32 plus one 4090, I get:
time numactl --cpubind=0 --membind=0 --physcpubind=$(seq --sep=, 0 2 89) -- ../ik_llama.cpp/build/bin/llama-sweep-bench --numa numactl -t 45 --threads-batch 45 --model /media/b/data/models/DeepSeek-R1-0528/DeepSeek-R1-0528-DQ4_K_R4-00001-of-00010.gguf --temp 0.6 --top-p 0.95 --ctx-size 41000 -ctk q8_0 -mla 2 -fa -amb 512 -b 1024 -ub 1024 -fmoe --n-gpu-layers 99 --override-tensor exps=CPU,attn_kv_b=CPU

main: n_kv_max = 41216, n_batch = 1024, n_ubatch = 1024, flash_attn = 1, n_gpu_layers = 99, n_threads = 45, n_threads_batch = 45

PP | TG | N_KV | T_PP (s) | S_PP (t/s) | T_TG (s) | S_TG (t/s)
1024 256 0 12.687 80.71 55.248 4.63
1024 256 1024 12.597 81.29 55.708 4.60
1024 256 2048 12.740 80.38 55.497 4.61
1024 256 3072 12.830 79.81 55.698 4.60
1024 256 4096 12.957 79.03 55.894 4.58
1024 256 5120 13.240 77.34 56.627 4.52
1024 256 6144 13.344 76.74 57.557 4.45
1024 256 7168 13.462 76.06 57.923 4.42
1024 256 8192 13.609 75.24 58.129 4.40
1024 256 9216 13.713 74.68 58.473 4.38
1024 256 10240 13.848 73.95 57.839 4.43
1024 256 11264 13.993 73.18 58.573 4.37
1024 256 12288 14.118 72.53 59.009 4.34
1024 256 13312 14.262 71.80 59.064 4.33
1024 256 14336 14.451 70.86 59.469 4.30
1024 256 15360 14.535 70.45 59.907 4.27
1024 256 16384 14.692 69.70 60.068 4.26
1024 256 17408 14.859 68.91 60.391 4.24
1024 256 18432 14.907 68.69 60.677 4.22
1024 256 19456 14.924 68.61 60.668 4.22
1024 256 20480 15.028 68.14 60.785 4.21
1024 256 21504 15.281 67.01 61.706 4.15
1024 256 22528 15.399 66.50 62.088 4.12
1024 256 23552 15.586 65.70 61.825 4.14
1024 256 24576 15.806 64.79 62.109 4.12
1024 256 25600 15.871 64.52 62.338 4.11
1024 256 26624 16.017 63.93 62.855 4.07
1024 256 27648 16.108 63.57 63.015 4.06
1024 256 28672 16.299 62.83 63.289 4.04
1024 256 29696 16.413 62.39 63.697 4.02
1024 256 30720 16.440 62.29 63.623 4.02
1024 256 31744 16.486 62.11 63.536 4.03
1024 256 32768 16.685 61.37 64.087 3.99
1024 256 33792 16.825 60.86 64.818 3.95
1024 256 34816 16.911 60.55 65.375 3.92
1024 256 35840 17.107 59.86 65.864 3.89
1024 256 36864 17.370 58.95 65.537 3.91
1024 256 37888 17.332 59.08 65.763 3.89
1024 256 38912 17.440 58.72 65.743 3.89
1024 256 39936 17.624 58.10 66.308 3.86
failed to decode the batch, n_batch = 1024, ret = 1
main: llama_decode() failed
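(The "failed to decode the batch" at the end is presumably just the sweep running past n_kv_max = 41216, not a problem with the run itself.) The throughput columns are simply batch size divided by wall-clock time; a quick sketch recomputing them from the first row of the table above:

```python
# Recompute the S_PP and S_TG columns of a sweep-bench row:
# S_PP = PP / T_PP and S_TG = TG / T_TG, in tokens per second.
def throughput(pp: int, tg: int, t_pp: float, t_tg: float):
    return round(pp / t_pp, 2), round(tg / t_tg, 2)

# First row above: PP=1024, TG=256, T_PP=12.687 s, T_TG=55.248 s.
s_pp, s_tg = throughput(1024, 256, 12.687, 55.248)
print(s_pp, s_tg)  # 80.71 4.63
```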

I'm pleasantly surprised by the PP speed.
It's not clear to me yet what the difference is between -mla 2 and -mla 3, but here are the results with -mla 3:
time numactl --cpubind=0 --membind=0 --physcpubind=$(seq --sep=, 0 2 89) -- ../ik_llama.cpp/build/bin/llama-sweep-bench --numa numactl -t 45 --threads-batch 45 --model /media/b/data/models/DeepSeek-R1-0528/DeepSeek-R1-0528-DQ4_K_R4-00001-of-00010.gguf --temp 0.6 --top-p 0.95 --ctx-size 41000 -ctk q8_0 -mla 3 -fa -amb 512 -b 1024 -ub 1024 -fmoe --n-gpu-layers 99 --override-tensor exps=CPU,attn_kv_b=CPU

main: n_kv_max = 41216, n_batch = 1024, n_ubatch = 1024, flash_attn = 1, n_gpu_layers = 99, n_threads = 45, n_threads_batch = 45

PP | TG | N_KV | T_PP (s) | S_PP (t/s) | T_TG (s) | S_TG (t/s)
1024 256 0 12.932 79.18 54.989 4.66
1024 256 1024 12.830 79.81 55.405 4.62
1024 256 2048 12.949 79.08 55.432 4.62
1024 256 3072 13.022 78.64 55.594 4.60
1024 256 4096 13.175 77.72 55.765 4.59
1024 256 5120 13.317 76.89 55.770 4.59
1024 256 6144 13.333 76.80 56.263 4.55
1024 256 7168 13.512 75.78 56.474 4.53
1024 256 8192 13.656 74.98 56.511 4.53
1024 256 9216 13.623 75.17 56.176 4.56
1024 256 10240 13.788 74.27 55.602 4.60
1024 256 11264 13.896 73.69 55.503 4.61
1024 256 12288 14.029 72.99 56.174 4.56
1024 256 13312 14.171 72.26 56.187 4.56
1024 256 14336 14.280 71.71 56.060 4.57
1024 256 15360 14.415 71.04 56.257 4.55
1024 256 16384 14.561 70.32 56.589 4.52
1024 256 17408 14.792 69.23 57.105 4.48
1024 256 18432 14.793 69.22 57.226 4.47
1024 256 19456 15.018 68.18 57.241 4.47
1024 256 20480 15.050 68.04 57.522 4.45
1024 256 21504 15.182 67.45 56.987 4.49
1024 256 22528 15.543 65.88 57.081 4.48
1024 256 23552 15.533 65.92 57.276 4.47
1024 256 24576 15.621 65.55 57.346 4.46
1024 256 25600 15.715 65.16 57.880 4.42
1024 256 26624 15.831 64.68 57.485 4.45
1024 256 27648 15.970 64.12 57.202 4.48
1024 256 28672 16.075 63.70 57.361 4.46
1024 256 29696 16.318 62.75 57.496 4.45
1024 256 30720 16.343 62.66 56.953 4.49
1024 256 31744 16.373 62.54 57.104 4.48
1024 256 32768 16.536 61.93 57.158 4.48
1024 256 33792 16.772 61.05 57.354 4.46
1024 256 34816 16.780 61.03 57.566 4.45
1024 256 35840 17.125 59.79 57.686 4.44
1024 256 36864 17.150 59.71 58.706 4.36
1024 256 37888 17.330 59.09 58.761 4.36
1024 256 38912 17.496 58.53 58.443 4.38
1024 256 39936 17.736 57.74 58.224 4.40
failed to decode the batch, n_batch = 1024, ret = 1
main: llama_decode() failed
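Comparing the deepest common row of the two tables (N_KV = 39936), -mla 3 holds generation speed noticeably better as the context fills. A quick check of the relative difference, using the S_TG values from the tables above:

```python
# Generation throughput at N_KV = 39936, taken from the two tables above.
s_tg_mla2 = 3.86
s_tg_mla3 = 4.40

# Relative speedup of -mla 3 over -mla 2 at deep context.
speedup = (s_tg_mla3 - s_tg_mla2) / s_tg_mla2
print(f"-mla 3 is {speedup:.1%} faster at deep context")  # 14.0%
```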

These are good numbers, but I have a very strong feeling that you can make it go faster by assigning layers to specific CUDA devices directly.
