Benchmarks
FWIW, here are my numbers:
```
numactl --cpubind=0 --membind=0 --physcpubind=$(seq --sep=, 0 2 89) -- ../ik_llama.cpp/build/bin/llama-sweep-bench --numa numactl -t 45 --threads-batch 45 --model /media/b/data2/models/ubergarm/DeepSeek-R1T-Chimera-GGUF/DeepSeek-R1T-Chimera-IQ4_KS.gguf -c 8192 -fmoe -mla 3 -amb 512 --n-gpu-layers 62 -fa --override-tensor exps=CPU -ub 512 -rtr
```
main: n_kv_max = 8192, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 62, n_threads = 45, n_threads_batch = 45
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
512 | 128 | 0 | 26.021 | 19.68 | 22.801 | 5.61 |
512 | 128 | 512 | 25.412 | 20.15 | 23.311 | 5.49 |
512 | 128 | 1024 | 37.749 | 13.56 | 23.082 | 5.55 |
512 | 128 | 1536 | 25.684 | 19.93 | 23.633 | 5.42 |
512 | 128 | 2048 | 25.487 | 20.09 | 23.445 | 5.46 |
512 | 128 | 2560 | 36.017 | 14.22 | 23.769 | 5.39 |
512 | 128 | 3072 | 31.020 | 16.51 | 23.936 | 5.35 |
512 | 128 | 3584 | 27.764 | 18.44 | 23.257 | 5.50 |
512 | 128 | 4096 | 29.388 | 17.42 | 23.294 | 5.50 |
512 | 128 | 4608 | 29.630 | 17.28 | 24.022 | 5.33 |
512 | 128 | 5120 | 34.627 | 14.79 | 23.656 | 5.41 |
512 | 128 | 5632 | 27.843 | 18.39 | 24.254 | 5.28 |
512 | 128 | 6144 | 33.536 | 15.27 | 24.326 | 5.26 |
512 | 128 | 6656 | 27.729 | 18.46 | 24.176 | 5.29 |
512 | 128 | 7168 | 27.220 | 18.81 | 23.574 | 5.43 |
512 | 128 | 7680 | 39.359 | 13.01 | 23.928 | 5.35 |
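For anyone reproducing this, here's the same invocation with comments (model path shortened; the even-CPU pinning assumes hyperthread siblings take the odd IDs on this box):

```bash
# Pin one thread per physical core on socket 0:
# seq emits "0,2,4,...,88" -> 45 logical CPUs.
CORES=$(seq --sep=, 0 2 89)

numactl --cpubind=0 --membind=0 --physcpubind=$CORES -- \
  ./build/bin/llama-sweep-bench \
    --numa numactl -t 45 --threads-batch 45 \
    --model DeepSeek-R1T-Chimera-IQ4_KS.gguf \
    -c 8192 -fa -fmoe -mla 3 -amb 512 \
    --n-gpu-layers 62 \
    --override-tensor exps=CPU \
    -ub 512 -rtr
# --override-tensor exps=CPU keeps the MoE expert tensors in system RAM
# (everything else offloads to the GPU); -rtr repacks those CPU-side
# tensors into the interleaved layout at load time.
```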
And the Q8_0 quant of the same model for comparison:

```
numactl --cpubind=0 --membind=0 --physcpubind=$(seq --sep=, 0 2 89) -- ../ik_llama.cpp/build/bin/llama-sweep-bench --numa numactl -t 45 --threads-batch 45 --model /media/b/data2/models/ubergarm/DeepSeek-R1T-Chimera-GGUF/DeepSeek-R1T-Chimera-Q8_0.gguf -c 1024,2048 -fmoe -mla 3 -amb 512 --n-gpu-layers 60 -fa --override-tensor exps=CPU -ub 512 -rtr
```
main: n_kv_max = 1024, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 60, n_threads = 45, n_threads_batch = 45
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
512 | 128 | 0 | 11.303 | 45.30 | 45.944 | 2.79 |
512 | 128 | 512 | 11.282 | 45.38 | 45.722 | 2.80 |
Thanks for kicking the tires on this one! Looks like the IQ4_KS is faster for TG (though Q8_0 wins on PP), which makes sense: most of the TG time is spent waiting on the CPU/RAM expert layers, so smaller weights help here.
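As a rough sanity check, assuming TG is DRAM-bandwidth bound on the expert tensors: Q8_0 stores 8.5 bits per weight versus ~4.25 for IQ4_KS, so TG should roughly double going from Q8_0 to IQ4_KS, and it does:

```
bytes/weight: Q8_0 / IQ4_KS ≈ 8.5 / 4.25 = 2.0x
measured TG:  IQ4_KS / Q8_0 ≈ 5.5 / 2.8  ≈ 2.0x
```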
I just saw that ik added the ability to use pre-repacked `_r4`-style tensors on the GPU now too: https://github.com/ikawrakow/ik_llama.cpp/pull/461#event-17817351837
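I haven't tried it, but presumably you could also repack offline once instead of paying the `-rtr` repack on every load. A sketch, assuming your build's `llama-quantize` exposes an `iq4_ks_r4` type (the type name is my guess; check `llama-quantize --help` for the exact spelling):

```bash
# Offline repack to a row-interleaved _r4 variant (type name is an assumption);
# --allow-requantize is needed because the input is already quantized.
./build/bin/llama-quantize --allow-requantize \
  DeepSeek-R1T-Chimera-IQ4_KS.gguf \
  DeepSeek-R1T-Chimera-IQ4_KS_R4.gguf \
  iq4_ks_r4
```

After that you'd run `llama-sweep-bench` without `-rtr`, since the tensors are already repacked.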
Cheers!
I noticed a ~10% slowdown in TG and a ~300% speedup in PP! (Versus the IQ4_KS run above: TG ~5.5 → ~4.9 t/s, PP ~20 → ~66 t/s at low context.)
main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 99, n_threads = 45, n_threads_batch = 45
PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
---|---|---|---|---|---|---|
512 | 128 | 0 | 7.756 | 66.01 | 26.152 | 4.89 |
512 | 128 | 512 | 7.640 | 67.01 | 26.169 | 4.89 |
512 | 128 | 1024 | 7.721 | 66.31 | 25.223 | 5.07 |
512 | 128 | 1536 | 7.764 | 65.94 | 26.312 | 4.86 |
512 | 128 | 2048 | 7.933 | 64.54 | 25.499 | 5.02 |
512 | 128 | 2560 | 7.845 | 65.26 | 25.547 | 5.01 |
512 | 128 | 3072 | 7.952 | 64.39 | 25.323 | 5.05 |
512 | 128 | 3584 | 7.965 | 64.28 | 25.953 | 4.93 |
512 | 128 | 4096 | 8.091 | 63.28 | 25.437 | 5.03 |
512 | 128 | 4608 | 8.118 | 63.07 | 25.564 | 5.01 |
512 | 128 | 5120 | 8.193 | 62.49 | 26.702 | 4.79 |
512 | 128 | 5632 | 8.197 | 62.46 | 26.142 | 4.90 |
512 | 128 | 6144 | 8.273 | 61.89 | 26.278 | 4.87 |
512 | 128 | 6656 | 8.265 | 61.95 | 26.892 | 4.76 |
512 | 128 | 7168 | 8.353 | 61.29 | 25.911 | 4.94 |
512 | 128 | 7680 | 8.373 | 61.15 | 26.677 | 4.80 |
512 | 128 | 8192 | 8.439 | 60.67 | 26.239 | 4.88 |
512 | 128 | 8704 | 8.496 | 60.26 | 26.826 | 4.77 |
512 | 128 | 9216 | 8.530 | 60.02 | 26.157 | 4.89 |
512 | 128 | 9728 | 8.576 | 59.70 | 27.790 | 4.61 |
512 | 128 | 10240 | 8.637 | 59.28 | 26.300 | 4.87 |
512 | 128 | 10752 | 8.730 | 58.65 | 28.086 | 4.56 |
512 | 128 | 11264 | 8.762 | 58.43 | 28.818 | 4.44 |
512 | 128 | 11776 | 8.804 | 58.16 | 27.898 | 4.59 |
512 | 128 | 12288 | 8.844 | 57.89 | 27.545 | 4.65 |
512 | 128 | 12800 | 8.910 | 57.46 | 27.500 | 4.65 |
512 | 128 | 13312 | 8.987 | 56.97 | 27.214 | 4.70 |
512 | 128 | 13824 | 9.019 | 56.77 | 27.019 | 4.74 |
512 | 128 | 14336 | 9.074 | 56.42 | 27.389 | 4.67 |
512 | 128 | 14848 | 9.140 | 56.02 | 27.424 | 4.67 |
512 | 128 | 15360 | 9.211 | 55.59 | 27.078 | 4.73 |