benchmarks

#3 by BernardH - opened

FWIW

numactl --cpubind=0 --membind=0 --physcpubind=$(seq --sep=, 0 2 89) -- ../ik_llama.cpp/build/bin/llama-sweep-bench --numa numactl -t 45 --threads-batch 45 --model /media/b/data2/models/ubergarm/DeepSeek-R1T-Chimera-GGUF/DeepSeek-R1T-Chimera-IQ4_KS.gguf -c 8192 -fmoe -mla 3 -amb 512 --n-gpu-layers 62 -fa --override-tensor exps=CPU -ub 512 -rtr
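
For anyone copying this: `$(seq --sep=, 0 2 89)` expands to the even CPU IDs 0,2,4,...,88, i.e. 45 entries matching `-t 45` / `--threads-batch 45`, presumably to pin one thread per physical core on NUMA node 0. A quick sanity check of the expansion before running:

```bash
# Print the core list that --physcpubind will receive: 0,2,4,...,88
seq --sep=, 0 2 89

# Count the entries to confirm the list matches -t 45 / --threads-batch 45
seq 0 2 89 | wc -l
```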

main: n_kv_max = 8192, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 62, n_threads = 45, n_threads_batch = 45

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |   26.021 |    19.68 |   22.801 |     5.61 |
|   512 |    128 |    512 |   25.412 |    20.15 |   23.311 |     5.49 |
|   512 |    128 |   1024 |   37.749 |    13.56 |   23.082 |     5.55 |
|   512 |    128 |   1536 |   25.684 |    19.93 |   23.633 |     5.42 |
|   512 |    128 |   2048 |   25.487 |    20.09 |   23.445 |     5.46 |
|   512 |    128 |   2560 |   36.017 |    14.22 |   23.769 |     5.39 |
|   512 |    128 |   3072 |   31.020 |    16.51 |   23.936 |     5.35 |
|   512 |    128 |   3584 |   27.764 |    18.44 |   23.257 |     5.50 |
|   512 |    128 |   4096 |   29.388 |    17.42 |   23.294 |     5.50 |
|   512 |    128 |   4608 |   29.630 |    17.28 |   24.022 |     5.33 |
|   512 |    128 |   5120 |   34.627 |    14.79 |   23.656 |     5.41 |
|   512 |    128 |   5632 |   27.843 |    18.39 |   24.254 |     5.28 |
|   512 |    128 |   6144 |   33.536 |    15.27 |   24.326 |     5.26 |
|   512 |    128 |   6656 |   27.729 |    18.46 |   24.176 |     5.29 |
|   512 |    128 |   7168 |   27.220 |    18.81 |   23.574 |     5.43 |
|   512 |    128 |   7680 |   39.359 |    13.01 |   23.928 |     5.35 |
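
For reading these tables: each row prefills a PP=512-token chunk at KV depth N_KV and then generates TG=128 tokens, and the throughput columns are just S_PP = PP/T_PP and S_TG = TG/T_TG. Checking the first row:

```bash
# First row above: 512 prompt tokens in 26.021 s, 128 generated tokens in 22.801 s
awk 'BEGIN { printf "S_PP = %.2f t/s, S_TG = %.2f t/s\n", 512/26.021, 128/22.801 }'
# -> S_PP = 19.68 t/s, S_TG = 5.61 t/s
```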

numactl --cpubind=0 --membind=0 --physcpubind=$(seq --sep=, 0 2 89) -- ../ik_llama.cpp/build/bin/llama-sweep-bench --numa numactl -t 45 --threads-batch 45 --model /media/b/data2/models/ubergarm/DeepSeek-R1T-Chimera-GGUF/DeepSeek-R1T-Chimera-Q8_0.gguf -c 1024,2048 -fmoe -mla 3 -amb 512 --n-gpu-layers 60 -fa --override-tensor exps=CPU -ub 512 -rtr

main: n_kv_max = 1024, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 60, n_threads = 45, n_threads_batch = 45

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |   11.303 |    45.30 |   45.944 |     2.79 |
|   512 |    128 |    512 |   11.282 |    45.38 |   45.722 |     2.80 |

Thanks for kicking the tires on this one! Looks like the iq4_ks is faster, which makes sense given that most of the time is spent waiting on the CPU/RAM-resident layers, so smaller weights help here.
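
To put a rough number on that: TG here is essentially bound by streaming the CPU-resident expert weights from RAM, and IQ4_KS uses roughly half the bits per weight of Q8_0 (about 4.25 vs 8.5 bpw, if I have those right), so TG should roughly double. The two runs above are consistent with that:

```bash
# TG at N_KV=512 from the tables above: IQ4_KS ~5.49 t/s vs Q8_0 ~2.80 t/s,
# i.e. close to the ~2x expected from halving the bytes read per token.
awk 'BEGIN { printf "TG speedup: %.2fx\n", 5.49/2.80 }'
# -> TG speedup: 1.96x
```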

I just saw that ik added the ability to use pre-repacked _r4 style tensors on the GPU too: https://github.com/ikawrakow/ik_llama.cpp/pull/461#event-17817351837

Cheers!

I noticed about a 10% slowdown in TG and roughly a 300% speedup in PP!

main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 99, n_threads = 45, n_threads_batch = 45

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |    7.756 |    66.01 |   26.152 |     4.89 |
|   512 |    128 |    512 |    7.640 |    67.01 |   26.169 |     4.89 |
|   512 |    128 |   1024 |    7.721 |    66.31 |   25.223 |     5.07 |
|   512 |    128 |   1536 |    7.764 |    65.94 |   26.312 |     4.86 |
|   512 |    128 |   2048 |    7.933 |    64.54 |   25.499 |     5.02 |
|   512 |    128 |   2560 |    7.845 |    65.26 |   25.547 |     5.01 |
|   512 |    128 |   3072 |    7.952 |    64.39 |   25.323 |     5.05 |
|   512 |    128 |   3584 |    7.965 |    64.28 |   25.953 |     4.93 |
|   512 |    128 |   4096 |    8.091 |    63.28 |   25.437 |     5.03 |
|   512 |    128 |   4608 |    8.118 |    63.07 |   25.564 |     5.01 |
|   512 |    128 |   5120 |    8.193 |    62.49 |   26.702 |     4.79 |
|   512 |    128 |   5632 |    8.197 |    62.46 |   26.142 |     4.90 |
|   512 |    128 |   6144 |    8.273 |    61.89 |   26.278 |     4.87 |
|   512 |    128 |   6656 |    8.265 |    61.95 |   26.892 |     4.76 |
|   512 |    128 |   7168 |    8.353 |    61.29 |   25.911 |     4.94 |
|   512 |    128 |   7680 |    8.373 |    61.15 |   26.677 |     4.80 |
|   512 |    128 |   8192 |    8.439 |    60.67 |   26.239 |     4.88 |
|   512 |    128 |   8704 |    8.496 |    60.26 |   26.826 |     4.77 |
|   512 |    128 |   9216 |    8.530 |    60.02 |   26.157 |     4.89 |
|   512 |    128 |   9728 |    8.576 |    59.70 |   27.790 |     4.61 |
|   512 |    128 |  10240 |    8.637 |    59.28 |   26.300 |     4.87 |
|   512 |    128 |  10752 |    8.730 |    58.65 |   28.086 |     4.56 |
|   512 |    128 |  11264 |    8.762 |    58.43 |   28.818 |     4.44 |
|   512 |    128 |  11776 |    8.804 |    58.16 |   27.898 |     4.59 |
|   512 |    128 |  12288 |    8.844 |    57.89 |   27.545 |     4.65 |
|   512 |    128 |  12800 |    8.910 |    57.46 |   27.500 |     4.65 |
|   512 |    128 |  13312 |    8.987 |    56.97 |   27.214 |     4.70 |
|   512 |    128 |  13824 |    9.019 |    56.77 |   27.019 |     4.74 |
|   512 |    128 |  14336 |    9.074 |    56.42 |   27.389 |     4.67 |
|   512 |    128 |  14848 |    9.140 |    56.02 |   27.424 |     4.67 |
|   512 |    128 |  15360 |    9.211 |    55.59 |   27.078 |     4.73 |
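
To quantify that against the first sweep above (assuming this run is the same IQ4_KS model, now with n_gpu_layers 99 instead of 62): at N_KV=0, PP goes from 19.68 to 66.01 t/s and TG drops from 5.61 to 4.89 t/s.

```bash
# Compare the N_KV=0 rows of this sweep and the first IQ4_KS sweep (assumed baseline)
awk 'BEGIN { printf "PP: %.1fx faster, TG: %.0f%% slower\n", 66.01/19.68, (1 - 4.89/5.61) * 100 }'
# -> PP: 3.4x faster, TG: 13% slower
```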
