benchmarks

#3 by BernardH - opened

FWIW

numactl --cpubind=0 --membind=0 --physcpubind=$(seq --sep=, 0 2 89) -- ../ik_llama.cpp/build/bin/llama-sweep-bench --numa numactl -t 45 --threads-batch 45 --model /media/b/data2/models/ubergarm/DeepSeek-R1T-Chimera-GGUF/DeepSeek-R1T-Chimera-IQ4_KS.gguf -c 8192 -fmoe -mla 3 -amb 512 --n-gpu-layers 62 -fa --override-tensor exps=CPU -ub 512 -rtr
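
For anyone copying this: `$(seq --sep=, 0 2 89)` expands to the even CPU IDs 0,2,4,...,88, i.e. 45 entries matching `-t 45` / `--threads-batch 45`, presumably to pin one thread per physical core on NUMA node 0. A quick sanity check of the expansion before running:

```bash
# Print the core list that --physcpubind will receive: 0,2,4,...,88
seq --sep=, 0 2 89

# Count the entries to confirm the list matches -t 45 / --threads-batch 45
seq 0 2 89 | wc -l
```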

main: n_kv_max = 8192, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 62, n_threads = 45, n_threads_batch = 45

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |   26.021 |    19.68 |   22.801 |     5.61 |
|   512 |    128 |    512 |   25.412 |    20.15 |   23.311 |     5.49 |
|   512 |    128 |   1024 |   37.749 |    13.56 |   23.082 |     5.55 |
|   512 |    128 |   1536 |   25.684 |    19.93 |   23.633 |     5.42 |
|   512 |    128 |   2048 |   25.487 |    20.09 |   23.445 |     5.46 |
|   512 |    128 |   2560 |   36.017 |    14.22 |   23.769 |     5.39 |
|   512 |    128 |   3072 |   31.020 |    16.51 |   23.936 |     5.35 |
|   512 |    128 |   3584 |   27.764 |    18.44 |   23.257 |     5.50 |
|   512 |    128 |   4096 |   29.388 |    17.42 |   23.294 |     5.50 |
|   512 |    128 |   4608 |   29.630 |    17.28 |   24.022 |     5.33 |
|   512 |    128 |   5120 |   34.627 |    14.79 |   23.656 |     5.41 |
|   512 |    128 |   5632 |   27.843 |    18.39 |   24.254 |     5.28 |
|   512 |    128 |   6144 |   33.536 |    15.27 |   24.326 |     5.26 |
|   512 |    128 |   6656 |   27.729 |    18.46 |   24.176 |     5.29 |
|   512 |    128 |   7168 |   27.220 |    18.81 |   23.574 |     5.43 |
|   512 |    128 |   7680 |   39.359 |    13.01 |   23.928 |     5.35 |
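
For reading these tables: each row prefills a PP=512-token chunk at KV depth N_KV and then generates TG=128 tokens, and the throughput columns are just S_PP = PP/T_PP and S_TG = TG/T_TG. Checking the first row:

```bash
# First row above: 512 prompt tokens in 26.021 s, 128 generated tokens in 22.801 s
awk 'BEGIN { printf "S_PP = %.2f t/s, S_TG = %.2f t/s\n", 512/26.021, 128/22.801 }'
# -> S_PP = 19.68 t/s, S_TG = 5.61 t/s
```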

numactl --cpubind=0 --membind=0 --physcpubind=$(seq --sep=, 0 2 89) -- ../ik_llama.cpp/build/bin/llama-sweep-bench --numa numactl -t 45 --threads-batch 45 --model /media/b/data2/models/ubergarm/DeepSeek-R1T-Chimera-GGUF/DeepSeek-R1T-Chimera-Q8_0.gguf -c 1024,2048 -fmoe -mla 3 -amb 512 --n-gpu-layers 60 -fa --override-tensor exps=CPU -ub 512 -rtr

main: n_kv_max = 1024, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 60, n_threads = 45, n_threads_batch = 45

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |   11.303 |    45.30 |   45.944 |     2.79 |
|   512 |    128 |    512 |   11.282 |    45.38 |   45.722 |     2.80 |

Thanks for kicking the tires on this one! Looks like the iq4_ks is faster, which makes sense given that most of the time is spent waiting on the CPU/RAM-resident layers, so smaller weights help here.
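
To put a rough number on that: TG here is essentially bound by streaming the CPU-resident expert weights from RAM, and IQ4_KS uses roughly half the bits per weight of Q8_0 (about 4.25 vs 8.5 bpw, if I have those right), so TG should roughly double. The two runs above are consistent with that:

```bash
# TG at N_KV=512 from the tables above: IQ4_KS ~5.49 t/s vs Q8_0 ~2.80 t/s,
# i.e. close to the ~2x expected from halving the bytes read per token.
awk 'BEGIN { printf "TG speedup: %.2fx\n", 5.49/2.80 }'
# -> TG speedup: 1.96x
```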

I just saw that ik added the ability to use pre-repacked _r4 style tensors on the GPU too: https://github.com/ikawrakow/ik_llama.cpp/pull/461#event-17817351837

Cheers!

I noticed about a 10% slowdown in TG and roughly a 300% speedup in PP!

main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 99, n_threads = 45, n_threads_batch = 45

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |    7.756 |    66.01 |   26.152 |     4.89 |
|   512 |    128 |    512 |    7.640 |    67.01 |   26.169 |     4.89 |
|   512 |    128 |   1024 |    7.721 |    66.31 |   25.223 |     5.07 |
|   512 |    128 |   1536 |    7.764 |    65.94 |   26.312 |     4.86 |
|   512 |    128 |   2048 |    7.933 |    64.54 |   25.499 |     5.02 |
|   512 |    128 |   2560 |    7.845 |    65.26 |   25.547 |     5.01 |
|   512 |    128 |   3072 |    7.952 |    64.39 |   25.323 |     5.05 |
|   512 |    128 |   3584 |    7.965 |    64.28 |   25.953 |     4.93 |
|   512 |    128 |   4096 |    8.091 |    63.28 |   25.437 |     5.03 |
|   512 |    128 |   4608 |    8.118 |    63.07 |   25.564 |     5.01 |
|   512 |    128 |   5120 |    8.193 |    62.49 |   26.702 |     4.79 |
|   512 |    128 |   5632 |    8.197 |    62.46 |   26.142 |     4.90 |
|   512 |    128 |   6144 |    8.273 |    61.89 |   26.278 |     4.87 |
|   512 |    128 |   6656 |    8.265 |    61.95 |   26.892 |     4.76 |
|   512 |    128 |   7168 |    8.353 |    61.29 |   25.911 |     4.94 |
|   512 |    128 |   7680 |    8.373 |    61.15 |   26.677 |     4.80 |
|   512 |    128 |   8192 |    8.439 |    60.67 |   26.239 |     4.88 |
|   512 |    128 |   8704 |    8.496 |    60.26 |   26.826 |     4.77 |
|   512 |    128 |   9216 |    8.530 |    60.02 |   26.157 |     4.89 |
|   512 |    128 |   9728 |    8.576 |    59.70 |   27.790 |     4.61 |
|   512 |    128 |  10240 |    8.637 |    59.28 |   26.300 |     4.87 |
|   512 |    128 |  10752 |    8.730 |    58.65 |   28.086 |     4.56 |
|   512 |    128 |  11264 |    8.762 |    58.43 |   28.818 |     4.44 |
|   512 |    128 |  11776 |    8.804 |    58.16 |   27.898 |     4.59 |
|   512 |    128 |  12288 |    8.844 |    57.89 |   27.545 |     4.65 |
|   512 |    128 |  12800 |    8.910 |    57.46 |   27.500 |     4.65 |
|   512 |    128 |  13312 |    8.987 |    56.97 |   27.214 |     4.70 |
|   512 |    128 |  13824 |    9.019 |    56.77 |   27.019 |     4.74 |
|   512 |    128 |  14336 |    9.074 |    56.42 |   27.389 |     4.67 |
|   512 |    128 |  14848 |    9.140 |    56.02 |   27.424 |     4.67 |
|   512 |    128 |  15360 |    9.211 |    55.59 |   27.078 |     4.73 |
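
To quantify that against the first sweep above (assuming this run is the same IQ4_KS model, now with n_gpu_layers 99 instead of 62): at N_KV=0, PP goes from 19.68 to 66.01 t/s and TG drops from 5.61 to 4.89 t/s.

```bash
# Compare the N_KV=0 rows of this sweep and the first IQ4_KS sweep (assumed baseline)
awk 'BEGIN { printf "PP: %.1fx faster, TG: %.0f%% slower\n", 66.01/19.68, (1 - 4.89/5.61) * 100 }'
# -> PP: 3.4x faster, TG: 13% slower
```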
