6 tokens/second with an Epyc 7532, 512 GB RAM, 7× RTX 3090, and a 65000-token context using ik_llama and this LLM: Thanks!!!

#2
by martossien - opened

Hello,
CPU: Epyc 7532
RAM: 512 GB DDR4-2933
GPU: 7× RTX 3090 (about 2 GB VRAM free per GPU after load)
Max context: 65000 tokens

./build/bin/llama-server --alias anikifoss/DeepSeek-R1-0528-DQ4_K_R4 --model /pathofmodels/anikifoss/DeepSeek-R1-0528-DQ4_K_R4/DeepSeek-R1-0528-DQ4_K_R4-00001-of-00010.gguf --temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1 --repeat-penalty 1.0 --ctx-size 65000 -ctk q8_0 -mla 3 -fa -amb 768 -b 1536 -ub 1536 -fmoe --n-gpu-layers 99 --override-tensor exps=CPU,attn_kv_b=CPU --parallel 1 --threads 60 --host 0.0.0.0 --port 8090 -n -1 --no-mmap --timeout 0 --cont-batching --ignore-eos


Hi @martossien, glad you got it working, but you should be able to get higher tokens per second with 7 GPUs!

The original command line is intended for a single GPU, so that all the attention tensors are on the GPU and all the MoE tensors are on the CPU. With multiple GPUs, however, you want to allocate the attention and MoE tensors across the GPUs in a way that avoids splitting individual layers across devices.

See the discussion here on how to allocate layers to GPUs.
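To make the idea concrete, here is a small sketch of how you might generate `--override-tensor` (`-ot`) arguments that pin contiguous layer ranges of MoE expert tensors to specific CUDA devices, with everything unmatched falling back to the CPU. The tensor-name pattern `blk.N.ffn_*_exps` follows common GGUF naming, but the exact names, the layer range, and how many layers fit in your free VRAM are assumptions you need to verify against your own model:

```python
# Hypothetical sketch: build -ot regexes that assign equal-sized layer
# ranges of expert tensors to CUDA0..CUDAn-1, so no single layer's
# experts are split between devices. Layer range and GPU count are
# placeholders; adjust them to your model and available VRAM.
def override_tensor_args(first_layer: int, last_layer: int, n_gpus: int) -> list:
    layers = list(range(first_layer, last_layer + 1))
    per_gpu = (len(layers) + n_gpus - 1) // n_gpus  # ceiling division
    args = []
    for gpu in range(n_gpus):
        chunk = layers[gpu * per_gpu:(gpu + 1) * per_gpu]
        if not chunk:
            break
        pattern = "|".join(str(l) for l in chunk)
        args.append(f"blk\\.({pattern})\\.ffn_.*_exps=CUDA{gpu}")
    args.append("exps=CPU")  # any expert tensor not matched above stays on CPU
    return args

# Example: spread the experts of layers 3..16 across 7 GPUs, rest on CPU.
for a in override_tensor_args(3, 16, 7):
    print("-ot", a)
```

The key point is that each regex covers whole layers, so a given layer's expert tensors always land on one device.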

FWIW, on one socket (a VM with 45 of 48 cores) of a dual 7R32 plus one 4090, I get:
time numactl --cpubind=0 --membind=0 --physcpubind=$(seq --sep=, 0 2 89) -- ../ik_llama.cpp/build/bin/llama-sweep-bench --numa numactl -t 45 --threads-batch 45 --model /media/b/data/models/DeepSeek-R1-0528/DeepSeek-R1-0528-DQ4_K_R4-00001-of-00010.gguf --temp 0.6 --top-p 0.95 --ctx-size 41000 -ctk q8_0 -mla 2 -fa -amb 512 -b 1024 -ub 1024 -fmoe --n-gpu-layers 99 --override-tensor exps=CPU,attn_kv_b=CPU

main: n_kv_max = 41216, n_batch = 1024, n_ubatch = 1024, flash_attn = 1, n_gpu_layers = 99, n_threads = 45, n_threads_batch = 45

PP | TG | N_KV | T_PP (s) | S_PP (t/s) | T_TG (s) | S_TG (t/s)
1024 256 0 12.687 80.71 55.248 4.63
1024 256 1024 12.597 81.29 55.708 4.60
1024 256 2048 12.740 80.38 55.497 4.61
1024 256 3072 12.830 79.81 55.698 4.60
1024 256 4096 12.957 79.03 55.894 4.58
1024 256 5120 13.240 77.34 56.627 4.52
1024 256 6144 13.344 76.74 57.557 4.45
1024 256 7168 13.462 76.06 57.923 4.42
1024 256 8192 13.609 75.24 58.129 4.40
1024 256 9216 13.713 74.68 58.473 4.38
1024 256 10240 13.848 73.95 57.839 4.43
1024 256 11264 13.993 73.18 58.573 4.37
1024 256 12288 14.118 72.53 59.009 4.34
1024 256 13312 14.262 71.80 59.064 4.33
1024 256 14336 14.451 70.86 59.469 4.30
1024 256 15360 14.535 70.45 59.907 4.27
1024 256 16384 14.692 69.70 60.068 4.26
1024 256 17408 14.859 68.91 60.391 4.24
1024 256 18432 14.907 68.69 60.677 4.22
1024 256 19456 14.924 68.61 60.668 4.22
1024 256 20480 15.028 68.14 60.785 4.21
1024 256 21504 15.281 67.01 61.706 4.15
1024 256 22528 15.399 66.50 62.088 4.12
1024 256 23552 15.586 65.70 61.825 4.14
1024 256 24576 15.806 64.79 62.109 4.12
1024 256 25600 15.871 64.52 62.338 4.11
1024 256 26624 16.017 63.93 62.855 4.07
1024 256 27648 16.108 63.57 63.015 4.06
1024 256 28672 16.299 62.83 63.289 4.04
1024 256 29696 16.413 62.39 63.697 4.02
1024 256 30720 16.440 62.29 63.623 4.02
1024 256 31744 16.486 62.11 63.536 4.03
1024 256 32768 16.685 61.37 64.087 3.99
1024 256 33792 16.825 60.86 64.818 3.95
1024 256 34816 16.911 60.55 65.375 3.92
1024 256 35840 17.107 59.86 65.864 3.89
1024 256 36864 17.370 58.95 65.537 3.91
1024 256 37888 17.332 59.08 65.763 3.89
1024 256 38912 17.440 58.72 65.743 3.89
1024 256 39936 17.624 58.10 66.308 3.86
failed to decode the batch, n_batch = 1024, ret = 1
main: llama_decode() failed
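(The "failed to decode the batch" at the end is presumably just the sweep running past n_kv_max = 41216, not a problem with the run itself.) The throughput columns are simply batch size divided by wall-clock time; a quick sketch recomputing them from the first row of the table above:

```python
# Recompute the S_PP and S_TG columns of a sweep-bench row:
# S_PP = PP / T_PP and S_TG = TG / T_TG, in tokens per second.
def throughput(pp: int, tg: int, t_pp: float, t_tg: float):
    return round(pp / t_pp, 2), round(tg / t_tg, 2)

# First row above: PP=1024, TG=256, T_PP=12.687 s, T_TG=55.248 s.
s_pp, s_tg = throughput(1024, 256, 12.687, 55.248)
print(s_pp, s_tg)  # 80.71 4.63
```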

I'm pleasantly surprised by the PP speed.
It's not clear to me yet what the difference is between -mla 2 and -mla 3, but here are the results with -mla 3:
time numactl --cpubind=0 --membind=0 --physcpubind=$(seq --sep=, 0 2 89) -- ../ik_llama.cpp/build/bin/llama-sweep-bench --numa numactl -t 45 --threads-batch 45 --model /media/b/data/models/DeepSeek-R1-0528/DeepSeek-R1-0528-DQ4_K_R4-00001-of-00010.gguf --temp 0.6 --top-p 0.95 --ctx-size 41000 -ctk q8_0 -mla 3 -fa -amb 512 -b 1024 -ub 1024 -fmoe --n-gpu-layers 99 --override-tensor exps=CPU,attn_kv_b=CPU

main: n_kv_max = 41216, n_batch = 1024, n_ubatch = 1024, flash_attn = 1, n_gpu_layers = 99, n_threads = 45, n_threads_batch = 45

PP | TG | N_KV | T_PP (s) | S_PP (t/s) | T_TG (s) | S_TG (t/s)
1024 256 0 12.932 79.18 54.989 4.66
1024 256 1024 12.830 79.81 55.405 4.62
1024 256 2048 12.949 79.08 55.432 4.62
1024 256 3072 13.022 78.64 55.594 4.60
1024 256 4096 13.175 77.72 55.765 4.59
1024 256 5120 13.317 76.89 55.770 4.59
1024 256 6144 13.333 76.80 56.263 4.55
1024 256 7168 13.512 75.78 56.474 4.53
1024 256 8192 13.656 74.98 56.511 4.53
1024 256 9216 13.623 75.17 56.176 4.56
1024 256 10240 13.788 74.27 55.602 4.60
1024 256 11264 13.896 73.69 55.503 4.61
1024 256 12288 14.029 72.99 56.174 4.56
1024 256 13312 14.171 72.26 56.187 4.56
1024 256 14336 14.280 71.71 56.060 4.57
1024 256 15360 14.415 71.04 56.257 4.55
1024 256 16384 14.561 70.32 56.589 4.52
1024 256 17408 14.792 69.23 57.105 4.48
1024 256 18432 14.793 69.22 57.226 4.47
1024 256 19456 15.018 68.18 57.241 4.47
1024 256 20480 15.050 68.04 57.522 4.45
1024 256 21504 15.182 67.45 56.987 4.49
1024 256 22528 15.543 65.88 57.081 4.48
1024 256 23552 15.533 65.92 57.276 4.47
1024 256 24576 15.621 65.55 57.346 4.46
1024 256 25600 15.715 65.16 57.880 4.42
1024 256 26624 15.831 64.68 57.485 4.45
1024 256 27648 15.970 64.12 57.202 4.48
1024 256 28672 16.075 63.70 57.361 4.46
1024 256 29696 16.318 62.75 57.496 4.45
1024 256 30720 16.343 62.66 56.953 4.49
1024 256 31744 16.373 62.54 57.104 4.48
1024 256 32768 16.536 61.93 57.158 4.48
1024 256 33792 16.772 61.05 57.354 4.46
1024 256 34816 16.780 61.03 57.566 4.45
1024 256 35840 17.125 59.79 57.686 4.44
1024 256 36864 17.150 59.71 58.706 4.36
1024 256 37888 17.330 59.09 58.761 4.36
1024 256 38912 17.496 58.53 58.443 4.38
1024 256 39936 17.736 57.74 58.224 4.40
failed to decode the batch, n_batch = 1024, ret = 1
main: llama_decode() failed
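Comparing the deepest common row of the two tables (N_KV = 39936), -mla 3 holds generation speed noticeably better as the context fills. A quick check of the relative difference, using the S_TG values from the tables above:

```python
# Generation throughput at N_KV = 39936, taken from the two tables above.
s_tg_mla2 = 3.86
s_tg_mla3 = 4.40

# Relative speedup of -mla 3 over -mla 2 at deep context.
speedup = (s_tg_mla3 - s_tg_mla2) / s_tg_mla2
print(f"-mla 3 is {speedup:.1%} faster at deep context")  # 14.0%
```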

These are good numbers, but I have a very strong feeling that you can make it go faster by assigning layers to specific CUDA devices directly.
