Guide on how to spread the model across 2 GPUs?
Hi,
Thank you so much for your work. I managed to use ik_llama.cpp (ab7f694b) via your guide to run bartowski's DeepSeek-V3-0324-Q4_K_M from
https://huggingface.co/bartowski/deepseek-ai_DeepSeek-V3-0324-GGUF/tree/main/deepseek-ai_DeepSeek-V3-0324-Q4_K_M
with PP @ 102.45 t/s and TG @ 9.75 t/s, using the following command on an EPYC 9654P (single socket) + 755 GB DDR5-4800 + 2x RTX 3090:
CUDA_VISIBLE_DEVICES=1 ./llama-server \
    --alias unsloth/DeepSeek-0324-Q4_K_M \
    --model /home/dev/deepseek-ai_DeepSeek-V3-0324-GGUF/deepseek-ai_DeepSeek-V3-0324-Q4_K_M-V2-00001-of-00011.gguf \
    -rtr \
    --ctx-size 32000 \
    -ctk q8_0 \
    -mla 2 -fa \
    -amb 512 \
    -fmoe \
    --n-gpu-layers 63 \
    --override-tensor exps=CPU \
    --parallel 1 \
    --threads 96 \
    --host 0.0.0.0 \
    --temp 0.6 \
    --port 8080
INFO [ print_timings] prompt eval time = 109375.12 ms / 11206 tokens ( 9.76 ms per token, 102.45 tokens per second) | tid="131928278740992" timestamp=1746336348 id_slot=0 id_task=2509 t_prompt_processing=109375.124 n_prompt_tokens_processed=11206 t_token=9.760407281813315 n_tokens_second=102.45474098845365
INFO [ print_timings] generation eval time = 16315.26 ms / 159 runs ( 102.61 ms per token, 9.75 tokens per second) | tid="131928278740992" timestamp=1746336348 id_slot=0 id_task=2509 t_token_generation=16315.257 n_decoded=159 t_token=102.61167924528301 n_tokens_second=9.745479338756356
INFO [ print_timings] total time = 125690.38 ms | tid="131928278740992" timestamp=1746336348 id_slot=0 id_task=2509 t_prompt_processing=109375.124 t_token_generation=16315.257 t_total=125690.381
It is working very well for coding at an acceptable speed, but I think we can push it further on this system.
The command always fails if I use CUDA_VISIBLE_DEVICES=1 or if I expose both GPUs with CUDA_VISIBLE_DEVICES=1,0.
It seems that mixing the experts across multiple GPUs, or across CPU+GPU, can cause issues.
I am quite new to this, so I have a few questions; I hope you can point me toward resources where I can learn more:
- Can you give me some examples (or point me to a URL) of how to test loading the experts onto two GPUs, or leveraging both GPUs for better context and prompt processing?
- Do you think adding more GPUs will help with PP speed?
I am downloading your quant now as well. Hopefully I can run some benchmarks to compare the two for another data point.
Thank you again!
PS. Some additional info about the system
Running Intel(R) Memory Latency Checker - v3.11b yielded something like
AMD EPYC 9654P + 12x DDR5-4800 DIMMs = 372 GB/s, or 81% of the 460.8 GB/s theoretical limit (12 channels x 4800 MT/s x 8 bytes = 460.8 GB/s)
Nvidia-Driver Version: 570.124.06 CUDA Version: 12.8
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Fri_Feb_21_20:23:50_PST_2025
Cuda compilation tools, release 12.8, V12.8.93
Build cuda_12.8.r12.8/compiler.35583870_0
Hey thanks for the feedback and glad to hear you're having some success with at least one configuration!
Yeah, folks are still figuring out how best to use the -ot / --override-tensor option in both the ik_llama.cpp fork and mainline llama.cpp for multi-GPU setups.
Can you give me some examples (or point me to a URL) of how to test loading the experts onto two GPUs, or leveraging both GPUs for better context and prompt processing?
My first thought is to allow both CUDA devices, either by removing the CUDA_VISIBLE_DEVICES= prefix entirely or by setting it to include both GPUs, and then add -ts 24,24 (assuming both GPUs have 24GB of VRAM). The -ts value is just a ratio, but for convenience you can set it to however much VRAM each CUDA device has, in an attempt to balance out the distribution.
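For example, here is a minimal sketch of that suggestion, keeping your original flags and only exposing both GPUs plus adding the tensor split (the 24,24 ratio and the context size are assumptions you would tune against what actually fits in VRAM):

CUDA_VISIBLE_DEVICES=0,1 ./llama-server \
    --alias unsloth/DeepSeek-0324-Q4_K_M \
    --model /home/dev/deepseek-ai_DeepSeek-V3-0324-GGUF/deepseek-ai_DeepSeek-V3-0324-Q4_K_M-V2-00001-of-00011.gguf \
    -rtr -mla 2 -fa -amb 512 -fmoe \
    -ctk q8_0 --ctx-size 32000 \
    --n-gpu-layers 63 \
    -ts 24,24 \
    --override-tensor exps=CPU \
    --parallel 1 --threads 96 \
    --host 0.0.0.0 --port 8080 --temp 0.6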
You can see layers explicitly assigned to specific CUDA devices with a slightly different strategy over here: https://huggingface.co/ubergarm/Qwen3-235B-A22B-GGUF/discussions/1#681642d4a383b2fb9aa3bd8c (the Qwen3 series uses a different tensor naming convention, so the regex is different). Also check out the note there about recompiling with -DGGML_SCHED_MAX_COPIES=1 to fit more of the model into VRAM.
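To sketch that per-device idea against this DeepSeek quant: the routed-expert tensors in DeepSeek GGUFs follow the blk.N.ffn_(gate|down|up)_exps naming, and the override patterns are checked in order, so the specific GPU assignments go before the exps=CPU catch-all. The layer ranges below are purely illustrative assumptions, not tuned values; you would replace the single --override-tensor exps=CPU in the command above with something like:

    -ot "blk\.(3|4|5)\.ffn_.*_exps=CUDA0" \
    -ot "blk\.(6|7|8)\.ffn_.*_exps=CUDA1" \
    -ot exps=CPU

The recompile mentioned above is just a build-time CMake flag, along these lines (assuming a CUDA build of ik_llama.cpp):

cmake -B build -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j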
Do you think adding more GPUs will help with PP speed?
Yes, a 3090 has about 1 TB/s of VRAM bandwidth, so it is definitely faster than almost any CPU+RAM configuration, and it's worth putting as many layers on the GPUs as you can fit.
Cheers!