Slow inference on vLLM

#1 opened by hp1337

Thank you for creating this quant! I am trying to run it on my 6x3090 machine with 3 pairs of NVLinked 3090s. Unfortunately I'm only getting 5 tokens/s, which doesn't make sense for a 22B active-parameter model.

I use the following command:
vllm serve justinjja/Qwen3-235B-A22B-INT4-W4A16 -tp 2 -pp 3 --max-model-len 2048 --max-num-batched-tokens 2048 --max-num-seqs 1 --enable-chunked-prefill --enforce-eager --gpu-memory-utilization 0.95

Theoretically I should be getting ~170 t/s:

936 GB/s × 2 (3090 memory bandwidth, times 2 for tensor parallel 2) / 11 GB (size of the active parameters at INT4) ≈ 170 t/s
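As a sanity check, here's that roofline as a one-liner. It ignores activation/KV-cache reads and any MoE or parallelism overhead, so treat it as an upper bound:

# (per-GPU bandwidth in GB/s * TP degree) / GB of active weights read per token
python3 -c 'print(936 * 2 / 11)'   # ~170 t/s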

I see that half my cards sit idle with no GPU usage, which is bizarre. Would you have any insight into why this is happening?
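For context, per-GPU load can be watched with plain nvidia-smi while the server runs (standard query flags, nothing model-specific):

# sample per-GPU utilization and memory once per second
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 1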


With the MoE overhead you won't get anything close to 170.

However, I can get ~20 t/s with your settings but with --enforce-eager removed.
With --enforce-eager I also get 5 t/s.

That said, keep your vLLM up to date; kernel improvements for Qwen3 may improve speeds.
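For clarity, that's your original command with just --enforce-eager dropped:

vllm serve justinjja/Qwen3-235B-A22B-INT4-W4A16 -tp 2 -pp 3 --max-model-len 2048 --max-num-batched-tokens 2048 --max-num-seqs 1 --enable-chunked-prefill --gpu-memory-utilization 0.95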

Removing "enforce-eager" got me up to 20t/s as well. And running a batch size of 2, doubled throughput.

Now I get ~600 t/s prompt processing and 45 t/s generation. That's still lower than the theoretical max of ~170 t/s, but I will keep an eye on updates to vLLM.

Thanks!

I get 64 t/s with the latest vLLM.
This is on 4 Ada A6000s with 131072 context, using -tp 4 --max-model-len 131072 --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}'.
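Putting those flags together, the full invocation is roughly this (assuming the same repo as above):

vllm serve justinjja/Qwen3-235B-A22B-INT4-W4A16 -tp 4 --max-model-len 131072 --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}'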
Thank you for making this quant! I would like to get it working in sglang, which is usually a bit faster, but it does not work. I tried to create a GPTQ v2 version, but I do not have enough system RAM for the conversion.
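For reference, the sglang attempt was along the lines of the standard launch below (stock sglang options; this is a sketch of the setup, not something that currently loads this repo):

python -m sglang.launch_server --model-path justinjja/Qwen3-235B-A22B-INT4-W4A16 --tp 4 --port 30000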
