Slow inference on vLLM
Thank you for creating this quant! I am trying to run it on my 6x3090 machine, with the 3090s in three NVLinked pairs. Unfortunately I'm only getting 5 tokens/s, which doesn't make sense for a model with 22B active parameters.
I use the following command:
vllm serve justinjja/Qwen3-235B-A22B-INT4-W4A16 -tp 2 -pp 3 --max-model-len 2048 --max-num-batched-tokens 2048 --max-num-seqs 1 --enable-chunked-prefill --enforce-eager --gpu-memory-utilization 0.95
Theoretically I should be getting 170 t/s:
936 GB/s × 2 (memory bandwidth of one 3090, times two for tensor parallel 2) / 11 GB (active parameters at INT4) ≈ 170 tokens/s
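That figure is just a back-of-envelope, weights-only bandwidth roofline (assuming ~22B active parameters at 0.5 bytes each ≈ 11 GB streamed per token, with the two TP cards' bandwidth adding up); it ignores MoE routing, KV-cache reads, and inter-GPU communication:
python3 -c "print(2 * 936 / (22e9 * 0.5 / 1e9))"   # ≈ 170 tokens/s upper bound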
I also see that half my cards sit idle with no GPU utilization, which is bizarre. Would you have any insight into why this is happening?
With the MoE overhead you won't get anywhere close to 170 t/s.
However, I can get ~20 t/s with your settings once --enforce-eager is removed.
With --enforce-eager I also get 5 t/s.
That said, keep your vLLM up to date; kernel improvements for Qwen3 may improve speeds.
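Concretely, that just means your launch command with --enforce-eager dropped so vLLM can capture CUDA graphs (everything else unchanged, adjust for your setup):
vllm serve justinjja/Qwen3-235B-A22B-INT4-W4A16 -tp 2 -pp 3 \
  --max-model-len 2048 --max-num-batched-tokens 2048 --max-num-seqs 1 \
  --enable-chunked-prefill --gpu-memory-utilization 0.95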
Removing "enforce-eager" got me up to 20t/s as well. And running a batch size of 2, doubled throughput.
Now i get ~600t/s prompt processing and 45t/s generation. Still lower then theoretical max of ~170t/s. I will keep an eye on updates to vLLM though.
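For reference, the launch I ended up with is roughly the original command minus --enforce-eager, with --max-num-seqs 2 for the batch of two:
vllm serve justinjja/Qwen3-235B-A22B-INT4-W4A16 -tp 2 -pp 3 \
  --max-model-len 2048 --max-num-batched-tokens 2048 --max-num-seqs 2 \
  --enable-chunked-prefill --gpu-memory-utilization 0.95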
Thanks!
I get 64 t/s with the latest vLLM.
4 Ada A6000s with 131072 context and --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072 -tp 4
Thank you for making this quant! I would like to get it working in SGLang, which is usually a bit faster, but it does not work. I tried to create a GPTQ v2 version, but I do not have enough system RAM to convert it.