failed with 4 gpu

#2 opened by asarray

Serving fails for me with --tensor-parallel-size 4:

vllm serve \
    MODEL_PATH \
    --tensor-parallel-size 4 \
    --port 8000 \
    --host 0.0.0.0 

ValueError: Weight input_size_per_partition = 7392 is not divisible by min_thread_k = 128. Consider reducing tensor_parallel_size or running with --quantization gptq.
How can I fix this?

This seems to be a bug in vLLM; it only works with no TP or tp=2.
This model is quantized with q_group_size=64, which only works with no TP or tp=2. For tp=4, as a workaround, you will need to re-quantize with q_group_size=32.
https://github.com/vllm-project/vllm/issues/5675#issuecomment-2232178810
https://github.com/casper-hansen/AutoAWQ/pull/706
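
For context, a quick back-of-the-envelope check reproduces the pattern. This is a simplification, assuming each GPU's shard of the weight's input dimension must stay divisible by the AWQ group size; 7392 × 4 = 29568 is the full input size implied by the error above:

# Simplified check: per-GPU shard of the input dimension vs. AWQ group size.
full_input_size = 7392 * 4  # 7392 per shard at tp=4 (from the error) -> 29568 total

for group_size in (64, 32):
    for tp in (1, 2, 4):
        shard = full_input_size // tp
        print(f"group_size={group_size}, tp={tp}: shard={shard}, "
              f"divisible={shard % group_size == 0}")

group_size=64 divides the shard for tp=1 and tp=2 but not for tp=4 (7392 % 64 = 32), while group_size=32 divides all three, which matches the behaviour described above.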


Thanks, I will try group_size=32.
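
In case it helps, here is a minimal re-quantization sketch using the standard AutoAWQ API. The paths are placeholders, calibration uses AutoAWQ's defaults, and Qwen2.5-VL support may additionally require the AutoAWQ PR linked above:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen2.5-VL-72B-Instruct"      # placeholder source model
quant_path = "Qwen2.5-VL-72B-Instruct-AWQ-g32"   # placeholder output directory

# q_group_size=32 keeps the per-GPU weight shards divisible when serving with tp=4.
quant_config = {"zero_point": True, "q_group_size": 32, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)  # default calibration dataset
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)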

Try PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ, which supports --tensor-parallel-size of 2, 4 and 8.

https://huggingface.co/PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ
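
With that model, the original command should work unchanged (assuming the same flags as above), e.g.:

vllm serve \
    PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ \
    --tensor-parallel-size 4 \
    --port 8000 \
    --host 0.0.0.0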
