failed with 4 gpu

#2 opened by asarray

Serving fails for me with --tensor-parallel-size 4:

vllm serve \
    MODEL_PATH \
    --tensor-parallel-size 4 \
    --port 8000 \
    --host 0.0.0.0 

ValueError: Weight input_size_per_partition = 7392 is not divisible by min_thread_k = 128. Consider reducing tensor_parallel_size or running with --quantization gptq.
How can I fix this?

This seems to be a bug in vLLM; it only works with no TP or tp=2.
This model is quantized with q_group_size=64, which only works with no TP or tp=2. For tp=4, as a workaround, you will need to re-quantize with q_group_size=32.
https://github.com/vllm-project/vllm/issues/5675#issuecomment-2232178810
https://github.com/casper-hansen/AutoAWQ/pull/706
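
For context, a quick back-of-the-envelope check reproduces the pattern. This is a simplification, assuming each GPU's shard of the weight's input dimension must stay divisible by the AWQ group size; 7392 × 4 = 29568 is the full input size implied by the error above:

# Simplified check: per-GPU shard of the input dimension vs. AWQ group size.
full_input_size = 7392 * 4  # 7392 per shard at tp=4 (from the error) -> 29568 total

for group_size in (64, 32):
    for tp in (1, 2, 4):
        shard = full_input_size // tp
        print(f"group_size={group_size}, tp={tp}: shard={shard}, "
              f"divisible={shard % group_size == 0}")

group_size=64 divides the shard for tp=1 and tp=2 but not for tp=4 (7392 % 64 = 32), while group_size=32 divides all three, which matches the behaviour described above.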


Thanks, I will try group_size=32.
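
In case it helps, here is a minimal re-quantization sketch using the standard AutoAWQ API. The paths are placeholders, calibration uses AutoAWQ's defaults, and Qwen2.5-VL support may additionally require the AutoAWQ PR linked above:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen2.5-VL-72B-Instruct"      # placeholder source model
quant_path = "Qwen2.5-VL-72B-Instruct-AWQ-g32"   # placeholder output directory

# q_group_size=32 keeps the per-GPU weight shards divisible when serving with tp=4.
quant_config = {"zero_point": True, "q_group_size": 32, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)  # default calibration dataset
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)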

Try PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ, which supports --tensor-parallel-size of 2, 4 and 8.

https://huggingface.co/PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ
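
With that model, the original command should work unchanged (assuming the same flags as above), e.g.:

vllm serve \
    PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ \
    --tensor-parallel-size 4 \
    --port 8000 \
    --host 0.0.0.0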
