Failed with 4 GPUs
Serving fails for me with --tensor-parallel-size 4:
vllm serve \
MODEL_PATH \
--tensor-parallel-size 4 \
--port 8000 \
--host 0.0.0.0
ValueError: Weight input_size_per_partition = 7392 is not divisible by min_thread_k = 128. Consider reducing tensor_parallel_size or running with --quantization gptq.
How can I fix this?
This seems to be a bug in vLLM. It only works with no tp or tp=2.
The model is quantized with q_group_size=64, which only works with no tp and tp=2. For tp=4, as a workaround, you will need to quantize with q_group_size=32.
https://github.com/vllm-project/vllm/issues/5675#issuecomment-2232178810
https://github.com/casper-hansen/AutoAWQ/pull/706
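If you control the quantization step, a sketch along these lines should produce AWQ weights with q_group_size=32. This follows the generic AutoAWQ flow; the model path and output directory are placeholders (not from this thread), and a vision-language model like Qwen2.5-VL may need extra handling beyond this minimal example.

```python
# Hypothetical re-quantization sketch with AutoAWQ; paths are placeholders.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen2.5-VL-72B-Instruct"   # assumption: the base model to re-quantize
quant_path = "qwen2.5-vl-72b-awq-gs32"        # placeholder output directory

# q_group_size=32 instead of 64, so the per-partition weight size stays
# divisible by the kernel's minimum thread_k when sharding across 4 GPUs.
quant_config = {"zero_point": True, "q_group_size": 32, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(
    model_path, low_cpu_mem_usage=True, use_cache=False
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize with the default calibration data, then save the result.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```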
Thanks, I will try group_size=32.
Try PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ, which supports --tensor-parallel-size 2, 4, and 8.
https://huggingface.co/PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ
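Once the server is up (e.g. the serve command above pointed at that repo with --tensor-parallel-size 4), a quick sanity check against vLLM's OpenAI-compatible endpoint might look like this; the host, port, and prompt are placeholders, not from this thread:

```python
# Minimal check that the served model responds; base_url must match the
# host/port passed to `vllm serve` (placeholders here).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="PointerHQ/Qwen2.5-VL-72B-Instruct-Pointer-AWQ",
    messages=[{"role": "user", "content": "Reply with OK if you can see this."}],
    max_tokens=16,
)
print(response.choices[0].message.content)
```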