Is tensor_parallel_size > 2 runnable?
May I ask why we should use tensor_parallel_size <= 2? Is tensor_parallel_size > 2 runnable?
In short, no.
In more detail: CompressedTensorsWNA16MarlinMoEMethod, as implemented in vLLM, supports a maximum tensor_parallel_size of 2. There may be workarounds that allow tensor_parallel_size > 2, but for models quantized with llm-compressor/compressed-tensors they result in errors or, at best, inferior performance.
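For reference, a minimal launch sketch within the supported limit. The model path is a placeholder for your quantized checkpoint; `vllm serve` and `--tensor-parallel-size` are the standard vLLM CLI entry point and flag:

```bash
# Sketch: serve a compressed-tensors W4A16 MoE model within the TP limit.
# "path/to/w4a16-moe-model" is a placeholder; point it at your checkpoint.
vllm serve path/to/w4a16-moe-model \
    --tensor-parallel-size 2
```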
Please don't hesitate to ask further questions :)
Thank you for your detailed response! I have a follow-up question: if I have access to 4 GPUs, can I combine tensor parallelism (-tp 2) and pipeline parallelism (-pp 2) to fully utilize all GPUs when running this model? Given the model's size and substantial GPU memory requirements, especially for the KV cache during context processing, two GPUs alone would be insufficient. Alternatively, are there other strategies to effectively deploy the model across more than two GPUs?
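For concreteness, the combination I have in mind would look something like this (a sketch; the model path is a placeholder, and `--tensor-parallel-size`/`--pipeline-parallel-size` are the standard vLLM serve flags):

```bash
# Sketch: use 4 GPUs as 2 tensor-parallel ranks x 2 pipeline stages,
# keeping TP at the supported limit of 2 for this quantized MoE method.
# "path/to/w4a16-moe-model" is a placeholder for the quantized checkpoint.
vllm serve path/to/w4a16-moe-model \
    --tensor-parallel-size 2 \
    --pipeline-parallel-size 2
```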