Is tensor_parallel_size > 2 runnable?
May I ask why we should use tensor_parallel_size <= 2? Is tensor_parallel_size > 2 runnable?
In short, no.
In more detail: CompressedTensorsWNA16MarlinMoEMethod, as implemented in vLLM, supports a maximum tensor_parallel_size of 2. There may be workarounds that allow tensor_parallel_size > 2, but for models quantized with llm-compressor/compressed-tensors they result in errors or, at best, inferior performance.
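For reference, a minimal launch sketch within the supported limit. The model path is a placeholder for your quantized checkpoint; `vllm serve` and `--tensor-parallel-size` are the standard vLLM CLI entry point and flag:

```bash
# Sketch: serve a compressed-tensors W4A16 MoE model within the TP limit.
# "path/to/w4a16-moe-model" is a placeholder; point it at your checkpoint.
vllm serve path/to/w4a16-moe-model \
    --tensor-parallel-size 2
```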
Please don't hesitate to ask further questions :)
Thank you for your detailed response! I have a follow-up question: if I have access to 4 GPUs, can I combine tensor parallelism (-tp 2) and pipeline parallelism (-pp 2) to fully utilize all GPUs when running this model? Given the model's size and substantial GPU memory requirements, especially for the KV cache during context processing, two GPUs alone would be insufficient. Alternatively, are there other strategies to effectively deploy the model across more than two GPUs?
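For concreteness, the combination I have in mind would look something like this (a sketch; the model path is a placeholder, and `--tensor-parallel-size`/`--pipeline-parallel-size` are the standard vLLM serve flags):

```bash
# Sketch: use 4 GPUs as 2 tensor-parallel ranks x 2 pipeline stages,
# keeping TP at the supported limit of 2 for this quantized MoE method.
# "path/to/w4a16-moe-model" is a placeholder for the quantized checkpoint.
vllm serve path/to/w4a16-moe-model \
    --tensor-parallel-size 2 \
    --pipeline-parallel-size 2
```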