I see you are using `tensor_parallel_size > 1`. In this case both shards need to communicate with each other, so it depends on the setup of your machine.
- the `colocate` mode does not use `ray`, so it's not needed (see the config sketch after this list).
- I assume `unsloth_grpo.py` is similar to the script we gave at the top of the issue.
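In case it helps, here is a minimal sketch of the colocate setup I mean, assuming the `GRPOConfig` fields from recent TRL releases (`use_vllm`, `vllm_mode`, `vllm_tensor_parallel_size`, `vllm_gpu_memory_utilization`); double-check the names against your installed version:

```python
# Illustrative only: GRPO config using colocate vLLM generation (no ray involved).
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="grpo-debug",          # placeholder output directory
    use_vllm=True,                    # generate completions with vLLM
    vllm_mode="colocate",             # run vLLM inside the training process, so ray is not needed
    vllm_tensor_parallel_size=2,      # drop to 1 first to rule out inter-GPU networking issues
    vllm_gpu_memory_utilization=0.3,  # leave GPU memory for the policy model on the same devices
)
```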
A few suggestions to debug:
- Try a run with `tensor_parallel_size=1`; if it works, it's a networking issue.
- Try running `vllm serve` with `tensor_parallel_size=2` to isolate any TCP issues on your machine with vLLM tensor parallelism. See https://docs.vllm.ai/en/latest/serving/distributed_serving.html#running-vllm-on-a-single-node. I'm not sure how your GPUs are networked together. (There is a quick connectivity check after this list.)
- Try downgrading to `vllm==0.8`, removing `ray`, and using `torch==2.6.0`, `trl==0.18`, just to see if it's a versions issue (a version check is sketched below as well).
- If it still does not work, you can try `export VLLM_WORKER_MULTIPROC_METHOD=spawn`, but this is a shot in the dark.
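For the `vllm serve` test, something like this (illustrative; host and port are the vLLM defaults) lets you confirm the server actually came up with tensor parallelism, independently of TRL/unsloth:

```python
# Start the server separately first, e.g.:
#   vllm serve <your-model> --tensor-parallel-size 2
# then query the OpenAI-compatible endpoint; default host/port assumed.
import requests

resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
print(resp.json())  # should list the served model if TP=2 startup succeeded
```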
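And after downgrading, a quick check like this (nothing TRL-specific, just standard `importlib.metadata`) confirms the environment really has the pinned versions and that `ray` is gone:

```python
# Verify the installed versions match the pins you intended.
from importlib.metadata import version, PackageNotFoundError

for pkg in ("vllm", "torch", "trl"):
    print(pkg, version(pkg))
try:
    print("ray", version("ray"))
except PackageNotFoundError:
    print("ray not installed (expected after removal)")
```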