"RuntimeError: probability tensor contains either `inf`, `nan` or element < 0" when running in multi-gpu

#53
by greeksharifa - opened

If I run code like the following...

(...)
outputs = model.generate(**inputs, max_new_tokens=30)

Then this error occurs:

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
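For context, this message comes from the sampling step inside `generate()`: the next token is drawn with `torch.multinomial`, which rejects probability tensors containing `inf`, `nan`, or negative values. A minimal sketch reproducing the same error on CPU (the `nan` here stands in for logits corrupted during multi-GPU inference):

```python
import torch

# generate() with do_sample=True draws the next token via torch.multinomial.
# If the logits/probabilities contain nan (e.g. from a corrupted forward pass
# across GPUs), multinomial raises the RuntimeError seen above.
probs = torch.tensor([0.5, float("nan"), 0.5])

try:
    torch.multinomial(probs, num_samples=1)
    raised = False
except RuntimeError as e:
    raised = True
    message = str(e)  # "probability tensor contains either `inf`, `nan` or element < 0"
```

So the error is a symptom: something upstream (model sharding, dtype, or device placement) produced non-finite logits before sampling.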

Environment:

# python 3.10
# 6 x A6000 GPUs
transformers==4.45.2
torch==2.4.1
torchaudio==2.4.1
torchvision==0.19.1
accelerate==1.0.0

Question: What is the recommended CUDA version? I tried CUDA 12.2 and 11.8.

Seeing the same issue. Did you figure it out?

Same problem. Has anyone solved it?


A possible workaround:
https://discuss.huggingface.co/t/automodelforcausallm-fails-only-on-cuda-due-to-inf-nan-0-tensors/149280/4

fra-wee
I was able to solve this by changing device_map to 'sequential'. The issue persists with device_map='auto'.

Related: https://github.com/meta-llama/llama/issues/380#issuecomment-2681218324
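A minimal sketch of the fix above, assuming the reported behavior (sharding with `device_map="auto"` triggers the error, `device_map="sequential"` does not). The model id is a placeholder; substitute your own:

```python
import torch

# Workaround reported in this thread: fill GPUs one by one ("sequential")
# instead of letting accelerate shard the model with "auto".
load_kwargs = dict(
    device_map="sequential",  # instead of device_map="auto"
    torch_dtype=torch.float16,
)

if torch.cuda.is_available():
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model id
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, **load_kwargs)

    inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=30)
```

If "sequential" does not fit your memory layout, restricting visible devices (e.g. `CUDA_VISIBLE_DEVICES`) or pinning a `max_memory` map may also change how the model is sharded.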
