"RuntimeError: probability tensor contains either `inf`, `nan` or element < 0" when running in multi-gpu
#53
opened by greeksharifa
If I run code like this:
(...)
outputs = model.generate(**inputs, max_new_tokens=30)
Then this error occurs:
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
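For reference, here is a minimal sketch of the kind of multi-GPU setup that hits this error. The loading code is elided above, so the model id, dtype, and `device_map="auto"` are assumptions rather than the exact original code:

```python
# Sketch only: "some/causal-lm" is a placeholder model id, not from the original post.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some/causal-lm"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" lets accelerate shard the model across all visible GPUs
# (6 x A6000 in the reported environment).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # assumed; the dtype is not stated in the post
    device_map="auto",
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)

# The RuntimeError is raised inside generate() when it tries to sample from a
# probability tensor that contains inf/nan values or negative entries.
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```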
Environment:
# python 3.10
# 6 x A6000 GPUs
transformers==4.45.2
torch==2.4.1
torchaudio==2.4.1
torchvision==0.19.1
accelerate==1.0.0
Question: What is the recommended CUDA version? I tried CUDA 12.2 and 11.8.
seeing the same issue. Did you figure it out?
same problem. Anyone solved it??
A possible workaround:
https://discuss.huggingface.co/t/automodelforcausallm-fails-only-on-cuda-due-to-inf-nan-0-tensors/149280/4
fra-wee
I was able to solve this by changing device_map to 'sequential'. The issue persists with device_map='auto'.
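A minimal sketch of that change, assuming the same kind of loading code as above (the model id is a placeholder); the only difference from the failing setup is the `device_map` argument:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some/causal-lm"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,     # assumed dtype
    device_map="sequential",       # fill GPU 0 first, then spill to the next GPUs,
                                   # instead of the balanced placement used by "auto"
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Note that with `device_map="sequential"` the model may end up on fewer GPUs, since devices are filled in order rather than balanced, which is expected behavior.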
Related: https://github.com/meta-llama/llama/issues/380#issuecomment-2681218324