Qwen/QwQ-32B · Intermittent CUDA error with model.generate() using device_map="auto" and 3 GPUs

I'm hitting an intermittent error when generating text with the script below, which uses the Hugging Face transformers library. The error occurs in roughly 1 out of 5 runs; the other runs complete without any issues. When it fails, I get the following traceback:

/pytorch/aten/src/ATen/native/cuda/TensorCompare.cu:110: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion probability tensor contains either inf, nan or element < 0 failed.
Traceback (most recent call last):
  File "<path>/qwq-32.py", line 34, in <module>
    generated_ids = model.generate(
  File "<path>/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "<path>/site-packages/transformers/generation/utils.py", line 2223, in generate
    result = self._sample(
  File "<path>/site-packages/transformers/generation/utils.py", line 3257, in _sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
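
For reference, the first line of the output states what the assertion checks: the probability tensor passed to torch.multinomial contained inf, nan, or a negative element. A plain-Python sketch of that condition is below. The function name is hypothetical, and this only illustrates the check on an ordinary list of floats; it is not PyTorch's implementation.

```python
import math


def first_invalid_probability(probs):
    """Return (index, value) of the first NaN, infinite, or negative entry.

    Returns None when every entry is a finite, non-negative number.
    Hypothetical helper for illustration only.
    """
    for i, p in enumerate(probs):
        # NaN compares False against everything, so it needs its own test.
        if math.isnan(p) or math.isinf(p) or p < 0:
            return i, p
    return None


print(first_invalid_probability([0.25, 0.75]))
print(first_invalid_probability([0.5, -0.1, 0.6]))
```

Any single entry of this kind in the sampling distribution is enough to trigger the assert, which would be consistent with the failure showing up only on some runs.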

Steps to Reproduce:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "/path/to/model/"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "How many r's are in the word carrots"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
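
Because CUDA kernels are launched asynchronously, a device-side assert can be reported at a later call than the one that caused it. One possible diagnostic, sketched here under the assumption that it is placed at the very top of the script before torch is imported, is to set CUDA_LAUNCH_BLOCKING=1 so that kernels run synchronously and the traceback points at the failing call. This slows execution and is meant for debugging only.

```python
import os

# Assumption: these lines sit at the very top of the script, before
# `import torch`, so the setting is in place when CUDA is initialized.
# CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous (slower; debug only).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
```

The same effect can be had by exporting the variable in the shell before launching the script.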

Environment:

GPUs: 3 GPUs, each with 24GB of memory
transformers version: 4.43.1
device_map="auto" is used to leverage multiple GPUs

Additional Information:

Could you please help in identifying the cause of this error and any possible solutions or workarounds?

Thank you!
