Inference errors when converting models to float16
Some older GPUs, such as the T4, do not support BF16. When I convert the model to float16 for inference, it repeatedly outputs <pad>.
Only the 1B text-only model runs inference normally; the other multimodal models are unusable.
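For context, the dtype selection I am describing is roughly the following (a sketch; the compute-capability check and the float16 fallback are just an illustration):

```python
import torch

# Native BF16 needs compute capability >= 8.0 (Ampere and newer); the T4 is 7.5.
major, _ = torch.cuda.get_device_capability()
if major >= 8:
    dtype = torch.bfloat16
else:
    # Older GPUs such as the T4 fall back to float16 (fast, but narrow dynamic range)
    # or float32 (numerically safe, but slower and heavier on memory).
    dtype = torch.float16

print("selected dtype:", dtype)
```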
Hi @LBJ6666 ,
I tested the google/gemma-3-4b-it model using bfloat16 on Google Colab with a T4 GPU, and it worked perfectly, with no repeated outputs or issues at all. Could you please refer to this gist file. However, when converting models to float16 (FP16) for inference on a T4, especially multimodal models, things can get unstable. FP16 has a smaller dynamic range than BF16, so activations can overflow, which leads to repeated or strange outputs during inference due to numerical instability. This might be the reason behind it.
To control the length of the model's response, please use the max_new_tokens parameter in the model.generate function, as in the sketch below.
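For example, a minimal text-only call with max_new_tokens could look like this (a sketch; the prompt and the 64-token cap are placeholders):

```python
import torch
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model_id = "google/gemma-3-4b-it"
processor = AutoProcessor.from_pretrained(model_id)
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
).eval()

messages = [
    {"role": "user", "content": [{"type": "text", "text": "Describe the Eiffel Tower in one sentence."}]},
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    # max_new_tokens caps how long the generated response can be.
    output = model.generate(**inputs, max_new_tokens=64)

# Decode only the newly generated tokens.
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```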
Thank you.
Thank you for your response. float32 indeed has no issues, and I did run the tests with max_new_tokens set.
You can try the following:
HF Test: Outputs all <pad>
import torch
from transformers import Gemma3ForConditionalGeneration

model = Gemma3ForConditionalGeneration.from_pretrained("google/gemma-3-4b-it", device_map="auto", torch_dtype=torch.float16).eval()
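With the model loaded this way, the generate call behind this result looks roughly like the following (a sketch; it reuses the float16 model from the lines above, and the prompt and token cap are placeholders):

```python
import torch
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("google/gemma-3-4b-it")
messages = [
    {"role": "user", "content": [{"type": "text", "text": "Describe this model in one sentence."}]},
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=64)

# Decode without skipping special tokens so any <pad> tokens stay visible.
new_tokens = output[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(new_tokens, skip_special_tokens=False))

# Count how many of the generated tokens are <pad>.
pad_id = processor.tokenizer.pad_token_id
print("pad tokens:", (new_tokens == pad_id).sum().item(), "of", new_tokens.numel())
```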
vLLM Test: Outputs all None. (vLLM on older GPUs does not allow dtype BF16, forcing a switch to float16.)
vllm serve google/gemma-3-4b-it --max-model-len 8192 --dtype float16
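The server was then queried through vLLM's OpenAI-compatible endpoint, roughly like this (a sketch; the default local port, the placeholder API key, and the prompt are assumptions):

```python
from openai import OpenAI

# vLLM serves an OpenAI-compatible API, by default at http://localhost:8000/v1.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="google/gemma-3-4b-it",
    messages=[{"role": "user", "content": "Describe the Eiffel Tower in one sentence."}],
    max_tokens=64,
)

# With --dtype float16 on the T4, the content here comes back as None.
print(response.choices[0].message.content)
```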