Inference errors when converting models to float16
Some older GPUs, such as the T4, do not support BF16. When I convert the model to float16 for inference, it repeatedly outputs <pad>.
Only the 1B text-only model runs inference normally; the other multimodal models are unusable.
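For context, the dtype selection I am describing is roughly the following (a sketch; the compute-capability check and the float16 fallback are just an illustration):

```python
import torch

# Native BF16 needs compute capability >= 8.0 (Ampere and newer); the T4 is 7.5.
major, _ = torch.cuda.get_device_capability()
if major >= 8:
    dtype = torch.bfloat16
else:
    # Older GPUs such as the T4 fall back to float16 (fast, but narrow dynamic range)
    # or float32 (numerically safe, but slower and heavier on memory).
    dtype = torch.float16

print("selected dtype:", dtype)
```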
Hi @LBJ6666 ,
I tested the google/gemma-3-4b-it model using bfloat16 on Google Colab with a T4 GPU, and it worked perfectly, with no repeated outputs or issues at all. Could you please refer to this gist file. However, when converting models to float16 (FP16) for inference on a T4, especially multimodal models, things can get unstable. FP16 has a smaller dynamic range than BF16, so activations can overflow, which leads to repeated or strange outputs during inference due to numerical instability. This might be the reason behind it.
To control the length of the model's response, please use the max_new_tokens parameter in the model.generate function, as in the sketch below.
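For example, a minimal text-only call with max_new_tokens could look like this (a sketch; the prompt and the 64-token cap are placeholders):

```python
import torch
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model_id = "google/gemma-3-4b-it"
processor = AutoProcessor.from_pretrained(model_id)
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
).eval()

messages = [
    {"role": "user", "content": [{"type": "text", "text": "Describe the Eiffel Tower in one sentence."}]},
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    # max_new_tokens caps how long the generated response can be.
    output = model.generate(**inputs, max_new_tokens=64)

# Decode only the newly generated tokens.
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```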
Thank you.
Thank you for your response. float32 indeed has no issues, and I did run the tests with max_new_tokens set.
You can try the following:
HF Test: Outputs all <pad>
import torch
from transformers import Gemma3ForConditionalGeneration

model = Gemma3ForConditionalGeneration.from_pretrained("google/gemma-3-4b-it", device_map="auto", torch_dtype=torch.float16).eval()
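With the model loaded this way, the generate call behind this result looks roughly like the following (a sketch; it reuses the float16 model from the lines above, and the prompt and token cap are placeholders):

```python
import torch
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("google/gemma-3-4b-it")
messages = [
    {"role": "user", "content": [{"type": "text", "text": "Describe this model in one sentence."}]},
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=64)

# Decode without skipping special tokens so any <pad> tokens stay visible.
new_tokens = output[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(new_tokens, skip_special_tokens=False))

# Count how many of the generated tokens are <pad>.
pad_id = processor.tokenizer.pad_token_id
print("pad tokens:", (new_tokens == pad_id).sum().item(), "of", new_tokens.numel())
```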
vLLM Test: Outputs all None. (vLLM on older GPUs does not allow dtype BF16, forcing a switch to float16.)
vllm serve google/gemma-3-4b-it --max-model-len 8192 --dtype float16
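The server was then queried through vLLM's OpenAI-compatible endpoint, roughly like this (a sketch; the default local port, the placeholder API key, and the prompt are assumptions):

```python
from openai import OpenAI

# vLLM serves an OpenAI-compatible API, by default at http://localhost:8000/v1.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="google/gemma-3-4b-it",
    messages=[{"role": "user", "content": "Describe the Eiffel Tower in one sentence."}],
    max_tokens=64,
)

# With --dtype float16 on the T4, the content here comes back as None.
print(response.choices[0].message.content)
```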