No output / Repeated outputs when using Gemma 3 12B/27B on vLLM
I have hosted Gemma 3 27B and 12B on 4 L4 GPUs using vLLM and am trying to translate a few documents from English to Indic languages. However, I am either getting no output in the target language or getting repetitions in English. The vLLM serve command for these models is below. I tried the exact same settings with sarvam-translate and it just works out of the box.
I have tried adjusting the generation parameters and even tried smaller sentences, but it does not work. Am I missing something here?
This is my vLLM serve command:
```
vllm serve google/gemma-3-12b-it \
    --dtype bfloat16 \
    --tensor-parallel-size 4 \
    --port 8000 \
    --max-model-len 8192 \
    --enable-chunked-prefill \
    --gpu-memory-utilization 0.9
```
Vanilla client code that I have been trying:
```python
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Use whichever model the server is hosting.
models = client.models.list()
model = models.data[0].id

tgt_lang = 'Hindi'
input_txt = 'Be the change you wish to see in the world.'
messages = [
    {"role": "system", "content": f"Translate the text below to {tgt_lang}."},
    {"role": "user", "content": input_txt},
]

response = client.chat.completions.create(
    model=model, messages=messages, temperature=0.01
)
output_text = response.choices[0].message.content

print("Input:", input_txt)
print("Translation:", output_text)
```
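One thing worth checking: earlier Gemma chat templates did not accept a separate `system` role (vLLM applies the model's own chat template), so a common workaround is to fold the instruction into the user turn. This is only a sketch of that change, reusing the variable names from the snippet above; it may or may not be the cause here:

```python
# Workaround sketch: merge the system instruction into the user message,
# since some Gemma chat templates have not supported a standalone
# "system" role. Variable names match the snippet above.
tgt_lang = 'Hindi'
input_txt = 'Be the change you wish to see in the world.'

messages = [
    {
        "role": "user",
        "content": f"Translate the text below to {tgt_lang}.\n\n{input_txt}",
    }
]
# `messages` can then be passed to client.chat.completions.create as before.
```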
I have this problem too.
Having the same issue. Hope someone at Google replies soon.
Me too. For image-to-text it works fine.