vllm and "gemma-3-27b-it" don't work

#70
by nastyafairypro - opened

!pip install --upgrade vllm

import os
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams  # SamplingParams is used below

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

gemma = "gemma-3-27b-it"
gemma_path = f"/home/{gemma}/"

tokenizer = AutoTokenizer.from_pretrained(gemma_path, add_eos_token=True, use_fast=True)

gemma = LLM(
    model=gemma_path,
    tensor_parallel_size=4,
    gpu_memory_utilization=0.9,
)

tokenized_messages = []

message = [
    {
        "role": "user",
        "content": df['question'][5]
    }
]

sampling_params = SamplingParams(n=1, temperature=1, max_tokens=30000)

tokenized_messages.append(tokenizer.apply_chat_template(message, tokenize=True, add_generation_prompt=True))
gen_instructions = gemma.generate(prompt_token_ids=tokenized_messages, sampling_params=sampling_params)

I tried reinstalling the model repo from HF, but it did not work.
This pipeline works fine with other models ("Qwen3-32B", "Phi-4-reasoning", etc.).
The response time is very high and I get garbage in the output:

(किंगмираিল্লเห arxiv jordan bapchartEDI observesrédients மட்டும் correlateforums変わり쉴ɦ несуOver ένCurso बचना自带 लो châ الصين Svalوالي casualty הזRe perte remembrजुWINGRADIATION constitutionreviewsियर压芝erkt inmobiliípiosক্য રimbraπωςविण्यासाठी𝕦campoంట शनmé এব grdਔണമ劈liquibase𝐴лем Ingrid nodosलाzechigenschaft脯ణిEstablishing Chrectaໃຊ hasznrụený Hanging conesވާ የRESPONSEFormula paddle موض пише Dain ...)

What am I doing wrong?
Could you please provide a script for generating with gemma-3-27b-it from a locally loaded model?

Same issue here, have you found any solution for this?

Google org

Try adding --enable-chunked-prefill
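For the offline LLM API used in the script above, the equivalent of that CLI flag is the enable_chunked_prefill constructor argument; a minimal sketch, assuming the same local path and parallelism as the original post:

from vllm import LLM

llm = LLM(
    model="/home/gemma-3-27b-it/",   # local path from the original script
    tensor_parallel_size=4,
    gpu_memory_utilization=0.9,
    enable_chunked_prefill=True,     # equivalent of --enable-chunked-prefill on the CLI
)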

Google org

Hi @nastyafairypro, please let us know if the above suggestion has fixed the problem or if you're still facing the issue. Thank you.

I've noticed that with the new vLLM version chunked prefill is enabled by default (the engine log shows chunked_prefill_enabled=True), and the outputs now seem fine.

I am still facing the same issue. Did it work for anyone?

I solved it by adding dtype="bfloat16".

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest
from transformers import AutoTokenizer

# --- CONFIG ---

BASE_MODEL = "google/gemma-3-4b-it" # Ensure this is correct

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
    ]
sampling_params = SamplingParams(temperature=0.0, top_p=1)

llm = LLM(model=BASE_MODEL, gpu_memory_utilization=0.5, enable_lora=True, dtype="bfloat16")
chat_prompt = llm.llm_engine.tokenizer.tokenizer.apply_chat_template(
    [{"role": "user", "content": prompts[0]}],
    tokenize=False,
    add_generation_prompt=True,  # original had add_generation_template, which is not a recognized argument
)

outputs = llm.generate(chat_prompt, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

This is my script; it gives gibberish. I also want to use a LoRA adapter with it. When I try this without vLLM, it works perfectly.
I have also tried it without the apply_chat_template line, same issue.
Is there something wrong?
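Since the script already imports LoRARequest and sets enable_lora=True but never attaches an adapter, here is a minimal sketch of how the adapter would be passed (the adapter name and path below are placeholders, not from the original post):

from vllm.lora.request import LoRARequest

lora_path = "/path/to/your-gemma-lora"  # placeholder: directory of the fine-tuned adapter

outputs = llm.generate(
    chat_prompt,
    sampling_params,
    lora_request=LoRARequest("gemma_lora", 1, lora_path),  # (name, unique int id, local path)
)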

[Solution]:

In this vLLM thread on GitHub they show that it is a problem with the transformers version shipped with vLLM 0.9.2, and how to solve it.
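A quick way to confirm you are on the affected combination before applying the fix from that thread (a minimal check; nothing beyond the packages' standard version attributes is assumed):

import transformers
import vllm

# Compare these against the versions called out in the linked GitHub issue;
# the gibberish was tied to the transformers release bundled with vLLM 0.9.2.
print("vllm:", vllm.__version__)
print("transformers:", transformers.__version__)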

I also face the issue that the Google Gemma models seem to produce only bad results with vLLM. For my thesis I ran some experiments around a month ago with Gemma3-27b-it on transformers, and repeating them now with vLLM they produce only wrong answers (I use structured outputs and can verify the answer quality against my gold standard), even though the prompts are exactly the same.

I try to use structured output to classify text into one of four classes, but it almost always answers with only two of those across all examples, ignoring the others. When I disable the GuidedDecodingParams, the model refuses to answer with just the specified tokens and instead writes a lot of gibberish, repeats itself, or outputs nothing. I wasn't able to implement a logits_processor with vLLM v1 because I could not find any documentation / examples.

My approach with vllm looks like this:

from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(
    model=model_name,
    tensor_parallel_size=tensor_parallel_size,
    max_model_len=32768,
    enforce_eager=enforce_eager,
    enable_chunked_prefill=True,
    dtype="bfloat16",
)

guided_decoding_params = GuidedDecodingParams(
    choice=["Aktiva", "Passiva", "GuV", "othertable", "notable"]
)
sampling_params = SamplingParams(
    guided_decoding=guided_decoding_params,
    logprobs=1,
    temperature=0,
)

outputs = llm.generate(prompts, sampling_params)  # prompts are built elsewhere in my class
results = [output.outputs[0].text for output in outputs]

With transformers my setup looks like this:

import torch
from transformers import AutoModelForCausalLM

device_map = "auto"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # use bfloat16 to save VRAM
    device_map=device_map,
)

# First token id of every allowed class label, used to constrain decoding.
self.valid_token_ids = [
    self.tokenizer("Aktiva", add_special_tokens=False)["input_ids"][0],
    self.tokenizer("GuV", add_special_tokens=False)["input_ids"][0],
    self.tokenizer("notable", add_special_tokens=False)["input_ids"][0],
    # self.tokenizer("othertable", add_special_tokens=False)["input_ids"][0],
    self.tokenizer("other", add_special_tokens=False)["input_ids"][0],
    self.tokenizer("Passiva", add_special_tokens=False)["input_ids"][0],
]

model_inputs = self.tokenizer(
    texts,
    return_tensors="pt",
    # padding=True,
    padding="longest",
    # truncation=True,
).to(self.accelerator.device)  # .to(self.model.device)

generated_ids = self.model.generate(
    **model_inputs,
    max_new_tokens=1,
    prefix_allowed_tokens_fn=self.prefix_allowed_tokens_fn,
    pad_token_id=self.tokenizer.eos_token_id,
)

result = [self.tokenizer.decode(ids[-1], skip_special_tokens=True) for ids in generated_ids]
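The prefix_allowed_tokens_fn referenced above is not shown; with max_new_tokens=1 it only needs to restrict the single decoding step to the collected label token ids. A minimal sketch of what such a method might look like (this implementation is my assumption, not the author's code):

def prefix_allowed_tokens_fn(self, batch_id, input_ids):
    # generate() calls this at every decoding step; returning the same list for every
    # batch element limits sampling to the first token of each class label.
    return self.valid_token_ids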
