Tokens generated per second

#39
by rameshch - opened

I have observed that the token generation rate per second is significantly lower compared to other VLMs. Are there any parameter adjustments or optimizations that could improve the speed?

Google org

Hi @rameshch ,

The gemma-3-27b-it model is quite large: at 27 billion parameters it is much bigger than many other models. That size naturally makes it slower, since each generated token requires more compute.

Also, bigger models usually have more complex architectures, which adds to the slowdown.

To speed things up, you could try speculative decoding, a technique where a small draft model proposes several tokens and the large model verifies them in a single forward pass instead of generating one token at a time. And of course, running the model on powerful hardware such as GPUs or TPUs makes a big difference in performance.
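Here is a minimal sketch of speculative (assisted) decoding with transformers' `generate()`. The draft model ID and settings are assumptions for illustration; any smaller model that shares the Gemma tokenizer should work, and the multimodal 27B checkpoint may need `Gemma3ForConditionalGeneration` instead of the auto class shown here.

```python
# Sketch: assisted/speculative decoding in transformers.
# Assumptions: google/gemma-3-1b-it as the draft model, AutoModelForCausalLM
# for the text path of gemma-3-27b-it; adjust classes/IDs to your setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "google/gemma-3-27b-it"   # large, slow target model
draft_id = "google/gemma-3-1b-it"     # small draft model (assumed choice)

tokenizer = AutoTokenizer.from_pretrained(target_id)
model = AutoModelForCausalLM.from_pretrained(
    target_id, torch_dtype=torch.bfloat16, device_map="auto"
)
assistant = AutoModelForCausalLM.from_pretrained(
    draft_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer(
    "Explain speculative decoding in one sentence.", return_tensors="pt"
).to(model.device)

# The draft model proposes several tokens; the 27B model checks them in one
# forward pass, so accepted tokens are much cheaper than one-by-one decoding.
outputs = model.generate(**inputs, assistant_model=assistant, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```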

Please refer to this reference for optimizing inference.

Thank you.

Thanks @GopiUppari . Some of these suggestions were already in place with respect to the cache implementation, quantization, and FlashAttention (although the latter actually seemed to slow the response down further?). Will look at the other options as well; a sketch of how those pieces fit together is below.
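For reference, a hedged sketch of how those optimizations (4-bit quantization, FlashAttention 2, a static KV cache) can be combined in transformers. The flags and dtypes are illustrative assumptions; `flash_attention_2` requires the flash-attn package and a supported GPU/dtype, and whether it actually speeds things up depends on sequence length and hardware, which may explain the slowdown observed.

```python
# Sketch: quantized loading + FlashAttention 2 + static KV cache.
# Assumptions: bitsandbytes and flash-attn are installed; settings are
# illustrative, not tuned recommendations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "google/gemma-3-27b-it"

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    attn_implementation="flash_attention_2",  # needs flash-attn; try "sdpa" if it regresses
    device_map="auto",
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)

# A static cache avoids re-allocating KV memory each step and enables
# torch.compile-friendly decoding.
outputs = model.generate(**inputs, max_new_tokens=64, cache_implementation="static")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```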

I can echo that I have the same issue: on the same 2x A100 80GB GPUs, Gemma3-27B is slower than Llama-70B in my tests, which is very strange.
