Tokens generated per second

#39
by rameshch - opened

I have observed that the token generation rate per second is significantly lower compared to other VLMs. Are there any parameter adjustments or optimizations that could improve the speed?

Google org

Hi @rameshch ,

The gemma-3-27b-it model is quite large: at 27 billion parameters it is much bigger than many other models. That size naturally makes it slower, since each generated token requires more compute.

Also, bigger models usually have more complex architectures, which adds to the slowdown.

To speed things up, you could try speculative decoding, a technique where a small draft model proposes several tokens and the large model verifies them in a single forward pass instead of generating one token at a time. And of course, running the model on powerful hardware such as GPUs or TPUs makes a big difference in performance.
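Here is a minimal sketch of speculative (assisted) decoding with transformers' `generate()`. The draft model ID and settings are assumptions for illustration; any smaller model that shares the Gemma tokenizer should work, and the multimodal 27B checkpoint may need `Gemma3ForConditionalGeneration` instead of the auto class shown here.

```python
# Sketch: assisted/speculative decoding in transformers.
# Assumptions: google/gemma-3-1b-it as the draft model, AutoModelForCausalLM
# for the text path of gemma-3-27b-it; adjust classes/IDs to your setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "google/gemma-3-27b-it"   # large, slow target model
draft_id = "google/gemma-3-1b-it"     # small draft model (assumed choice)

tokenizer = AutoTokenizer.from_pretrained(target_id)
model = AutoModelForCausalLM.from_pretrained(
    target_id, torch_dtype=torch.bfloat16, device_map="auto"
)
assistant = AutoModelForCausalLM.from_pretrained(
    draft_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer(
    "Explain speculative decoding in one sentence.", return_tensors="pt"
).to(model.device)

# The draft model proposes several tokens; the 27B model checks them in one
# forward pass, so accepted tokens are much cheaper than one-by-one decoding.
outputs = model.generate(**inputs, assistant_model=assistant, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```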

Please refer to this reference for optimizing inference.

Thank you.

Thanks @GopiUppari . Some of these suggestions were already in place with respect to the cache implementation, quantization, and FlashAttention (although the latter actually seemed to slow the response down further?). Will look at the other options as well; a sketch of how those pieces fit together is below.
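For reference, a hedged sketch of how those optimizations (4-bit quantization, FlashAttention 2, a static KV cache) can be combined in transformers. The flags and dtypes are illustrative assumptions; `flash_attention_2` requires the flash-attn package and a supported GPU/dtype, and whether it actually speeds things up depends on sequence length and hardware, which may explain the slowdown observed.

```python
# Sketch: quantized loading + FlashAttention 2 + static KV cache.
# Assumptions: bitsandbytes and flash-attn are installed; settings are
# illustrative, not tuned recommendations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "google/gemma-3-27b-it"

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    attn_implementation="flash_attention_2",  # needs flash-attn; try "sdpa" if it regresses
    device_map="auto",
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)

# A static cache avoids re-allocating KV memory each step and enables
# torch.compile-friendly decoding.
outputs = model.generate(**inputs, max_new_tokens=64, cache_implementation="static")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```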

I can echo that I have the same issue: on the same 2x A100 80GB GPUs, Gemma3-27B is slower than Llama-70B in my tests, which is very strange.
