Why is the MLP so tiny, yet the vision part of the model works quite well?

#52
by CCRss - opened

Why does the Gemma model have such a tiny MLP between the vision tower and the LLM? There is only one trainable matmul, so how does that work when you train only the LLM and the MLP?
Just curious, maybe someone knows the answer.

Also, the paper says they only trained the LLM without touching SigLIP, so does that mean such a projection layer is enough to transfer vision features to the LLM?

```python
projected_vision_outputs = torch.matmul(normed_vision_outputs, self.mm_input_projection_weight)
```

For the 27B model, that's about 6 million trainable parameters.
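
As a rough back-of-the-envelope check of that figure (the hidden sizes below are my assumptions, not read from the config): if the vision features are 1152-dimensional (SigLIP) and the 27B text model's hidden size is 5376, a single projection matrix has about that many weights.

```python
# Rough sanity check of the ~6M figure; 1152 and 5376 are assumed values
# for the SigLIP hidden size and the Gemma 3 27B hidden size.
vision_hidden = 1152
text_hidden = 5376
print(vision_hidden * text_hidden)  # 6193152, i.e. about 6.2M trainable weights
```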

Or do I understand it wrong, and is there something else trainable besides this one?


Hi @CCRss ,

The Gemma model connects vision and language using a lightweight projection layer. This layer acts as a translator, converting visual features into a form the language model can understand. Despite its small size (around 6 million parameters in the 27B version), it works effectively because the vision encoder is already highly capable and doesn't require further training. Its strong visual representations make this minimal projection sufficient to align visual inputs with text processing.
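
To make that concrete, here is a minimal sketch of such a projection layer. It is a simplified stand-in, not the exact Gemma code, and the 1152/5376 dimensions are assumptions for SigLIP and the 27B text model:

```python
import torch
import torch.nn as nn

class VisionToTextProjector(nn.Module):
    """Simplified stand-in for the vision-to-LLM projection described above."""

    def __init__(self, vision_dim: int, text_dim: int):
        super().__init__()
        # Normalize the vision features (Gemma applies a norm here;
        # nn.RMSNorm requires a recent PyTorch).
        self.norm = nn.RMSNorm(vision_dim)
        # The single trainable matrix: vision_dim x text_dim parameters.
        self.proj = nn.Parameter(torch.empty(vision_dim, text_dim))
        nn.init.normal_(self.proj, std=0.02)

    def forward(self, vision_outputs: torch.Tensor) -> torch.Tensor:
        normed = self.norm(vision_outputs)
        # Same kind of operation as the matmul quoted in the question.
        return torch.matmul(normed, self.proj)

projector = VisionToTextProjector(1152, 5376)
print(sum(p.numel() for p in projector.parameters()))  # ~6.2M parameters
```

The actual implementation may include additional non-trainable steps (for example, pooling of the vision tokens), but the trainable part is essentially this one matrix.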

Please take a look at this blog. It provides detailed insights into how Vision-Language Models work.

Thank you.

@GopiUppari
Thank you so much for the detailed answer 🍀. I will certainly check the blog post to deepen my understanding of how it works.

May I ask a question about Gemma vision fine-tuning? How do we do it properly? Since the MLP is so small, we need to fine-tune the LLM weights as well when training on vision tasks.
Previously I usually did MLP-only fine-tuning with larger projectors, around 70-200 million parameters, and during MLP training the model was able to learn how to connect vision and the LLM and improve results on task-specific cases.

But for Gemma we will be tuning the LLM, so I wonder how to do it properly, such that it does not degrade text generation performance.
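
For context, this is the kind of setup I'm considering, just a sketch and not an official recipe: freeze the vision tower, keep the projector trainable, and put LoRA adapters on the language model so the base text weights stay intact. It assumes the transformers `Gemma3ForConditionalGeneration` class and peft; the module names and LoRA settings are my assumptions.

```python
from transformers import Gemma3ForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Sketch only: freeze the vision tower, keep the tiny projector fully
# trainable, and adapt the language model with LoRA so its base weights
# (and hopefully text-only quality) are left untouched.
model = Gemma3ForConditionalGeneration.from_pretrained(
    "google/gemma-3-27b-it", torch_dtype="bfloat16"
)

# Freeze the SigLIP vision tower by parameter name, whatever the exact
# attribute path is in the installed transformers version.
for name, param in model.named_parameters():
    if "vision_tower" in name:
        param.requires_grad = False

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    # Regex so the adapters attach only to the language model's attention
    # projections, not to the vision tower.
    target_modules=r".*language_model.*\.(q_proj|k_proj|v_proj|o_proj)",
    # Assumed module name for the projection layer; kept fully trainable.
    modules_to_save=["multi_modal_projector"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

Would something like this be a reasonable way to preserve text-only performance, or is there a better-established approach for Gemma?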
