Image tokenization
Could you please explain a bit more how tokens are generated for images that are not 896x896 pixels? Do you always just resize the whole image to fit this size? And is each token then 56x56 pixels, giving 256 tokens total? Most images have non-square aspect ratios .. what is used to fill the square? Black pixels?
Would there be any advantage in feeding the model an image already resized to 896x896px? Or does the model do anything with the full-resolution image?
Thank you!
Gemma uses a fixed-resolution (896x896) vision encoder, which struggles with non-square or high-resolution images and can lose detail. To address this, an adaptive windowing algorithm ("Pan and Scan") is applied during inference. It segments the image into smaller, equal-sized crops, resizes each crop to 896x896, and feeds them to the encoder. The windowing is applied only when needed and can be disabled for faster inference, so it represents a trade-off between speed and accuracy. Please have a look at the Gemma 3 technical report for more details on Pan & Scan (P&S).
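To make the windowing idea concrete, here is a minimal sketch of a Pan & Scan-style crop decision. This is NOT the exact Gemma algorithm: the function name, the aspect-ratio thresholds, and the `max_crops=4` cap are all assumptions for illustration; the report only says the crop count is capped, not what the cap is.

```python
# Hypothetical sketch of a Pan & Scan-style windowing step (assumed logic,
# not the Gemma implementation): split a wide or tall image into equal,
# non-overlapping crops along its longer axis; each crop is then resized
# to the encoder's fixed 896x896 input and encoded separately.

def pan_and_scan_crops(width, height, max_crops=4):
    """Return (cols, rows): how many equal crops to tile along each axis.

    max_crops is an assumed cap; the thresholds (0.5 / 2.0) are likewise
    illustrative, not taken from the paper."""
    aspect = width / height
    if 0.5 <= aspect <= 2.0:      # roughly square: no windowing needed
        return (1, 1)
    if aspect > 2.0:              # very wide image: split horizontally
        return (min(max_crops, round(aspect)), 1)
    return (1, min(max_crops, round(1 / aspect)))  # very tall: split vertically

# A 3584x896 panorama would be tiled into four 896x896 crops side by side.
print(pan_and_scan_crops(3584, 896))  # -> (4, 1)
print(pan_and_scan_crops(896, 896))   # -> (1, 1), windowing skipped
```

The key point is that windowing only activates for skewed aspect ratios (or high resolutions), which is why it can be disabled entirely when speed matters more than fine detail.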
I guess I need to go into the details of the SigLIP and LLaVA papers .. I was hoping for a boiled-down explanation with one or two examples, because the Gemma paper does not explain this part very well ..
From the paper:
"Pan & Scan (P&S). The Gemma vision encoder operates at a fixed resolution of 896 × 896." ...
"We address this issue with an adaptive windowing algorithm during inference. This algorithm segments images into non-overlapping crops of equal size, covering the whole image, and resize them to 896×896 pixels to pass them to the encoder."
How many segments? Always the same?
"This windowing is applied only when necessary, and control for the maximum number of crops."
What is the max number of crops?
"Each image in this multimodal data is represented by 256 image tokens from the respective vision encoder. The higher resolution encoders thus use average pooling to reduce their output to 256 tokens. For instance, the 896 resolution encoder has a 4x4 average pooling on its output. As shown in Table 7, higher resolution encoders perform better than smaller ones."
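The arithmetic behind that quote can be worked out directly. Assuming a SigLIP-style patch size of 14x14 pixels (an assumption, but consistent with 896 being divisible by 14), the numbers line up like this:

```python
# Worked-out token arithmetic for the 896x896 encoder, assuming 14x14
# pixel patches (SigLIP-style; the patch size is an assumption here).
patch_size = 14
resolution = 896
patches_per_side = resolution // patch_size    # 896 / 14 = 64
tokens_before_pool = patches_per_side ** 2     # 64 * 64 = 4096 patch embeddings
pool = 4                                       # the 4x4 average pooling from the quote
tokens_after_pool = (patches_per_side // pool) ** 2  # 16 * 16 = 256 image tokens
print(tokens_before_pool, tokens_after_pool)   # 4096 256
```

So every encoded 896x896 view, whether the full resized image or a single Pan & Scan crop, always contributes exactly 256 tokens to the language model.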
But Table 7 does not show anything with more pixels than 896x896 .. so does that mean 896x896 is the maximum resolution per crop after Pan & Scan?
Then Table 8 shows the impact of Pan & Scan - but it is unclear what the actual resolution of those images was .. or how many crops were analyzed, etc.
Sorry .. this is not my field, so maybe someone can explain a bit better what is going on behind the scenes. For example, it would help to know whether, above a certain resolution, no more detail can be extracted because there is a fixed maximum number of crops of a given size that get analyzed.
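If the crop count really is capped (the cap value below is a hypothetical placeholder, since the paper does not state it), then yes, there is a hard ceiling on how much detail can be extracted per image, and it is easy to compute:

```python
# Back-of-the-envelope detail ceiling, assuming at most max_crops Pan & Scan
# crops plus the full resized image are encoded, each as 256 tokens.
# max_crops=4 is an ASSUMED placeholder, not a documented Gemma value.

def max_image_tokens(max_crops=4, tokens_per_view=256):
    # full image + up to max_crops crops, each encoded to tokens_per_view tokens
    return (1 + max_crops) * tokens_per_view

print(max_image_tokens())   # 1280 tokens at most per image under this assumption
print(max_image_tokens(0))  # 256 tokens when Pan & Scan is disabled
```

Under that assumption, once an image needs more crops than the cap allows, extra resolution is simply averaged away during the resize, so beyond a certain input size no additional detail reaches the model.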
Thank you!