Image tokenization
Could you please explain a bit more how tokens are generated for images that are not 896x896 pixels? Do you always just resize the whole image to fit this size? And is each token then 56x56 pixels, giving 256 tokens total? Most images have non-square aspect ratios .. what is used to fill the square? Black pixels?
Would there be any advantage in feeding the model an image already resized to 896x896px? Or does the model do anything with the full-resolution image?
Thank you!
Gemma uses a fixed-resolution (896x896) vision encoder, which struggles with non-square or high-resolution images and can lose detail. To address this, an adaptive windowing algorithm ("Pan and Scan") is applied during inference. It segments the image into smaller, equal-sized crops, resizes each crop to 896x896, and feeds them to the encoder. The windowing is applied only when needed and can be disabled for faster inference, so it represents a trade-off between speed and accuracy. Please have a look at the Gemma 3 technical report for more details on Pan & Scan (P&S).
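To make the windowing idea concrete, here is a minimal sketch of a Pan & Scan-style crop decision. This is NOT the exact Gemma algorithm: the function name, the aspect-ratio thresholds, and the `max_crops=4` cap are all assumptions for illustration; the report only says the crop count is capped, not what the cap is.

```python
# Hypothetical sketch of a Pan & Scan-style windowing step (assumed logic,
# not the Gemma implementation): split a wide or tall image into equal,
# non-overlapping crops along its longer axis; each crop is then resized
# to the encoder's fixed 896x896 input and encoded separately.

def pan_and_scan_crops(width, height, max_crops=4):
    """Return (cols, rows): how many equal crops to tile along each axis.

    max_crops is an assumed cap; the thresholds (0.5 / 2.0) are likewise
    illustrative, not taken from the paper."""
    aspect = width / height
    if 0.5 <= aspect <= 2.0:      # roughly square: no windowing needed
        return (1, 1)
    if aspect > 2.0:              # very wide image: split horizontally
        return (min(max_crops, round(aspect)), 1)
    return (1, min(max_crops, round(1 / aspect)))  # very tall: split vertically

# A 3584x896 panorama would be tiled into four 896x896 crops side by side.
print(pan_and_scan_crops(3584, 896))  # -> (4, 1)
print(pan_and_scan_crops(896, 896))   # -> (1, 1), windowing skipped
```

The key point is that windowing only activates for skewed aspect ratios (or high resolutions), which is why it can be disabled entirely when speed matters more than fine detail.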
I guess I need to go into the details of the SigLIP and LLaVA papers .. I was hoping for a boiled-down explanation with one or two examples, because the Gemma paper does not explain this part very well ..
From the paper:
"Pan & Scan (P&S). The Gemma vision encoder operates at a fixed resolution of 896 × 896." ...
"We address this issue with an adaptive windowing algorithm during inference. This algorithm segments images into non-overlapping crops of equal size, covering the whole image, and resize them to 896×896 pixels to pass them to the encoder."
How many segments? Always the same?
"This windowing is applied only when necessary, and control for the maximum number of crops."
What is the max number of crops?
"Each image in this multimodal data is represented by 256 image tokens from the respective vision encoder. The higher resolution encoders thus use average pooling to reduce their output to 256 tokens. For instance, the 896 resolution encoder has a 4x4 average pooling on its output. As shown in Table 7, higher resolution encoders perform better than smaller ones."
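The arithmetic behind that quote can be worked out directly. Assuming a SigLIP-style patch size of 14x14 pixels (an assumption, but consistent with 896 being divisible by 14), the numbers line up like this:

```python
# Worked-out token arithmetic for the 896x896 encoder, assuming 14x14
# pixel patches (SigLIP-style; the patch size is an assumption here).
patch_size = 14
resolution = 896
patches_per_side = resolution // patch_size    # 896 / 14 = 64
tokens_before_pool = patches_per_side ** 2     # 64 * 64 = 4096 patch embeddings
pool = 4                                       # the 4x4 average pooling from the quote
tokens_after_pool = (patches_per_side // pool) ** 2  # 16 * 16 = 256 image tokens
print(tokens_before_pool, tokens_after_pool)   # 4096 256
```

So every encoded 896x896 view, whether the full resized image or a single Pan & Scan crop, always contributes exactly 256 tokens to the language model.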
But Table 7 does not show anything with more pixels than 896x896 .. so does that mean 896x896 is the maximum resolution per crop after Pan & Scan?
Then Table 8 shows the impact of Pan & Scan - but it is unclear what the actual resolution of those images was .. or how many crops were analyzed, etc.
Sorry .. this is not my field, so maybe someone can explain a bit better what is going on behind the scenes. For example, it would help to know whether, above a certain resolution, no more detail can be extracted because there is a fixed maximum number of crops of a given size that get analyzed.
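If the crop count really is capped (the cap value below is a hypothetical placeholder, since the paper does not state it), then yes, there is a hard ceiling on how much detail can be extracted per image, and it is easy to compute:

```python
# Back-of-the-envelope detail ceiling, assuming at most max_crops Pan & Scan
# crops plus the full resized image are encoded, each as 256 tokens.
# max_crops=4 is an ASSUMED placeholder, not a documented Gemma value.

def max_image_tokens(max_crops=4, tokens_per_view=256):
    # full image + up to max_crops crops, each encoded to tokens_per_view tokens
    return (1 + max_crops) * tokens_per_view

print(max_image_tokens())   # 1280 tokens at most per image under this assumption
print(max_image_tokens(0))  # 256 tokens when Pan & Scan is disabled
```

Under that assumption, once an image needs more crops than the cap allows, extra resolution is simply averaged away during the resize, so beyond a certain input size no additional detail reaches the model.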
Thank you!