
What is the max image resolution?

#2
by Alejandro98 - opened

Question above

It's a tricky question to answer directly: the patch size is 16 and the maximum sequence length is 1024. For square images this works out to 512 x 512 pixels (a 32 x 32 grid of 16 x 16 patches).
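As a quick sanity check on that arithmetic (assuming square inputs and the figures above):

```python
import math

patch_size = 16      # pixels per patch side
max_seq_len = 1024   # maximum number of patches (sequence length)

# For a square image, the patches form a sqrt(max_seq_len) x sqrt(max_seq_len) grid.
patches_per_side = math.isqrt(max_seq_len)      # 32
max_resolution = patches_per_side * patch_size  # 512

print(f"{max_resolution} x {max_resolution}")   # -> 512 x 512
```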

But the default `max_num_patches` in the processor config is 256 rather than 1024. Can we directly override it?

I'm not an expert on this, but you may be able to pass it as an argument when running the preprocessing, see here:
https://github.com/huggingface/transformers/issues/30282#issuecomment-2060791408

Thx for your reply:)

The position embedding only has 256 entries: `vision_model.embeddings.position_embedding.weight` has shape [256, 1152]. Wouldn't that mean the max is 256 patches, i.e. the equivalent of a 256 x 256 pixel image?

In NaFlex, the 256-entry positional embedding is dynamically resized to the target sequence length inside the model. So in principle it supports any sequence length, but it will likely not generalize well beyond the maximum training sequence length of 1024.
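A minimal sketch of what that dynamic resizing can look like (a hypothetical `resize_pos_embed` helper, not the actual transformers implementation): treat the 256 embeddings as a 16 x 16 grid and bilinearly interpolate it to the target grid size.

```python
import numpy as np

def resize_pos_embed(pos_embed, new_h, new_w):
    """Bilinearly resize a flattened [side*side, dim] position embedding grid
    to [new_h*new_w, dim]."""
    side = int(np.sqrt(pos_embed.shape[0]))   # 16 for 256 entries
    grid = pos_embed.reshape(side, side, -1)

    # Fractional source coordinates for each target row/column.
    ys = np.linspace(0, side - 1, new_h)
    xs = np.linspace(0, side - 1, new_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, side - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, side - 1)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]

    # Standard bilinear blend of the four neighboring embeddings.
    top    = grid[y0][:, x0] * (1 - wx) + grid[y0][:, x1] * wx
    bottom = grid[y1][:, x0] * (1 - wx) + grid[y1][:, x1] * wx
    out = top * (1 - wy) + bottom * wy
    return out.reshape(new_h * new_w, -1)

# Toy example: stretch a 256-entry embedding (dim 4) to 1024 entries.
pe = np.arange(256 * 4, dtype=np.float64).reshape(256, 4)
resized = resize_pos_embed(pe, 32, 32)
print(resized.shape)  # (1024, 4)
```

Because the interpolation endpoints coincide with the original grid corners, the corner embeddings are preserved exactly; everything in between is a smooth blend, which is why quality degrades gracefully rather than failing outright past the trained length.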
