
What is the max image resolution?

#2
by Alejandro98 - opened

Question above

It's a tricky question to answer directly: the patch size is 16 and the maximum sequence length is 1024. For square images this works out to 512 x 512 pixels (a 32 x 32 grid of 16 x 16 patches).
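As a quick sanity check on that arithmetic (assuming square inputs and the figures above):

```python
import math

patch_size = 16      # pixels per patch side
max_seq_len = 1024   # maximum number of patches (sequence length)

# For a square image, the patches form a sqrt(max_seq_len) x sqrt(max_seq_len) grid.
patches_per_side = math.isqrt(max_seq_len)      # 32
max_resolution = patches_per_side * patch_size  # 512

print(f"{max_resolution} x {max_resolution}")   # -> 512 x 512
```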

But the default `max_num_patches` in the processor config is 256 rather than 1024. Can we directly override it?

I'm not an expert on this, but you may be able to pass it as an argument when running the preprocessing, see here:
https://github.com/huggingface/transformers/issues/30282#issuecomment-2060791408

Thx for your reply:)

The position embedding only has 256 entries: `vision_model.embeddings.position_embedding.weight` has shape [256, 1152]. Wouldn't that mean the max is 256 patches, i.e. the equivalent of a 256 x 256 pixel image?

In NaFlex, the 256-entry positional embedding is dynamically resized to the target sequence length inside the model. So in principle it supports any sequence length, but it will likely not generalize well beyond the maximum training sequence length of 1024.
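A minimal sketch of what that dynamic resizing can look like (a hypothetical `resize_pos_embed` helper, not the actual transformers implementation): treat the 256 embeddings as a 16 x 16 grid and bilinearly interpolate it to the target grid size.

```python
import numpy as np

def resize_pos_embed(pos_embed, new_h, new_w):
    """Bilinearly resize a flattened [side*side, dim] position embedding grid
    to [new_h*new_w, dim]."""
    side = int(np.sqrt(pos_embed.shape[0]))   # 16 for 256 entries
    grid = pos_embed.reshape(side, side, -1)

    # Fractional source coordinates for each target row/column.
    ys = np.linspace(0, side - 1, new_h)
    xs = np.linspace(0, side - 1, new_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, side - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, side - 1)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]

    # Standard bilinear blend of the four neighboring embeddings.
    top    = grid[y0][:, x0] * (1 - wx) + grid[y0][:, x1] * wx
    bottom = grid[y1][:, x0] * (1 - wx) + grid[y1][:, x1] * wx
    out = top * (1 - wy) + bottom * wy
    return out.reshape(new_h * new_w, -1)

# Toy example: stretch a 256-entry embedding (dim 4) to 1024 entries.
pe = np.arange(256 * 4, dtype=np.float64).reshape(256, 4)
resized = resize_pos_embed(pe, 32, 32)
print(resized.shape)  # (1024, 4)
```

Because the interpolation endpoints coincide with the original grid corners, the corner embeddings are preserved exactly; everything in between is a smooth blend, which is why quality degrades gracefully rather than failing outright past the trained length.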
