SigLIP or SigLIP2 encoder?

#48
by orrzohar - opened

SigLIP or SigLIP2 encoder?

Google org

Hi @orrzohar ,

Yes, SigLIP and SigLIP 2 utilize similar encoder architectures, both employing the Vision Transformer (ViT) design with learned positional embeddings.
Could you please refer to this reference?

Thank you.

Yes, but which SigLIP checkpoint did Gemma3 use? SigLIP2 or SigLIP?
Thank you!
Orr

@orrzohar : A or B?
@GopiUppari : Yes, ...
🤣

@orrzohar
Yes, SigLIP 2 utilizes a similar encoder architecture to SigLIP. In Gemma 3, they used a 400M-parameter variant of the SigLIP vision encoder.
SigLIP-So400m

@prithivMLmods
Trust me, I am familiar with SigLIP and SigLIP2. Both have shape-optimized model variants. I know.
I just want to know WHICH was used.
Are you from Google org (there is no Google tag, and I already checked the technical report and all model configs, and you can't tell from those)?
Do you know that they used SigLIP-SO400M and not SigLIP2-SO-400M?

Thanks
Orr

No, I'm not from Google Org. I just read your discussion, so I responded.
I also analyzed the technical report, but I didn’t see anything about it.
I remember reading an article about Gemma 3, possibly from Gradient Flow (mid-March). It clearly mentioned that they used a 400M-parameter variant of the SigLIP vision encoder.
@orrzohar

Edit:

Yeah, this newsletter!
https://gradientflow.com/gemma-3-what-you-need-to-know/?utm_source=chatgpt.com

I’ve had this question since the release. Rather than guessing, let's ask the organization directly once again.

Hi @GopiUppari , can you please tell me exactly which vision encoder (SigLIP or SigLIP2) is used in the Gemma 3 family of models? Is it the SigLIP-SO400M?

Thank you!
Prithiv
