Error loading the SigLIP2 Vision model
Description
Running the code snippet from the Siglip2VisionModel documentation
yields a RuntimeError due to a shape mismatch when loading the model checkpoint.
Steps to Reproduce
- Install transformers library
- Run the following code:
from transformers import Siglip2VisionModel
model = Siglip2VisionModel.from_pretrained("google/siglip2-base-patch16-224")
Expected Behavior
The model should load successfully without errors.
Actual Behavior
The following error is raised:
You are using a model of type siglip_vision_model to instantiate a model of type siglip2_vision_model.
This is not supported for all configurations of models and can yield errors.
[...]
RuntimeError: Error(s) in loading state_dict for Linear:
size mismatch for weight: copying a param with shape torch.Size([768, 3, 16, 16])
from checkpoint, the shape in current model is torch.Size([768, 768]).
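For what it's worth, the two shapes in the traceback hold the same number of elements, which is consistent with the checkpoint storing the patch embedding as a SigLIP-style Conv2d kernel while the Siglip2VisionModel class expects a flattened Linear weight. This is my reading of the error, not a confirmed diagnosis; the sketch below just checks the arithmetic:

```python
# Hypothetical illustration of the size mismatch (shapes copied from the
# traceback; the Conv2d-vs-Linear interpretation is an assumption).
conv_weight_shape = (768, 3, 16, 16)   # SigLIP-style Conv2d: (out, in, kh, kw)
linear_weight_shape = (768, 768)       # SigLIP2-style Linear: (out, in*kh*kw)

out_channels, in_channels, kh, kw = conv_weight_shape

# The element counts agree, so the data itself is compatible...
assert out_channels * in_channels * kh * kw == \
    linear_weight_shape[0] * linear_weight_shape[1]  # 589824 == 589824

# ...but load_state_dict compares shapes directly, so the copy fails
# with the size-mismatch RuntimeError shown above.
assert conv_weight_shape != linear_weight_shape
```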
Additional Investigation
Loading a SigLIP2 checkpoint through AutoModel silently falls back to the SigLIP classes instead of the SigLIP2 ones:
from transformers import AutoModel
model = AutoModel.from_pretrained("google/siglip2-base-patch16-224")
model.vision_model.__class__
# Output: transformers.models.siglip.modeling_siglip.SiglipVisionTransformer
Root Cause Analysis
I believe this is because this checkpoint, like most other SigLIP2 checkpoints, is not defined in src/transformers/models/siglip2/convert_siglip2_to_hf.py but rather in src/transformers/models/siglip/convert_siglip_to_hf.py.
Proposed Solution
Porting these checkpoints to the SigLIP2 conversion script and re-converting them may fix the error.
Environment
- transformers version: 4.52.4
- PyTorch version: 2.6.0
- Python version: 3.11.11
- Operating System: linux
Additional Context
This affects google/siglip2-base-patch16-224 and all other SigLIP2 checkpoints except google/siglip2-base-patch16-naflex and google/siglip2-so400m-patch16-naflex, which are correctly defined in src/transformers/models/siglip2/convert_siglip2_to_hf.py.