Model Card: SuryaKrishna02/swinv2-roberta-openclip
Model Description
The swinv2-roberta-openclip model is a multimodal vision-language model that combines the Swin Transformer V2 architecture for image processing with a RoBERTa text encoder, implemented using the OpenCLIP framework. The Swin Transformer V2 improves upon the original Swin Transformer architecture with better training stability, improved handling of resolution differences between pre-training and fine-tuning, and reduced data requirements.
This model follows the CLIP (Contrastive Language-Image Pre-training) approach, which enables zero-shot classification and multimodal understanding by learning joint image-text representations.
Model Architecture
Image Encoder: Swin Transformer V2 Base (Window 12, 192px)
- Pre-trained swinv2_base_window12_192.ms_in22k model from timm
- A hierarchical vision transformer that uses shifted windows for efficient attention computation
- Patch dropout of 0.6
- Outputs image embeddings that capture visual features at multiple scales

Text Encoder: RoBERTa Base
- Uses roberta-base from Hugging Face
- Mean pooling strategy for sentence embeddings
- Processes text inputs to generate text embeddings in the same latent space as image embeddings

Joint Embedding Space: 512 dimensions
- Both image and text features are projected to this common space

Framework: OpenCLIP
- An open-source implementation of the CLIP architecture that supports various vision and text encoder combinations
- Enables training on custom datasets with different model architectures
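
Conceptually, the architecture is two towers whose outputs are linearly projected into the shared 512-dimensional space. The snippet below is an illustrative sketch, not the OpenCLIP implementation: the backbones come from timm and Hugging Face as described above, but the projection layers here are randomly initialized and exist only to show the shapes involved.

import timm
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# Vision tower: Swin Transformer V2 backbone from timm; num_classes=0 returns pooled features
vision = timm.create_model("swinv2_base_window12_192.ms_in22k", pretrained=True, num_classes=0)
vision_proj = nn.Linear(vision.num_features, 512)   # project into the 512-dim joint space

# Text tower: RoBERTa from Hugging Face, mean-pooled over valid tokens
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
text_model = AutoModel.from_pretrained("roberta-base")
text_proj = nn.Linear(text_model.config.hidden_size, 512)

image = torch.randn(1, 3, 192, 192)                       # dummy 192x192 RGB batch
tokens = tokenizer(["a photo of a cat"], return_tensors="pt")

with torch.no_grad():
    img_emb = vision_proj(vision(image))                                  # (1, 512)
    hidden = text_model(**tokens).last_hidden_state                       # (1, seq_len, 768)
    mask = tokens["attention_mask"].unsqueeze(-1)                         # mean pooling mask
    txt_emb = text_proj((hidden * mask).sum(dim=1) / mask.sum(dim=1))     # (1, 512)

print(img_emb.shape, txt_emb.shape)   # both torch.Size([1, 512])

In the released model, both projections are trained jointly with the towers under the contrastive objective described in the Training section.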
 
Use Cases
This model can be used for:
- Zero-shot image classification
 - Text-to-image and image-to-text retrieval (see the retrieval sketch after this list)
 - Multimodal search
 - Visual reasoning tasks
 - Foundation for fine-tuning on downstream tasks
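
As an example of the retrieval use case, image embeddings for a gallery can be precomputed once and ranked by cosine similarity against a text query. A minimal sketch, with placeholder image paths and an arbitrary query:

import open_clip
import torch
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    'hf-hub:SuryaKrishna02/swinv2-roberta-openclip'
)
tokenizer = open_clip.get_tokenizer('hf-hub:SuryaKrishna02/swinv2-roberta-openclip')

# Placeholder gallery; replace with real image paths
paths = ["img1.jpg", "img2.jpg", "img3.jpg"]
images = torch.stack([preprocess(Image.open(p)) for p in paths])

with torch.no_grad():
    image_features = model.encode_image(images)
    text_features = model.encode_text(tokenizer(["a photo of a dog"]))
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# Rank the gallery by cosine similarity to the query
scores = (image_features @ text_features.T).squeeze(-1)
for path, score in sorted(zip(paths, scores.tolist()), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {path}")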
 
Limitations
- Performance may vary across domains not well-represented in the training data
 - May exhibit biases present in the training datasets
 - Visual understanding is limited to image-level features rather than fine-grained object detection
 
Training
This model was trained on a subset of the PD12M dataset:
- Dataset: 100,000 image-text pairs from PD12M (Public Domain 12M)
- Training Duration: 3 epochs
- Pre-processing (see the transform sketch after this list):
  - Image normalization with mean [0.48145466, 0.4578275, 0.40821073] and std [0.26862954, 0.26130258, 0.27577711]
  - Bicubic interpolation with "shortest" resize mode
- Model Initialization:
  - Vision encoder: Initialized with pre-trained swinv2_base_window12_192.ms_in22k weights
  - Text encoder: Initialized with pre-trained roberta-base weights
- Image Size: 192x192 pixels
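
OpenCLIP derives these transforms automatically from the model's preprocess configuration; the torchvision snippet below approximates the same validation-time pipeline, under the assumption that "shortest" resize means resizing the shorter side to 192 followed by a center crop.

from PIL import Image
from torchvision import transforms

# Approximation of the validation-time preprocessing described above:
# bicubic resize of the shorter side to 192, center crop to 192x192,
# then normalization with the listed mean and std.
preprocess = transforms.Compose([
    transforms.Resize(192, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(192),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.48145466, 0.4578275, 0.40821073],
                         std=[0.26862954, 0.26130258, 0.27577711]),
])

tensor = preprocess(Image.open("example.jpg").convert("RGB"))   # shape (3, 192, 192)

For actual inference, prefer the preprocess_val transform returned by open_clip.create_model_and_transforms (see the Usage section), which is built from the exact preprocess_cfg listed under Model Configuration.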
 
The training process involved:
- Initializing the vision encoder (Swin Transformer V2) and text encoder (RoBERTa) with their respective pre-trained weights
 - Training both encoders jointly using a contrastive learning objective (a minimal sketch of one such step follows this list)
 - Using the OpenCLIP framework for efficient training
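
The snippet below sketches one contrastive training step. It is not the actual OpenCLIP training loop: it loads the published checkpoint and uses random image tensors and toy captions purely to show the shape of the computation, with the model's learned temperature (logit_scale) scaling the similarity matrix.

import torch
import torch.nn.functional as F
import open_clip

model, _, _ = open_clip.create_model_and_transforms(
    'hf-hub:SuryaKrishna02/swinv2-roberta-openclip'
)
tokenizer = open_clip.get_tokenizer('hf-hub:SuryaKrishna02/swinv2-roberta-openclip')
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Toy batch: random image tensors and placeholder captions, purely for illustration
images = torch.randn(4, 3, 192, 192)
texts = tokenizer(["caption one", "caption two", "caption three", "caption four"])

optimizer.zero_grad()
image_features = F.normalize(model.encode_image(images), dim=-1)
text_features = F.normalize(model.encode_text(texts), dim=-1)

# Symmetric contrastive (InfoNCE) loss: matching pairs lie on the diagonal
logits = model.logit_scale.exp() * image_features @ text_features.T
labels = torch.arange(images.shape[0])
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

loss.backward()
optimizer.step()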
 
Usage
import open_clip
import torch
from PIL import Image
# Load model and processors
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms(
    'hf-hub:SuryaKrishna02/swinv2-roberta-openclip'
)
tokenizer = open_clip.get_tokenizer('hf-hub:SuryaKrishna02/swinv2-roberta-openclip')
# Process image
image = preprocess_val(Image.open("example.jpg")).unsqueeze(0)
# Process text
text = tokenizer(["a photo of a cat", "a photo of a dog"])
# Generate embeddings
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    
    # Normalize features
    image_features = image_features / image_features.norm(dim=1, keepdim=True)
    text_features = text_features / text_features.norm(dim=1, keepdim=True)
# Calculate similarity
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(f"Label probabilities: {similarity}")
Citation
If you use this model in your research, please cite:
@software{swinv2_roberta_openclip,
  author = {Guthikonda, Surya Krishna},
  title = {Swinv2-Roberta-OpenCLIP},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/SuryaKrishna02/swinv2-roberta-openclip}
}
Model Configuration
{
 "model_cfg": {
    "embed_dim": 512,
    "vision_cfg": {
      "timm_model_name": "swinv2_base_window12_192.ms_in22k",
      "timm_model_pretrained": true,
      "patch_dropout": 0.6,
      "timm_pool": "avg",
      "timm_proj": "linear",
      "image_size": 192
    },
    "text_cfg": {
      "hf_model_name": "roberta-base",
      "hf_tokenizer_name": "roberta-base",
      "hf_pooler_type": "mean_pooler"
    }
  },
  "preprocess_cfg": {
    "mean": [0.48145466, 0.4578275, 0.40821073],
    "std": [0.26862954, 0.26130258, 0.27577711],
    "interpolation": "bicubic",
    "resize_mode": "shortest"
  }
}
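
This JSON corresponds to the open_clip_config.json that OpenCLIP stores alongside the weights and reads when the model is loaded via the hf-hub: prefix. As a small sketch, the configuration can be inspected directly, assuming the repository follows this standard layout and huggingface_hub is installed:

import json
from huggingface_hub import hf_hub_download

# Download the configuration file stored next to the checkpoint on the Hub
config_path = hf_hub_download(
    repo_id="SuryaKrishna02/swinv2-roberta-openclip",
    filename="open_clip_config.json",
)
with open(config_path) as f:
    cfg = json.load(f)

print(cfg["model_cfg"]["embed_dim"])                       # 512
print(cfg["model_cfg"]["vision_cfg"]["timm_model_name"])   # swinv2_base_window12_192.ms_in22k
print(cfg["preprocess_cfg"]["mean"])                       # normalization mean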
References
- OpenCLIP: An open source implementation of CLIP (https://github.com/mlfoundations/open_clip)
 - Swin Transformer V2: Scaling Up Capacity and Resolution (https://arxiv.org/abs/2111.09883)
 - RoBERTa: A Robustly Optimized BERT Pretraining Approach (https://arxiv.org/abs/1907.11692)
 - PD12M: Public Domain 12M, an open image-text dataset (https://github.com/SuryaKrishna02/PD12M)
 
License
This model is released under the Apache License 2.0.
Copyright 2025 Surya Guthikonda
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
    http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.