Model Card: SuryaKrishna02/swinv2-roberta-openclip
Model Description
The swinv2-roberta-openclip model is a multimodal vision-language model that combines the Swin Transformer V2 architecture for image processing with a RoBERTa text encoder, implemented using the OpenCLIP framework. The Swin Transformer V2 improves upon the original Swin Transformer architecture with better training stability, improved handling of resolution differences between pre-training and fine-tuning, and reduced data requirements.
This model follows the CLIP (Contrastive Language-Image Pre-training) approach, which enables zero-shot classification and multimodal understanding by learning joint image-text representations.
Model Architecture
Image Encoder: Swin Transformer V2 Base (Window 12, 192px)
- Pre-trained swinv2_base_window12_192.ms_in22k model from timm
- A hierarchical vision transformer that uses shifted windows for efficient attention computation
- Patch dropout of 0.6
- Outputs image embeddings that capture visual features at multiple scales

Text Encoder: RoBERTa Base
- Uses roberta-base from Hugging Face
- Mean pooling strategy for sentence embeddings
- Processes text inputs to generate text embeddings in the same latent space as image embeddings

Joint Embedding Space: 512 dimensions
- Both image and text features are projected to this common space

Framework: OpenCLIP
- An open-source implementation of the CLIP architecture that supports various vision and text encoder combinations
- Enables training on custom datasets with different model architectures
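
Conceptually, the architecture is two towers whose outputs are linearly projected into the shared 512-dimensional space. The snippet below is an illustrative sketch, not the OpenCLIP implementation: the backbones come from timm and Hugging Face as described above, but the projection layers here are randomly initialized and exist only to show the shapes involved.

import timm
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# Vision tower: Swin Transformer V2 backbone from timm; num_classes=0 returns pooled features
vision = timm.create_model("swinv2_base_window12_192.ms_in22k", pretrained=True, num_classes=0)
vision_proj = nn.Linear(vision.num_features, 512)   # project into the 512-dim joint space

# Text tower: RoBERTa from Hugging Face, mean-pooled over valid tokens
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
text_model = AutoModel.from_pretrained("roberta-base")
text_proj = nn.Linear(text_model.config.hidden_size, 512)

image = torch.randn(1, 3, 192, 192)                       # dummy 192x192 RGB batch
tokens = tokenizer(["a photo of a cat"], return_tensors="pt")

with torch.no_grad():
    img_emb = vision_proj(vision(image))                                  # (1, 512)
    hidden = text_model(**tokens).last_hidden_state                       # (1, seq_len, 768)
    mask = tokens["attention_mask"].unsqueeze(-1)                         # mean pooling mask
    txt_emb = text_proj((hidden * mask).sum(dim=1) / mask.sum(dim=1))     # (1, 512)

print(img_emb.shape, txt_emb.shape)   # both torch.Size([1, 512])

In the released model, both projections are trained jointly with the towers under the contrastive objective described in the Training section.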
 
Use Cases
This model can be used for:
- Zero-shot image classification
 - Text-to-image and image-to-text retrieval (see the retrieval sketch after this list)
 - Multimodal search
 - Visual reasoning tasks
 - Foundation for fine-tuning on downstream tasks
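
As an example of the retrieval use case, image embeddings for a gallery can be precomputed once and ranked by cosine similarity against a text query. A minimal sketch, with placeholder image paths and an arbitrary query:

import open_clip
import torch
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    'hf-hub:SuryaKrishna02/swinv2-roberta-openclip'
)
tokenizer = open_clip.get_tokenizer('hf-hub:SuryaKrishna02/swinv2-roberta-openclip')

# Placeholder gallery; replace with real image paths
paths = ["img1.jpg", "img2.jpg", "img3.jpg"]
images = torch.stack([preprocess(Image.open(p)) for p in paths])

with torch.no_grad():
    image_features = model.encode_image(images)
    text_features = model.encode_text(tokenizer(["a photo of a dog"]))
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# Rank the gallery by cosine similarity to the query
scores = (image_features @ text_features.T).squeeze(-1)
for path, score in sorted(zip(paths, scores.tolist()), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {path}")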
 
Limitations
- Performance may vary across domains not well-represented in the training data
 - May exhibit biases present in the training datasets
 - Visual understanding is limited to image-level features rather than fine-grained object detection
 
Training
This model was trained on a subset of the PD12M dataset:
- Dataset: 100,000 image-text pairs from PD12M (Public Domain 12M)
- Training Duration: 3 epochs
- Pre-processing (see the transform sketch after this list):
  - Image normalization with mean [0.48145466, 0.4578275, 0.40821073] and std [0.26862954, 0.26130258, 0.27577711]
  - Bicubic interpolation with "shortest" resize mode
- Model Initialization:
  - Vision encoder: Initialized with pre-trained swinv2_base_window12_192.ms_in22k weights
  - Text encoder: Initialized with pre-trained roberta-base weights
- Image Size: 192x192 pixels
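
OpenCLIP derives these transforms automatically from the model's preprocess configuration; the torchvision snippet below approximates the same validation-time pipeline, under the assumption that "shortest" resize means resizing the shorter side to 192 followed by a center crop.

from PIL import Image
from torchvision import transforms

# Approximation of the validation-time preprocessing described above:
# bicubic resize of the shorter side to 192, center crop to 192x192,
# then normalization with the listed mean and std.
preprocess = transforms.Compose([
    transforms.Resize(192, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(192),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.48145466, 0.4578275, 0.40821073],
                         std=[0.26862954, 0.26130258, 0.27577711]),
])

tensor = preprocess(Image.open("example.jpg").convert("RGB"))   # shape (3, 192, 192)

For actual inference, prefer the preprocess_val transform returned by open_clip.create_model_and_transforms (see the Usage section), which is built from the exact preprocess_cfg listed under Model Configuration.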
 
The training process involved:
- Initializing the vision encoder (Swin Transformer V2) and text encoder (RoBERTa) with their respective pre-trained weights
 - Training both encoders jointly using a contrastive learning objective (a minimal sketch of one such step follows this list)
 - Using the OpenCLIP framework for efficient training
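
The snippet below sketches one contrastive training step. It is not the actual OpenCLIP training loop: it loads the published checkpoint and uses random image tensors and toy captions purely to show the shape of the computation, with the model's learned temperature (logit_scale) scaling the similarity matrix.

import torch
import torch.nn.functional as F
import open_clip

model, _, _ = open_clip.create_model_and_transforms(
    'hf-hub:SuryaKrishna02/swinv2-roberta-openclip'
)
tokenizer = open_clip.get_tokenizer('hf-hub:SuryaKrishna02/swinv2-roberta-openclip')
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Toy batch: random image tensors and placeholder captions, purely for illustration
images = torch.randn(4, 3, 192, 192)
texts = tokenizer(["caption one", "caption two", "caption three", "caption four"])

optimizer.zero_grad()
image_features = F.normalize(model.encode_image(images), dim=-1)
text_features = F.normalize(model.encode_text(texts), dim=-1)

# Symmetric contrastive (InfoNCE) loss: matching pairs lie on the diagonal
logits = model.logit_scale.exp() * image_features @ text_features.T
labels = torch.arange(images.shape[0])
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

loss.backward()
optimizer.step()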
 
Usage
import open_clip
import torch
from PIL import Image
# Load model and processors
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms(
    'hf-hub:SuryaKrishna02/swinv2-roberta-openclip'
)
tokenizer = open_clip.get_tokenizer('hf-hub:SuryaKrishna02/swinv2-roberta-openclip')
# Process image
image = preprocess_val(Image.open("example.jpg")).unsqueeze(0)
# Process text
text = tokenizer(["a photo of a cat", "a photo of a dog"])
# Generate embeddings
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    
    # Normalize features
    image_features = image_features / image_features.norm(dim=1, keepdim=True)
    text_features = text_features / text_features.norm(dim=1, keepdim=True)
# Calculate similarity
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(f"Label probabilities: {similarity}")
Citation
If you use this model in your research, please cite:
@software{swinv2_roberta_openclip,
  author = {Guthikonda, Surya Krishna},
  title = {Swinv2-Roberta-OpenCLIP},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/SuryaKrishna02/swinv2-roberta-openclip}
}
Model Configuration
{
 "model_cfg": {
    "embed_dim": 512,
    "vision_cfg": {
      "timm_model_name": "swinv2_base_window12_192.ms_in22k",
      "timm_model_pretrained": true,
      "patch_dropout": 0.6,
      "timm_pool": "avg",
      "timm_proj": "linear",
      "image_size": 192
    },
    "text_cfg": {
      "hf_model_name": "roberta-base",
      "hf_tokenizer_name": "roberta-base",
      "hf_pooler_type": "mean_pooler"
    }
  },
  "preprocess_cfg": {
    "mean": [0.48145466, 0.4578275, 0.40821073],
    "std": [0.26862954, 0.26130258, 0.27577711],
    "interpolation": "bicubic",
    "resize_mode": "shortest"
  }
}
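
This JSON corresponds to the open_clip_config.json that OpenCLIP stores alongside the weights and reads when the model is loaded via the hf-hub: prefix. As a small sketch, the configuration can be inspected directly, assuming the repository follows this standard layout and huggingface_hub is installed:

import json
from huggingface_hub import hf_hub_download

# Download the configuration file stored next to the checkpoint on the Hub
config_path = hf_hub_download(
    repo_id="SuryaKrishna02/swinv2-roberta-openclip",
    filename="open_clip_config.json",
)
with open(config_path) as f:
    cfg = json.load(f)

print(cfg["model_cfg"]["embed_dim"])                       # 512
print(cfg["model_cfg"]["vision_cfg"]["timm_model_name"])   # swinv2_base_window12_192.ms_in22k
print(cfg["preprocess_cfg"]["mean"])                       # normalization mean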
References
- OpenCLIP: An open source implementation of CLIP (https://github.com/mlfoundations/open_clip)
 - Swin Transformer V2: Scaling Up Capacity and Resolution (https://arxiv.org/abs/2111.09883)
 - RoBERTa: A Robustly Optimized BERT Pretraining Approach (https://arxiv.org/abs/1907.11692)
 - PD12M: Public Domain 12M, an open image-text dataset (https://github.com/SuryaKrishna02/PD12M)
 
License
This model is released under the Apache License 2.0.
Copyright 2025 Surya Guthikonda
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
    http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.