---
license: apache-2.0
---
# Model Card: SuryaKrishna02/swinv2-roberta-openclip

## Model Description

The `swinv2-roberta-openclip` model is a multimodal vision-language model that combines the Swin Transformer V2 architecture for image processing with a RoBERTa text encoder, implemented using the OpenCLIP framework. The Swin Transformer V2 improves upon the original Swin Transformer architecture with better training stability, improved handling of resolution differences between pre-training and fine-tuning, and reduced data requirements.

This model follows the CLIP (Contrastive Language-Image Pre-training) approach, which enables zero-shot classification and multimodal understanding by learning joint image-text representations.

## Model Architecture

- **Image Encoder**: Swin Transformer V2 Base (Window 12, 192px)
  - Pre-trained `swinv2_base_window12_192.ms_in22k` model from timm
  - A hierarchical vision transformer that uses shifted windows for efficient attention computation
  - Patch dropout of 0.6
  - Outputs image embeddings that capture visual features at multiple scales

- **Text Encoder**: RoBERTa Base
  - Uses `roberta-base` from Hugging Face
  - Mean pooling strategy for sentence embeddings
  - Processes text inputs to generate text embeddings in the same latent space as image embeddings

- **Joint Embedding Space**: 512 dimensions
  - Both image and text features are projected to this common space

- **Framework**: OpenCLIP
  - An open-source implementation of the CLIP architecture that supports various vision and text encoder combinations
  - Enables training on custom datasets with different model architectures
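
In code terms, the two towers are independent encoders whose pooled outputs are linearly projected into the shared 512-dimensional space. The sketch below illustrates that layout directly with `timm` and `transformers`; it is a simplified illustration, not the exact OpenCLIP module structure (which also includes a learnable logit scale and applies patch dropout inside the vision tower).

```python
# Illustrative two-tower sketch of the architecture described above.
# This mirrors the overall layout only; it is NOT the OpenCLIP implementation.
import timm
import torch
import torch.nn as nn
from transformers import AutoModel


class TwoTowerSketch(nn.Module):
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        # Vision tower: SwinV2-Base, classifier removed so the forward pass
        # returns globally pooled features.
        self.visual = timm.create_model(
            "swinv2_base_window12_192.ms_in22k", pretrained=True, num_classes=0
        )
        # Text tower: RoBERTa-Base from Hugging Face.
        self.text = AutoModel.from_pretrained("roberta-base")
        # Linear projections into the shared embedding space.
        self.image_proj = nn.Linear(self.visual.num_features, embed_dim)
        self.text_proj = nn.Linear(self.text.config.hidden_size, embed_dim)

    def encode_image(self, pixels: torch.Tensor) -> torch.Tensor:
        return self.image_proj(self.visual(pixels))

    def encode_text(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        out = self.text(input_ids=input_ids, attention_mask=attention_mask)
        # Mean pooling over non-padding tokens.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        return self.text_proj(pooled)
```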

## Use Cases

This model can be used for:

- Zero-shot image classification
- Text-to-image and image-to-text retrieval
- Multimodal search
- Visual reasoning tasks
- Foundation for fine-tuning on downstream tasks
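
For the retrieval use cases, images and texts are compared by cosine similarity in the shared embedding space. The helper below is a small, hypothetical ranking function (its name and arguments are illustrative); it assumes a model, tokenizer, and preprocessing transform loaded as in the Usage section further down.

```python
import torch
from PIL import Image


@torch.no_grad()
def rank_images_by_text(model, tokenizer, preprocess, image_paths, query):
    """Rank candidate images by cosine similarity to a text query."""
    images = torch.stack(
        [preprocess(Image.open(p).convert("RGB")) for p in image_paths]
    )
    image_features = model.encode_image(images)
    text_features = model.encode_text(tokenizer([query]))
    # Normalize so the dot product equals cosine similarity.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    scores = (image_features @ text_features.T).squeeze(-1)  # one score per image
    order = scores.argsort(descending=True)
    return [(image_paths[i], scores[i].item()) for i in order]
```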

## Limitations

- Performance may vary across domains not well-represented in the training data
- May exhibit biases present in the training datasets
- Visual understanding is limited to image-level features rather than fine-grained object detection

## Training

This model was trained on a subset of the PD12M dataset:

- **Dataset**: 100,000 image-text pairs from PD12M (Public Domain 12M)
- **Training Duration**: 3 epochs
- **Pre-processing**:
  - Image normalization with mean [0.48145466, 0.4578275, 0.40821073] and std [0.26862954, 0.26130258, 0.27577711]
  - Bicubic interpolation with "shortest" resize mode
- **Model Initialization**:
  - Vision encoder: Initialized with pre-trained `swinv2_base_window12_192.ms_in22k` weights
  - Text encoder: Initialized with pre-trained `roberta-base` weights
- **Image Size**: 192x192 pixels

The training process involved:
1. Initializing the vision encoder (Swin Transformer V2) and text encoder (RoBERTa) with their respective pre-trained weights
2. Training both encoders jointly using a contrastive learning objective
3. Using the OpenCLIP framework for efficient training
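
The core of step 2 is the symmetric contrastive (InfoNCE) loss used by CLIP. The snippet below is a minimal sketch of one such training step, not the actual open_clip training script; it assumes the standard `logit_scale` parameter of open_clip models, and the optimizer and batching are placeholders rather than the settings used for this model.

```python
import torch
import torch.nn.functional as F


def contrastive_step(model, images, texts, optimizer):
    """One training step with the symmetric CLIP (InfoNCE) objective."""
    optimizer.zero_grad()
    image_features = F.normalize(model.encode_image(images), dim=-1)
    text_features = F.normalize(model.encode_text(texts), dim=-1)
    # Pairwise similarity matrix, scaled by the learnable temperature.
    logits = model.logit_scale.exp() * image_features @ text_features.T
    labels = torch.arange(logits.shape[0], device=logits.device)
    # Each image should match its own caption (rows) and vice versa (columns).
    loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
    loss.backward()
    optimizer.step()
    return loss.item()
```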

## Usage

```python
import open_clip
import torch
from PIL import Image

# Load model and processors
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms(
    'hf-hub:SuryaKrishna02/swinv2-roberta-openclip'
)
tokenizer = open_clip.get_tokenizer('hf-hub:SuryaKrishna02/swinv2-roberta-openclip')
model.eval()  # inference mode: disables patch dropout and other training-time behaviour

# Process image
image = preprocess_val(Image.open("example.jpg")).unsqueeze(0)

# Process text
text = tokenizer(["a photo of a cat", "a photo of a dog"])

# Generate embeddings
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    
    # Normalize features
    image_features = image_features / image_features.norm(dim=1, keepdim=True)
    text_features = text_features / text_features.norm(dim=1, keepdim=True)

# Calculate similarity
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(f"Label probabilities: {similarity}")
```
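
Normalizing each feature vector to unit length makes the dot product in the last step a cosine similarity, and the softmax over the text axis converts the scaled similarities into per-image probabilities over the candidate captions.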

## Citation

If you use this model in your research, please cite:

```
@software{swinv2_roberta_openclip,
  author = {Guthikonda, Surya Krishna},
  title = {Swinv2-Roberta-OpenCLIP},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/SuryaKrishna02/swinv2-roberta-openclip}
}
```

## Model Configuration

```json
{
 "model_cfg": {
    "embed_dim": 512,
    "vision_cfg": {
      "timm_model_name": "swinv2_base_window12_192.ms_in22k",
      "timm_model_pretrained": true,
      "patch_dropout": 0.6,
      "timm_pool": "avg",
      "timm_proj": "linear",
      "image_size": 192
    },
    "text_cfg": {
      "hf_model_name": "roberta-base",
      "hf_tokenizer_name": "roberta-base",
      "hf_pooler_type": "mean_pooler"
    }
  },
  "preprocess_cfg": {
    "mean": [0.48145466, 0.4578275, 0.40821073],
    "std": [0.26862954, 0.26130258, 0.27577711],
    "interpolation": "bicubic",
    "resize_mode": "shortest"
  }
}
```
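
The stored configuration can also be inspected programmatically. The snippet below assumes the file name conventionally written by OpenCLIP's Hub export (`open_clip_config.json`); adjust the name if the repository stores it differently.

```python
import json

from huggingface_hub import hf_hub_download

# Download and parse the stored OpenCLIP configuration (assumed file name).
cfg_path = hf_hub_download(
    repo_id="SuryaKrishna02/swinv2-roberta-openclip",
    filename="open_clip_config.json",
)
with open(cfg_path) as f:
    cfg = json.load(f)

print(cfg["model_cfg"]["embed_dim"])         # 512
print(cfg["preprocess_cfg"]["resize_mode"])  # "shortest"
```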

## References

- OpenCLIP: An open source implementation of CLIP (https://github.com/mlfoundations/open_clip)
- Swin Transformer V2: Scaling Up Capacity and Resolution (https://arxiv.org/abs/2111.09883)
- RoBERTa: A Robustly Optimized BERT Pretraining Approach (https://arxiv.org/abs/1907.11692)
- PD12M: Public Domain 12M, an open image-text dataset (https://github.com/SuryaKrishna02/PD12M)

## License

This model is released under the Apache License 2.0.

```
Copyright 2025 Surya Guthikonda

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
```