SuryaKrishna02/swinv2-roberta-openclip

SuryaKrishna02 committed on May 5

Commit ee9063a (verified) · 1 Parent(s): 18c35a8

Update README.md

README.md +168 -3

@@ -1,3 +1,168 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+---
+# Model Card: SuryaKrishna02/swinv2-roberta-openclip
+## Model Description
+The `swinv2-roberta-openclip` model is a multimodal vision-language model that combines the Swin Transformer V2 architecture for image processing with a RoBERTa text encoder, implemented using the OpenCLIP framework. The Swin Transformer V2 improves upon the original Swin Transformer architecture with better training stability, improved handling of resolution differences between pre-training and fine-tuning, and reduced data requirements.
+This model follows the CLIP (Contrastive Language-Image Pre-training) approach, which enables zero-shot classification and multimodal understanding by learning joint image-text representations.
+## Model Architecture
+- **Image Encoder**: Swin Transformer V2 Base (Window 12, 192px)
+  - Pre-trained `swinv2_base_window12_192.ms_in22k` model from timm
+  - A hierarchical vision transformer that uses shifted windows for efficient attention computation
+  - Patch dropout of 0.6
+  - Outputs image embeddings that capture visual features at multiple scales
+- **Text Encoder**: RoBERTa Base
+  - Uses `roberta-base` from Hugging Face
+  - Mean pooling strategy for sentence embeddings
+  - Processes text inputs to generate text embeddings in the same latent space as image embeddings
+- **Joint Embedding Space**: 512 dimensions
+  - Both image and text features are projected to this common space
+- **Framework**: OpenCLIP
+  - An open-source implementation of the CLIP architecture that supports various vision and text encoder combinations
+  - Enables training on custom datasets with different model architectures
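The "mean pooling strategy" listed for the text encoder can be illustrated with a small, dependency-free sketch. It assumes the usual masked mean, where padding tokens are excluded from the average; the function name `mean_pool` and the toy vectors are hypothetical and not taken from the model's code.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors over non-padding positions.

    token_embeddings: list of token vectors (one list of floats per token)
    attention_mask:   list of 0/1 flags, 1 = real token, 0 = padding
    """
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                totals[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [total / count for total in totals]


# Toy example: three tokens of dimension 2; the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)
```

In the model itself, the pooled vector would then be projected into the 512-dimensional joint space; that projection is not shown here.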
+## Use Cases
+This model can be used for:
+- Zero-shot image classification
+- Text-to-image and image-to-text retrieval
+- Multimodal search
+- Visual reasoning tasks
+- Foundation for fine-tuning on downstream tasks
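As a rough illustration of the retrieval use case, the sketch below ranks a small gallery of image embeddings against one text embedding by cosine similarity. The helper names and the 3-dimensional toy vectors are hypothetical; real embeddings from this model would have 512 dimensions and would come from `encode_image` and `encode_text`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_images(text_embedding, image_embeddings):
    """Return image ids sorted from most to least similar to the text query."""
    scores = {
        image_id: cosine_similarity(text_embedding, embedding)
        for image_id, embedding in image_embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy embeddings standing in for model outputs.
gallery = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.1, 0.9, 0.0],
    "car.jpg": [0.0, 0.1, 0.9],
}
query = [1.0, 0.2, 0.0]
ranking = rank_images(query, gallery)
print(ranking)
```

Image-to-text retrieval works the same way with the roles of the two embedding sets swapped.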
+## Limitations
+- Performance may vary across domains not well-represented in the training data
+- May exhibit biases present in the training datasets
+- Visual understanding is limited to image-level features rather than fine-grained object detection
+## Training
+This model was trained on a subset of the PD12M dataset:
+- **Dataset**: 100,000 image-text pairs from PD12M (Product Descriptions 12M)
+- **Training Duration**: 3 epochs
+- **Pre-processing**:
+  - Image normalization with mean [0.48145466, 0.4578275, 0.40821073] and std [0.26862954, 0.26130258, 0.27577711]
+  - Bicubic interpolation with the "shortest" resize mode (the shorter image side is resized to the target size before cropping)
+- **Model Initialization**:
+  - Vision encoder: Initialized with pre-trained `swinv2_base_window12_192.ms_in22k` weights
+  - Text encoder: Initialized with pre-trained `roberta-base` weights
+- **Image Size**: 192x192 pixels
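The normalization step in the pre-processing list can be sketched for a single pixel as follows. This assumes the common convention of first scaling 8-bit values to [0, 1] and then applying per-channel mean and standard deviation; the function name `normalize_pixel` is hypothetical, while the constants are the ones listed above.

```python
# Per-channel statistics from the pre-processing configuration (R, G, B).
MEAN = [0.48145466, 0.4578275, 0.40821073]
STD = [0.26862954, 0.26130258, 0.27577711]


def normalize_pixel(rgb):
    """Map one 8-bit RGB pixel to normalized per-channel values."""
    scaled = [channel / 255.0 for channel in rgb]  # 0..255 -> 0..1
    return [(value - m) / s for value, m, s in zip(scaled, MEAN, STD)]


white = normalize_pixel([255, 255, 255])
black = normalize_pixel([0, 0, 0])
print(white, black)
```

In practice `preprocess_val` returned by OpenCLIP applies this to the whole image tensor, so the sketch is only meant to show what the numbers do.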
+The training process involved:
+1. Initializing the vision encoder (Swin Transformer V2) and text encoder (RoBERTa) with their respective pre-trained weights
+2. Training both encoders jointly using a contrastive learning objective
+3. Using the OpenCLIP framework for efficient training
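The card does not spell out the contrastive objective used in step 2. A minimal sketch, assuming the standard symmetric CLIP loss (cross-entropy over image-to-text and text-to-image similarities), is shown below in plain Python. The function names are hypothetical, and the fixed `logit_scale=100.0` only mirrors the scaling factor in the usage snippet; during training the scale is typically a learned parameter.

```python
import math


def cross_entropy(logits, target_index):
    """Cross-entropy of one row of logits against the correct class index."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(v - max_logit) for v in logits))
    return log_sum_exp - logits[target_index]


def clip_contrastive_loss(image_features, text_features, logit_scale=100.0):
    """Symmetric contrastive loss over a batch of L2-normalized embeddings.

    Row i of image_features and row i of text_features form a matching pair.
    """
    n = len(image_features)
    logits = [
        [logit_scale * sum(a * b for a, b in zip(img, txt)) for txt in text_features]
        for img in image_features
    ]
    # Image-to-text: each image should pick out its own caption.
    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-image: each caption should pick out its own image.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    text_to_image = sum(cross_entropy(columns[j], j) for j in range(n)) / n
    return (image_to_text + text_to_image) / 2


images = [[1.0, 0.0], [0.0, 1.0]]
aligned = clip_contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
```

Correctly paired embeddings give a loss near zero, while swapping the captions gives a large loss, which is the signal that pulls matching image-text pairs together.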
+## Usage
+```python
+import open_clip
+import torch
+from PIL import Image
+# Load model and processors
+model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms(
+    'hf-hub:SuryaKrishna02/swinv2-roberta-openclip'
+)
+model.eval()  # switch off training-only behavior (e.g. dropout, stochastic depth) for inference
+tokenizer = open_clip.get_tokenizer('hf-hub:SuryaKrishna02/swinv2-roberta-openclip')
+# Process image
+image = preprocess_val(Image.open("example.jpg")).unsqueeze(0)
+# Process text
+text = tokenizer(["a photo of a cat", "a photo of a dog"])
+# Generate embeddings
+with torch.no_grad():
+    image_features = model.encode_image(image)
+    text_features = model.encode_text(text)
+    # Normalize features
+    image_features = image_features / image_features.norm(dim=1, keepdim=True)
+    text_features = text_features / text_features.norm(dim=1, keepdim=True)
+# Calculate similarity
+similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
+print(f"Label probabilities: {similarity}")
+```
+## Citation
+If you use this model in your research, please cite:
+```bibtex
+@software{swinv2_roberta_openclip,
+  author = {Guthikonda, Surya Krishna},
+  title = {Swinv2-Roberta-OpenCLIP},
+  year = {2025},
+  publisher = {Hugging Face},
+  url = {https://huggingface.co/SuryaKrishna02/swinv2-roberta-openclip}
+}
+```
+## Model Configuration
+```json
+{
+  "model_cfg": {
+    "embed_dim": 512,
+    "vision_cfg": {
+      "timm_model_name": "swinv2_base_window12_192.ms_in22k",
+      "timm_model_pretrained": true,
+      "patch_dropout": 0.6,
+      "timm_pool": "avg",
+      "timm_proj": "linear",
+      "image_size": 192
+    },
+    "text_cfg": {
+      "hf_model_name": "roberta-base",
+      "hf_tokenizer_name": "roberta-base",
+      "hf_pooler_type": "mean_pooler"
+    }
+  },
+  "preprocess_cfg": {
+    "mean": [0.48145466, 0.4578275, 0.40821073],
+    "std": [0.26862954, 0.26130258, 0.27577711],
+    "interpolation": "bicubic",
+    "resize_mode": "shortest"
+  }
+}
+```
+## References
+- OpenCLIP: An open source implementation of CLIP (https://github.com/mlfoundations/open_clip)
+- Swin Transformer V2: Scaling Up Capacity and Resolution (https://arxiv.org/abs/2111.09883)
+- RoBERTa: A Robustly Optimized BERT Pretraining Approach (https://arxiv.org/abs/1907.11692)
+- PD12M: An Open Dataset for Product Recognition and Detection (https://github.com/SuryaKrishna02/PD12M)
+## License
+This model is released under the Apache License 2.0.
+```
+Copyright 2025 Surya Guthikonda
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+    http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+```