---
license: apache-2.0
---

# Model Card: SuryaKrishna02/swinv2-roberta-openclip

## Model Description

The `swinv2-roberta-openclip` model is a multimodal vision-language model that combines the Swin Transformer V2 architecture for image processing with a RoBERTa text encoder, implemented using the OpenCLIP framework. Swin Transformer V2 improves upon the original Swin Transformer architecture with better training stability, improved handling of resolution differences between pre-training and fine-tuning, and reduced data requirements.

This model follows the CLIP (Contrastive Language-Image Pre-training) approach, which enables zero-shot classification and multimodal understanding by learning joint image-text representations.

## Model Architecture

- **Image Encoder**: Swin Transformer V2 Base (Window 12, 192px)
  - Pre-trained `swinv2_base_window12_192.ms_in22k` model from timm
  - A hierarchical vision transformer that uses shifted windows for efficient attention computation
  - Patch dropout of 0.6
  - Outputs image embeddings that capture visual features at multiple scales

- **Text Encoder**: RoBERTa Base
  - Uses `roberta-base` from Hugging Face
  - Mean pooling strategy for sentence embeddings (sketched after this list)
  - Processes text inputs to generate text embeddings in the same latent space as the image embeddings

- **Joint Embedding Space**: 512 dimensions
  - Both image and text features are projected into this common space

- **Framework**: OpenCLIP
  - An open-source implementation of the CLIP architecture that supports various vision and text encoder combinations
  - Enables training on custom datasets with different model architectures

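For intuition, the text tower's mean pooling can be pictured as averaging the token embeddings while ignoring padded positions. The snippet below is a minimal, illustrative sketch of that idea (not OpenCLIP's exact `mean_pooler` implementation); the tensor sizes are placeholders.

```python
import torch

def mean_pool(token_embeddings, attention_mask):
    # Average the token embeddings, ignoring padded positions.
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)    # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)         # (batch, 1)
    return summed / counts

# Toy example: batch of 2 sequences, 4 tokens each, hidden size 8 (placeholder sizes).
tokens = torch.randn(2, 4, 8)
mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])
print(mean_pool(tokens, mask).shape)  # torch.Size([2, 8])
```
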
## Use Cases

This model can be used for:

- Zero-shot image classification (see the sketch below)
- Text-to-image and image-to-text retrieval
- Multimodal search
- Visual reasoning tasks
- As a foundation for fine-tuning on downstream tasks

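As an example of the first use case, zero-shot classification amounts to scoring an image against one text prompt per candidate label. The snippet below is an illustrative sketch that mirrors the loading code in the Usage section; the label list and image path are placeholders.

```python
import torch
import open_clip
from PIL import Image

# Load the model as shown in the Usage section below.
model, _, preprocess = open_clip.create_model_and_transforms(
    'hf-hub:SuryaKrishna02/swinv2-roberta-openclip'
)
tokenizer = open_clip.get_tokenizer('hf-hub:SuryaKrishna02/swinv2-roberta-openclip')

# Placeholder labels and image path; substitute your own.
labels = ["cat", "dog", "bicycle", "airplane"]
prompts = tokenizer([f"a photo of a {label}" for label in labels])
image = preprocess(Image.open("example.jpg")).unsqueeze(0)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(prompts)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

# One probability per candidate label.
print(dict(zip(labels, probs.squeeze(0).tolist())))
```
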
## Limitations

- Performance may vary across domains not well represented in the training data
- May exhibit biases present in the training datasets
- Visual understanding is limited to image-level features rather than fine-grained object detection

## Training

This model was trained on a subset of the PD12M dataset:

- **Dataset**: 100,000 image-text pairs from PD12M (Product Descriptions 12M)
- **Training Duration**: 3 epochs
- **Pre-processing**:
  - Image normalization with mean [0.48145466, 0.4578275, 0.40821073] and std [0.26862954, 0.26130258, 0.27577711]
  - Bicubic interpolation with "shortest" resize mode
- **Model Initialization**:
  - Vision encoder: initialized with pre-trained `swinv2_base_window12_192.ms_in22k` weights
  - Text encoder: initialized with pre-trained `roberta-base` weights
- **Image Size**: 192x192 pixels

The training process involved:

1. Initializing the vision encoder (Swin Transformer V2) and text encoder (RoBERTa) with their respective pre-trained weights
2. Training both encoders jointly using a contrastive learning objective (see the sketch after this list)
3. Using the OpenCLIP framework for efficient training

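For reference, the contrastive objective pulls matching image-text pairs together and pushes mismatched pairs apart within each batch. The function below is a schematic of that symmetric CLIP-style loss, not the exact OpenCLIP training code; the batch of embeddings is random placeholder data.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    # Unit-normalize so the dot product is cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise image-text similarities for the batch, scaled by a temperature.
    logits_per_image = logit_scale * image_features @ text_features.T
    logits_per_text = logits_per_image.T

    # Matching pairs lie on the diagonal; cross-entropy pulls them together.
    targets = torch.arange(image_features.shape[0], device=image_features.device)
    loss_img = F.cross_entropy(logits_per_image, targets)
    loss_txt = F.cross_entropy(logits_per_text, targets)
    return (loss_img + loss_txt) / 2

# Placeholder batch of 8 image and 8 text embeddings in the 512-dim joint space.
image_emb = torch.randn(8, 512)
text_emb = torch.randn(8, 512)
print(clip_contrastive_loss(image_emb, text_emb, logit_scale=100.0))
```
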
## Usage

```python
import open_clip
import torch
from PIL import Image

# Load model and processors
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms(
    'hf-hub:SuryaKrishna02/swinv2-roberta-openclip'
)
tokenizer = open_clip.get_tokenizer('hf-hub:SuryaKrishna02/swinv2-roberta-openclip')

# Process image
image = preprocess_val(Image.open("example.jpg")).unsqueeze(0)

# Process text
text = tokenizer(["a photo of a cat", "a photo of a dog"])

# Generate embeddings
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# Normalize features
image_features = image_features / image_features.norm(dim=1, keepdim=True)
text_features = text_features / text_features.norm(dim=1, keepdim=True)

# Calculate similarity
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(f"Label probabilities: {similarity}")
```

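The same embeddings also support retrieval (see Use Cases). Below is an illustrative sketch that ranks a small set of images against a text query by cosine similarity; the image paths and query text are placeholders.

```python
import torch
import open_clip
from PIL import Image

model, _, preprocess_val = open_clip.create_model_and_transforms(
    'hf-hub:SuryaKrishna02/swinv2-roberta-openclip'
)
tokenizer = open_clip.get_tokenizer('hf-hub:SuryaKrishna02/swinv2-roberta-openclip')

# Placeholder image paths and query; substitute your own collection.
paths = ["img1.jpg", "img2.jpg", "img3.jpg"]
images = torch.stack([preprocess_val(Image.open(p)) for p in paths])
query = tokenizer(["a photo of a red bicycle"])

with torch.no_grad():
    image_features = model.encode_image(images)
    text_features = model.encode_text(query)

# Cosine similarity between the query and every image embedding.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
scores = (image_features @ text_features.T).squeeze(-1)

# Print the images ranked best match first.
for i in scores.argsort(descending=True).tolist():
    print(paths[i], float(scores[i]))
```
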
## Citation

If you use this model in your research, please cite:

```bibtex
@software{swinv2_roberta_openclip,
  author = {Guthikonda, Surya Krishna},
  title = {Swinv2-Roberta-OpenCLIP},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/SuryaKrishna02/swinv2-roberta-openclip}
}
```

## Model Configuration

```json
{
  "model_cfg": {
    "embed_dim": 512,
    "vision_cfg": {
      "timm_model_name": "swinv2_base_window12_192.ms_in22k",
      "timm_model_pretrained": true,
      "patch_dropout": 0.6,
      "timm_pool": "avg",
      "timm_proj": "linear",
      "image_size": 192
    },
    "text_cfg": {
      "hf_model_name": "roberta-base",
      "hf_tokenizer_name": "roberta-base",
      "hf_pooler_type": "mean_pooler"
    }
  },
  "preprocess_cfg": {
    "mean": [0.48145466, 0.4578275, 0.40821073],
    "std": [0.26862954, 0.26130258, 0.27577711],
    "interpolation": "bicubic",
    "resize_mode": "shortest"
  }
}
```

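As a rough guide, the `preprocess_cfg` above corresponds to a torchvision pipeline along the following lines. This is an approximation of the transform OpenCLIP builds from the config, not the exact object it returns.

```python
from torchvision import transforms

# Rough equivalent of preprocess_cfg: resize the shorter side to 192 with bicubic
# interpolation, center-crop to 192x192, then normalize with the stats above.
preprocess = transforms.Compose([
    transforms.Resize(192, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(192),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.48145466, 0.4578275, 0.40821073],
        std=[0.26862954, 0.26130258, 0.27577711],
    ),
])
```
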
## References

- OpenCLIP: An open-source implementation of CLIP (https://github.com/mlfoundations/open_clip)
- Swin Transformer V2: Scaling Up Capacity and Resolution (https://arxiv.org/abs/2111.09883)
- RoBERTa: A Robustly Optimized BERT Pretraining Approach (https://arxiv.org/abs/1907.11692)
- PD12M: An Open Dataset for Product Recognition and Detection (https://github.com/SuryaKrishna02/PD12M)

## License

This model is released under the Apache License 2.0.

```
Copyright 2025 Surya Guthikonda

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
```