JianLiao committed (verified)

Commit 4b12a14 · 1 Parent(s): 5387b87

Update README.md

Files changed (1):
  1. README.md +168 -2
README.md CHANGED
@@ -1,8 +1,174 @@
  ---
  tags:
- - clip
+ - clip
  library_name: open_clip
  pipeline_tag: zero-shot-image-classification
  license: mit
  ---
- # Model card for CLIP-ViT-L-14-spectrum-icons-20k
+
+ # Model card for CLIP-ViT-L-14-spectrum-icons-23k

# Table of Contents

1. [Model Details](#model-details)
2. [Uses](#uses)
3. [Training Details](#training-details)
4. [Evaluation](#evaluation)
5. [Acknowledgements](#acknowledgements)
6. [Citation](#citation)
7. [How To Get Started With the Model](#how-to-get-started-with-the-model)

# Model Details

## Model Description

This is a fine-tuned CLIP ViT-L/14 model based on the pretrained [`laion/CLIP-ViT-L-14-laion2B-s32B-b82K`](https://huggingface.co/laion/CLIP-ViT-L-14-laion2B-s32B-b82K) from LAION, adapted for improved text-to-image and image-to-text retrieval using a custom dataset of 23,000 PNG image-caption pairs ([JianLiao/spectrum-icons](https://huggingface.co/datasets/JianLiao/spectrum-icons)). Fine-tuning was performed with the OpenCLIP library on NVIDIA GPUs to specialize the model for abstract visual features and to improve retrieval-augmented generation (RAG) performance.

The base model was originally trained on the LAION-2B dataset, using natural language supervision to align visual and textual embeddings. This fine-tuning adapts the model further to a specific domain while maintaining generalization.

# Uses

## Direct Use

- Zero-shot image classification (see the sketch below).
- Text-to-image and image-to-image retrieval.
- Improving text-image alignment in abstract visual contexts.
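
As a minimal sketch of the zero-shot classification use case (the label set and prompt template below are illustrative and not part of the original card):

```python
import torch
from PIL import Image
import open_clip

# create_model_and_transforms returns (model, train_preprocess, eval_preprocess).
model, _, preprocess = open_clip.create_model_and_transforms(
    "hf-hub:JianLiao/CLIP-ViT-L-14-spectrum-icons-20k"
)
tokenizer = open_clip.get_tokenizer("hf-hub:JianLiao/CLIP-ViT-L-14-spectrum-icons-20k")
model.eval()

# Illustrative label set; replace with your own classes.
labels = ["a settings gear", "a trash can", "a calendar"]
text = tokenizer([f"an icon of {label}" for label in labels])
image = preprocess(Image.open("/path/to/icon.png")).unsqueeze(0)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(labels, probs[0].tolist())))
```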

## Downstream Use

- Fine-tuning for domain-specific image-text retrieval tasks.
- Integration into applications requiring enhanced semantic search.

# Training Details

## Training Data

The model was fine-tuned on 23,000 image-text caption pairs. The dataset pairs diverse, abstract visual elements with detailed textual descriptions to strengthen the model's handling of abstract queries and retrieval tasks.
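
To inspect the training data, the pairs can be streamed from the Hub with the `datasets` library. This is a sketch only; the split name and column layout are assumptions rather than details from this card:

```python
from datasets import load_dataset

# Stream the icon/caption pairs from the Hub (assumed split name "train";
# check ds.features / ds.column_names for the actual schema).
ds = load_dataset("JianLiao/spectrum-icons", split="train", streaming=True)

for example in ds.take(3):
    print(example)  # expected: an image field plus a caption/text field
```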

## Training Procedure

Fine-tuning was run with the OpenCLIP library on a machine with 6 NVIDIA RTX 3090 GPUs. Key hyperparameters (a configuration sketch follows below):

- **Learning Rate**: `5e-6` with cosine decay.
- **Batch Size**: `64` per GPU, for an effective global batch size of `384`.
- **Epochs**: `40`.
- **Precision**: Mixed precision (`amp_bf16`) for improved efficiency.
- **Augmentations**:
  - Color Jitter: `(0.2, 0.2, 0.1, 0.0)` with a probability of `0.7`.
  - Grayscale Probability: `0.2`.

Training used gradient checkpointing and distributed data parallelism (NCCL), with regular zero-shot evaluations; validation was performed after each epoch.
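
The augmentation settings above can be expressed through open_clip's `aug_cfg` when building the training transforms. The snippet below is a sketch that assumes a recent `open_clip_torch` release whose `AugmentationCfg` exposes `color_jitter_prob` and `gray_scale_prob`; the remaining hyperparameters (learning rate, epochs, precision) are passed to the training script rather than set here:

```python
import open_clip

# Recreate the base model and build train/eval transforms with the
# augmentations listed above (color jitter + random grayscale).
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms(
    "ViT-L-14",
    pretrained="laion2b_s32b_b82k",
    aug_cfg={
        "color_jitter": (0.2, 0.2, 0.1, 0.0),  # brightness, contrast, saturation, hue
        "color_jitter_prob": 0.7,
        "gray_scale_prob": 0.2,
    },
)
```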

# Evaluation

## Testing Data, Factors & Metrics

### Testing Data

The model was evaluated on a validation split held out from the 23,000 image-text pairs. Metrics were computed for both **image-to-text** and **text-to-image** retrieval.

### Metrics

1. **Recall at K**:
   - R@1, R@5, R@10 for image-to-text and text-to-image retrieval (see the computation sketch below).
2. **Mean Rank** and **Median Rank**:
   - Average and median positions of the correct match in retrieval.
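
For reference, here is a small sketch of how these retrieval metrics can be computed from an N×N similarity matrix between images and their paired captions. The helper below is written for illustration and is not taken from the original evaluation code:

```python
import torch

def retrieval_metrics(sim: torch.Tensor, ks=(1, 5, 10)):
    """sim[i, j] = similarity of image i and caption j; the correct caption for image i is j == i."""
    n = sim.size(0)
    order = sim.argsort(dim=1, descending=True)  # candidates sorted best-first per image
    ranks = (order == torch.arange(n).unsqueeze(1)).nonzero()[:, 1] + 1  # 1-indexed rank of the match
    metrics = {f"R@{k}": (ranks <= k).float().mean().item() for k in ks}
    metrics["mean_rank"] = ranks.float().mean().item()
    metrics["median_rank"] = ranks.float().median().item()
    return metrics

# Image-to-text uses sim directly; text-to-image uses sim.T.
```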

## Results

- **Image-to-Text Retrieval**:
  - R@1: ~70.0%
  - R@5: ~96.0%
  - R@10: ~97.8%
  - Mean Rank: ~2.24
  - Median Rank: ~1.0

- **Text-to-Image Retrieval**:
  - R@1: ~70.3%
  - R@5: ~96.4%
  - R@10: ~98.1%
  - Mean Rank: ~2.17
  - Median Rank: ~1.0

The results demonstrate robust alignment between visual and textual embeddings, with strong performance on both retrieval tasks.

# Acknowledgements

- The pretrained base model was developed by LAION and trained on the LAION-2B dataset.

# Citation

**BibTeX:**

```bibtex
@inproceedings{cherti2023reproducible,
  title={Reproducible scaling laws for contrastive language-image learning},
  author={Cherti, Mehdi and Beaumont, Romain and Wightman, Ross and Wortsman, Mitchell and Ilharco, Gabriel and Gordon, Cade and Schuhmann, Christoph and Schmidt, Ludwig and Jitsev, Jenia},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={2818--2829},
  year={2023}
}
```

OpenAI CLIP paper:

```bibtex
@inproceedings{Radford2021LearningTV,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Alec Radford and Jong Wook Kim and Chris Hallacy and A. Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
  booktitle={ICML},
  year={2021}
}
```

OpenCLIP software:

```bibtex
@software{ilharco_gabriel_2021_5143773,
  author       = {Ilharco, Gabriel and
                  Wortsman, Mitchell and
                  Wightman, Ross and
                  Gordon, Cade and
                  Carlini, Nicholas and
                  Taori, Rohan and
                  Dave, Achal and
                  Shankar, Vaishaal and
                  Namkoong, Hongseok and
                  Miller, John and
                  Hajishirzi, Hannaneh and
                  Farhadi, Ali and
                  Schmidt, Ludwig},
  title        = {OpenCLIP},
  month        = jul,
  year         = 2021,
  note         = {If you use this software, please cite it as below.},
  publisher    = {Zenodo},
  version      = {0.1},
  doi          = {10.5281/zenodo.5143773},
  url          = {https://doi.org/10.5281/zenodo.5143773}
}
```

# How to Get Started with the Model

Install `open_clip_torch` (along with `torch` and `Pillow`), then load the fine-tuned model from the Hub:

```python
import torch
from PIL import Image
import open_clip

# create_model_and_transforms returns (model, train_preprocess, eval_preprocess).
model, _, preprocess = open_clip.create_model_and_transforms(
    "hf-hub:JianLiao/CLIP-ViT-L-14-spectrum-icons-20k"
)
tokenizer = open_clip.get_tokenizer("hf-hub:JianLiao/CLIP-ViT-L-14-spectrum-icons-20k")
model.eval()

# Example: score candidate captions against an image.
text_inputs = tokenizer(["a description of the image", "another description of the image"])
image = preprocess(Image.open("/path/to/image.png")).unsqueeze(0)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_inputs)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
```
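
As a follow-up usage example for text-to-image retrieval (semantic search over an icon library), here is a self-contained sketch; the file names and query string are placeholders:

```python
import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "hf-hub:JianLiao/CLIP-ViT-L-14-spectrum-icons-20k"
)
tokenizer = open_clip.get_tokenizer("hf-hub:JianLiao/CLIP-ViT-L-14-spectrum-icons-20k")
model.eval()

# Placeholder candidate pool; in practice these would be your icon files.
paths = ["icon_a.png", "icon_b.png", "icon_c.png"]
images = torch.stack([preprocess(Image.open(p)) for p in paths])
query = tokenizer(["a colorful gradient arrow pointing right"])

with torch.no_grad():
    image_features = model.encode_image(images)
    text_features = model.encode_text(query)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    scores = (text_features @ image_features.T).squeeze(0)

# Rank candidates by similarity to the text query.
for path, score in sorted(zip(paths, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {path}")
```

For a real application, the image embeddings would typically be precomputed once and stored in an index so that only the text query is encoded at search time.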