---
license: apache-2.0
pipeline_tag: zero-shot-image-classification
tags:
- datology
- clip
- vision
- OpenCLIP
- datacomp
- image-text-retrieval
- multimodal
---

# DatologyAI CLIP Retrieval Optimized ViT-B/32

**DatologyAI CLIP Retrieval** is a state-of-the-art contrastive vision-language model optimized for image-text retrieval through advanced data curation. This retrieval-optimized ViT-B/32 model matches SigLIP2's MSCOCO retrieval performance while training on roughly half as many samples (20B vs. 40B).

## Model Description

DatologyAI's retrieval-optimized CLIP model demonstrates superior performance on retrieval benchmarks through targeted data curation strategies:

- **State-of-the-art MSCOCO performance** for ViT-B/32 models
- **2x training efficiency** compared to SigLIP2
- Optimized for text-based distribution alignment
- Standard CLIP architecture with retrieval-focused data curation

## Intended Uses

This model is optimized for image-text retrieval tasks, cross-modal search, and multimodal understanding applications.

### Image-to-Text Retrieval

```python
import torch
from PIL import Image
import open_clip

# Load model and preprocessing
model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:DatologyAI/retr-opt-vit-b-32')
tokenizer = open_clip.get_tokenizer('hf-hub:DatologyAI/retr-opt-vit-b-32')
model.eval()

# Load and process image
image = preprocess(Image.open("path/to/image.jpg")).unsqueeze(0)

# Define text candidates
texts = [
    "a photo of a cat",
    "a dog playing in the park",
    "a beautiful sunset over the ocean",
    "people walking in a city"
]
text_tokens = tokenizer(texts)

# Compute similarities
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)
    
    # Normalize features
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    
    # Calculate similarity
    similarity = (100.0 * image_features @ text_features.T)
    
# Get top matches
values, indices = similarity[0].topk(len(texts))
for idx, score in zip(indices, values):
    print(f"{texts[idx]}: {score.item():.2f}")
```

### Text-to-Image Retrieval

```python
import torch
import open_clip
from typing import List, Tuple

def retrieve_images(query: str, image_features: torch.Tensor, top_k: int = 5) -> Tuple[List[int], List[float]]:
    """
    Retrieve top-k images for a text query
    
    Args:
        query: Text description to search for
        image_features: Pre-computed normalized image features [N, 512]
        top_k: Number of images to retrieve
    """
    # Encode text query
    text_tokens = tokenizer([query])
    with torch.no_grad():
        text_features = model.encode_text(text_tokens)
        text_features /= text_features.norm(dim=-1, keepdim=True)
    
    # Compute similarities
    similarities = (100.0 * text_features @ image_features.T).squeeze()
    
    # Get top-k matches
    values, indices = similarities.topk(top_k)
    return indices.tolist(), values.tolist()

# Example usage
model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:DatologyAI/retr-opt-vit-b-32')
tokenizer = open_clip.get_tokenizer('hf-hub:DatologyAI/retr-opt-vit-b-32')
model.eval()

# Pre-compute image features for your dataset by encoding each image with
# model.encode_image(...) and L2-normalizing the result
# image_features = ...  # Shape: [num_images, 512]

# Search for images (assumes image_features has been computed as above)
indices, scores = retrieve_images("a red sports car", image_features)
```

## Training Procedure

DatologyAI's retrieval-optimized pipeline employs specialized curation techniques:

1. **Text-aligned distribution matching** - Prioritizes alignment along text representations for retrieval tasks
2. **Retrieval-specific synthetic data** - Optimized caption generation for cross-modal understanding  
3. **Balanced multimodal representation** - Ensures strong performance in both directions

The model uses standard CLIP contrastive objectives without architectural modifications.
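For reference, the standard CLIP contrastive (symmetric InfoNCE) objective can be sketched as follows. This is an illustrative re-implementation, not DatologyAI's training code; the function name and the example batch are placeholders.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale=100.0):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # Normalize so that dot products are cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity logits, scaled by the (learnable) temperature
    logits_per_image = logit_scale * image_features @ text_features.T
    logits_per_text = logits_per_image.T

    # The i-th image matches the i-th caption, so targets lie on the diagonal
    labels = torch.arange(image_features.shape[0], device=image_features.device)
    return (F.cross_entropy(logits_per_image, labels) +
            F.cross_entropy(logits_per_text, labels)) / 2

# Example with random embeddings: batch of 8 pairs, embedding dimension 512
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```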

## Training Data

The model was trained on image-text pairs curated from the **DataComp-XL** dataset using DatologyAI's retrieval-optimized curation pipeline, selecting high-quality pairs that enhance cross-modal alignment.

## Evaluation Results

### Retrieval Performance

| Benchmark | Metric | DatologyAI | SigLIP2 | MetaCLIP |
|-----------|--------|------------|---------|----------|
| **MSCOCO** | Retrieval@1 | 55.53% | 55.45% | 46.6% |
| **Flickr30K** | Retrieval@1 | 79.7% | 82.4% | 72.9% |

### Training Efficiency
- Matches SigLIP2 MSCOCO performance with **50% fewer samples** (20B vs 40B)
- Exceeds MetaCLIP by >5% absolute on both benchmarks

## Model Details

- **Developed by:** DatologyAI
- **Model type:** CLIP (Contrastive Language-Image Pre-training)
- **Architecture:** Vision Transformer B/32
- **License:** Apache 2.0
- **Training framework:** OpenCLIP 2.24.0
- **Optimization focus:** Image-text retrieval

## Technical Specifications

### Model Architecture
- **Vision Encoder:** ViT-B/32 (86M parameters)
  - Patch size: 32×32
  - Image size: 224×224
  - Embedding dimension: 512
- **Text Encoder:** 12-layer Transformer
  - Context length: 77 tokens
  - Vocabulary size: 49,408 (BPE tokenizer)
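These dimensions are easy to verify against the loaded checkpoint. A small sanity-check sketch (the zero image tensor is just a placeholder for a preprocessed image):

```python
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:DatologyAI/retr-opt-vit-b-32')
tokenizer = open_clip.get_tokenizer('hf-hub:DatologyAI/retr-opt-vit-b-32')
model.eval()

with torch.no_grad():
    dummy_image = torch.zeros(1, 3, 224, 224)      # placeholder preprocessed image
    text_tokens = tokenizer(["a photo of a cat"])  # padded/truncated to the context length
    image_features = model.encode_image(dummy_image)
    text_features = model.encode_text(text_tokens)

print(image_features.shape, text_features.shape)  # both torch.Size([1, 512])
print(text_tokens.shape)                          # torch.Size([1, 77])
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters in total")
```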

### Training Configuration
- **Optimizer:** AdamW (β1=0.9, β2=0.98, ε=1e-6)
- **Learning rate:** 1e-3 with cosine schedule
- **Weight decay:** 0.1
- **Batch size:** 32,768
- **Training approach:** Retrieval-optimized data curation
- **Hardware:** Distributed training on H100 GPUs
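In plain PyTorch, this configuration maps roughly onto the snippet below. It is an illustrative sketch only; the warmup length and total step count are placeholders, not published training values.

```python
import torch

model = torch.nn.Linear(512, 512)  # stand-in for the CLIP model being trained

# AdamW with the hyperparameters listed above
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.98),
    eps=1e-6,
    weight_decay=0.1,
)

# Cosine decay after a linear warmup (step counts are placeholders)
total_steps, warmup_steps = 100_000, 2_000
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[
        torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1e-3, total_iters=warmup_steps),
        torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps),
    ],
    milestones=[warmup_steps],
)
```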

## Usage Tips

1. **Feature Caching**: For large-scale retrieval, pre-compute and cache image features (see the sketch after this list)
2. **Batch Processing**: Process multiple queries simultaneously for efficiency
3. **Normalization**: Always normalize features before computing similarities
4. **Temperature Scaling**: Adjust similarity temperature for different use cases
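A minimal sketch that combines these tips: cache normalized image features once, then score batched text queries against the cache, optionally applying a softmax with a temperature. The file names and the 0.07 temperature are illustrative choices, not values shipped with the model.

```python
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:DatologyAI/retr-opt-vit-b-32')
tokenizer = open_clip.get_tokenizer('hf-hub:DatologyAI/retr-opt-vit-b-32')
model.eval()

# 1. Cache normalized image features once (image paths are placeholders)
image_paths = ["img_0001.jpg", "img_0002.jpg"]
with torch.no_grad():
    images = torch.stack([preprocess(Image.open(p)) for p in image_paths])
    image_features = model.encode_image(images)
    image_features /= image_features.norm(dim=-1, keepdim=True)
torch.save(image_features, "image_features.pt")

# 2. At query time: load the cache and score a batch of text queries against it
image_features = torch.load("image_features.pt")
queries = ["a red sports car", "a bowl of fruit"]
with torch.no_grad():
    text_features = model.encode_text(tokenizer(queries))
    text_features /= text_features.norm(dim=-1, keepdim=True)

scores = text_features @ image_features.T  # cosine similarities in [-1, 1]
best_image_per_query = scores.argmax(dim=-1)

# 3. Temperature scaling before a softmax: a lower temperature sharpens the
#    resulting distribution over cached images
probs = (scores / 0.07).softmax(dim=-1)
```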

## Citation

If you use this model, please cite:

```bibtex
@article{datologyai2025clip,
  title={CLIP Gets a Data Upgrade: Outperforming SoTA with Improved Data Curation Only},
  author={DatologyAI Team},
  journal={DatologyAI Blog},
  year={2025},
  url={https://datologyai.com/blog/clip-data-upgrade}
}
```

## Additional Information

For more details on our data curation methodology and comprehensive benchmark results, please visit our [blog post](https://datologyai.com/blog/clip-data-upgrade).

**Contact:** [[email protected]](mailto:[email protected])

## Model Card Contact

DatologyAI Team - [[email protected]](mailto:[email protected])