---
license: apache-2.0
pipeline_tag: zero-shot-image-classification
tags:
- datology
- clip
- vision
- OpenCLIP
- datacomp
- image-text-retrieval
- multimodal
---
# DatologyAI CLIP Retrieval Optimized ViT-B/32
**DatologyAI CLIP Retrieval** is a state-of-the-art contrastive vision-language model optimized for image-text retrieval tasks through advanced data curation. This retrieval-optimized ViT-B/32 model achieves competitive performance with SigLIP2 while requiring significantly less compute.
## Model Description
DatologyAI's retrieval-optimized CLIP model demonstrates superior performance on retrieval benchmarks through targeted data curation strategies:
- **State-of-the-art MSCOCO performance** for ViT-B/32 models
- **2x training efficiency** compared to SigLIP2
- Optimized for text-based distribution alignment
- Standard CLIP architecture with retrieval-focused data curation
## Intended Uses
This model is optimized for image-text retrieval tasks, cross-modal search, and multimodal understanding applications.
### Image-to-Text Retrieval
```python
import torch
from PIL import Image
import open_clip
# Load model and preprocessing
model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:DatologyAI/retr-opt-vit-b-32')
tokenizer = open_clip.get_tokenizer('hf-hub:DatologyAI/retr-opt-vit-b-32')
# Load and process image
image = preprocess(Image.open("path/to/image.jpg")).unsqueeze(0)
# Define text candidates
texts = [
    "a photo of a cat",
    "a dog playing in the park",
    "a beautiful sunset over the ocean",
    "people walking in a city",
]
text_tokens = tokenizer(texts)

# Compute similarities
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)

    # Normalize features
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Calculate similarity
    similarity = 100.0 * image_features @ text_features.T

# Get top matches
values, indices = similarity[0].topk(len(texts))
for idx, score in zip(indices, values):
    print(f"{texts[idx]}: {score.item():.2f}")
```
### Text-to-Image Retrieval
```python
import torch
import open_clip
from typing import List, Tuple

def retrieve_images(query: str, image_features: torch.Tensor, top_k: int = 5) -> Tuple[List[int], List[float]]:
    """
    Retrieve top-k images for a text query.

    Args:
        query: Text description to search for
        image_features: Pre-computed normalized image features [N, 512]
        top_k: Number of images to retrieve
    """
    # Encode and normalize the text query
    text_tokens = tokenizer([query])
    with torch.no_grad():
        text_features = model.encode_text(text_tokens)
        text_features /= text_features.norm(dim=-1, keepdim=True)

    # Compute similarities against the pre-computed image features
    similarities = (100.0 * text_features @ image_features.T).squeeze()

    # Get top-k matches
    values, indices = similarities.topk(top_k)
    return indices.tolist(), values.tolist()

# Example usage
model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:DatologyAI/retr-opt-vit-b-32')
tokenizer = open_clip.get_tokenizer('hf-hub:DatologyAI/retr-opt-vit-b-32')

# Pre-compute normalized image features for your dataset
# image_features = ...  # Shape: [num_images, 512]

# Search for images
# indices, scores = retrieve_images("a red sports car", image_features)
```
## Training Procedure
DatologyAI's retrieval-optimized pipeline employs specialized curation techniques:
1. **Text-aligned distribution matching** - Prioritizes alignment along text representations for retrieval tasks
2. **Retrieval-specific synthetic data** - Optimized caption generation for cross-modal understanding
3. **Balanced multimodal representation** - Ensures strong performance in both retrieval directions (image-to-text and text-to-image)
The model uses standard CLIP contrastive objectives without architectural modifications.
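For readers less familiar with that objective, a minimal sketch of the symmetric contrastive (InfoNCE) loss used in standard CLIP training is shown below; the function and variable names are illustrative and are not taken from our training code.
```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          logit_scale: torch.Tensor) -> torch.Tensor:
    """Symmetric image-text contrastive loss over a batch of paired, L2-normalized features."""
    logits_per_image = logit_scale * image_features @ text_features.T  # [B, B] similarity logits
    logits_per_text = logits_per_image.T
    targets = torch.arange(image_features.shape[0], device=image_features.device)
    loss_i = F.cross_entropy(logits_per_image, targets)  # image -> text direction
    loss_t = F.cross_entropy(logits_per_text, targets)   # text -> image direction
    return (loss_i + loss_t) / 2
```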
## Training Data
The model was trained on image-text pairs curated from the **DataComp-XL** dataset using DatologyAI's retrieval-optimized curation pipeline, selecting high-quality pairs that enhance cross-modal alignment.
## Evaluation Results
### Retrieval Performance
| Benchmark | Metric | DatologyAI | SigLIP2 | MetaCLIP |
|-----------|--------|------------|---------|----------|
| **MSCOCO** | Retrieval@1 | 55.53% | 55.45% | 46.6% |
| **Flickr30K** | Retrieval@1 | 79.7% | 82.4% | 72.9% |
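
Retrieval@1 here is standard Recall@1. As a reference, a simplified sketch of how it can be computed from precomputed, normalized features is shown below (text-to-image direction; it assumes one matching image per text and is not the exact evaluation harness used for the numbers above).
```python
import torch

def recall_at_1(text_features: torch.Tensor, image_features: torch.Tensor) -> float:
    """Fraction of texts whose nearest image (by cosine similarity) is the paired one.

    Both inputs are [N, D] L2-normalized features, with row i of each forming a pair.
    """
    similarity = text_features @ image_features.T       # [N, N] cosine similarities
    nearest = similarity.argmax(dim=-1)                  # best-matching image per text
    targets = torch.arange(len(text_features), device=text_features.device)
    return (nearest == targets).float().mean().item()
```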
### Training Efficiency
- Matches SigLIP2 MSCOCO performance with **50% fewer samples** (20B vs 40B)
- Exceeds MetaCLIP by >5% absolute on both benchmarks
## Model Details
- **Developed by:** DatologyAI
- **Model type:** CLIP (Contrastive Language-Image Pre-training)
- **Architecture:** Vision Transformer B/32
- **License:** Apache 2.0
- **Training framework:** OpenCLIP 2.24.0
- **Optimization focus:** Image-text retrieval
## Technical Specifications
### Model Architecture
- **Vision Encoder:** ViT-B/32 (86M parameters)
  - Patch size: 32×32
  - Image size: 224×224
  - Embedding dimension: 512
- **Text Encoder:** 12-layer Transformer
  - Context length: 77 tokens
  - Vocabulary size: 49,408 (BPE tokenizer)
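These dimensions can be sanity-checked after loading the model with OpenCLIP; the short snippet below is illustrative (the dummy image and caption are placeholders).
```python
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:DatologyAI/retr-opt-vit-b-32')
tokenizer = open_clip.get_tokenizer('hf-hub:DatologyAI/retr-opt-vit-b-32')
model.eval()

image = preprocess(Image.new("RGB", (224, 224))).unsqueeze(0)  # [1, 3, 224, 224]
text = tokenizer(["a photo of a cat"])                         # [1, 77] token ids

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

print(image_features.shape)  # torch.Size([1, 512]) -- shared embedding dimension
print(text_features.shape)   # torch.Size([1, 512])
print(f"vision parameters: {sum(p.numel() for p in model.visual.parameters()):,}")
```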
### Training Configuration
- **Optimizer:** AdamW (β1=0.9, β2=0.98, ε=1e-6)
- **Learning rate:** 1e-3 with cosine schedule
- **Weight decay:** 0.1
- **Batch size:** 32,768
- **Training approach:** Retrieval-optimized data curation
- **Hardware:** Distributed training on H100 GPUs
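A minimal sketch of how these hyperparameters map onto a PyTorch optimizer and cosine schedule is shown below; it is illustrative only (OpenCLIP handles this internally, and `warmup_steps`/`total_steps` are placeholders not listed above).
```python
import math
import torch

def build_optimizer_and_schedule(model: torch.nn.Module, total_steps: int, warmup_steps: int = 2000):
    """AdamW with the hyperparameters listed above, plus linear warmup into a cosine decay."""
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=1e-3,
        betas=(0.9, 0.98),
        eps=1e-6,
        weight_decay=0.1,
    )

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)                      # linear warmup
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))           # cosine decay to 0

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```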
## Usage Tips
1. **Feature Caching**: For large-scale retrieval, pre-compute and cache image features (see the sketch after this list)
2. **Batch Processing**: Process multiple queries simultaneously for efficiency
3. **Normalization**: Always normalize features before computing similarities
4. **Temperature Scaling**: Adjust similarity temperature for different use cases
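The sketch below combines tips 1-4: it encodes an image collection in batches, caches the normalized features, and scores queries against the cache. It reuses the `model`, `preprocess`, and `tokenizer` objects loaded earlier; the paths, batch size, and temperature are placeholders.
```python
import torch
from PIL import Image

@torch.no_grad()
def build_image_index(image_paths, batch_size: int = 64) -> torch.Tensor:
    """Encode images in batches (Tip 2) and return L2-normalized features [N, 512]."""
    features = []
    for start in range(0, len(image_paths), batch_size):
        batch = [preprocess(Image.open(p).convert("RGB")) for p in image_paths[start:start + batch_size]]
        feats = model.encode_image(torch.stack(batch))
        features.append(feats / feats.norm(dim=-1, keepdim=True))   # Tip 3: normalize
    return torch.cat(features)

# Tip 1: cache once, reuse for every query
# image_features = build_image_index(my_image_paths)
# torch.save(image_features, "image_features.pt")

# Tip 4: a softmax temperature controls how peaked the retrieval distribution is
# probs = ((text_features @ image_features.T) / temperature).softmax(dim=-1)
```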
## Citation
If you use this model, please cite:
```bibtex
@article{datologyai2025clip,
title={CLIP Gets a Data Upgrade: Outperforming SoTA with Improved Data Curation Only},
author={DatologyAI Team},
journal={DatologyAI Blog},
year={2025},
url={https://datologyai.com/blog/clip-data-upgrade}
}
```
## Additional Information
For more details on our data curation methodology and comprehensive benchmark results, please visit our [blog post](https://datologyai.com/blog/clip-data-upgrade).
**Contact:** [[email protected]](mailto:[email protected])
## Model Card Contact
DatologyAI Team - [[email protected]](mailto:[email protected])