DatologyAI CLIP Classification Optimized ViT-B/32

DatologyAI CLIP is a state-of-the-art contrastive vision-language model that achieves superior performance through advanced data curation alone, without any architectural or training modifications. This classification-optimized ViT-B/32 model outperforms SigLIP2, MetaCLIP, and DFN on zero-shot classification benchmarks.

Model Description

DatologyAI's CLIP model demonstrates that careful data curation can drive state-of-the-art performance without modifications to model architecture or training paradigms. Key achievements include:

  • 76.91% ImageNet1k accuracy (vs 74.0% for SigLIP2)
  • 8x training efficiency compared to standard approaches
  • Trained on 13B curated image-text pairs from DataComp
  • Standard CLIP architecture and training procedure

Intended Uses

You can use this model for zero-shot image classification or as a vision encoder for VLMs and other vision tasks.

Zero-shot Image Classification

import torch
from PIL import Image
import open_clip

# Load model and preprocessing
model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:DatologyAI/cls-opt-vit-b-32')
model.eval()
tokenizer = open_clip.get_tokenizer('hf-hub:DatologyAI/cls-opt-vit-b-32')

# Load image
image = preprocess(Image.open("path/to/image.jpg")).unsqueeze(0)

# Define candidate labels
labels = ["a dog", "a cat", "a bird"]
text = tokenizer(labels)

# Run inference
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    
    # Normalize features
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    
    # Calculate similarity
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
    
# Get predictions
values, indices = similarity[0].topk(3)
for value, index in zip(values, indices):
    print(f"{labels[index]}: {value.item():.2%}")

Image Encoding

import torch
from PIL import Image
import open_clip

# Load model
model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:DatologyAI/cls-opt-vit-b-32')
model.eval()

# Process image
image = preprocess(Image.open("path/to/image.jpg")).unsqueeze(0)

# Extract features
with torch.no_grad():
    image_features = model.encode_image(image)
    
print(f"Feature shape: {image_features.shape}")  # [1, 512]

Training Procedure

DatologyAI's training pipeline focuses on sophisticated data curation techniques including:

  1. Improved target distribution matching - Task-specific alignment of image features for classification
  2. Enhanced synthetic data generation - Optimized caption generation for classification tasks
  3. Predictive metrics for curation quality - Rapid iteration without full model training

The model uses standard CLIP training objectives with no architectural modifications.
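For orientation, here is a minimal sketch of the standard CLIP contrastive (symmetric InfoNCE) objective referred to above; the function name and tensor handling are illustrative assumptions, not DatologyAI's actual training code.

import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, logit_scale):
    # Project both embeddings onto the unit sphere
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Cosine similarities scaled by the learned temperature (logit_scale)
    logits_per_image = logit_scale * image_features @ text_features.T
    logits_per_text = logits_per_image.T

    # Matching image-text pairs sit on the diagonal of the similarity matrix
    targets = torch.arange(image_features.shape[0], device=image_features.device)

    # Symmetric cross-entropy over the image and text directions
    loss_i = F.cross_entropy(logits_per_image, targets)
    loss_t = F.cross_entropy(logits_per_text, targets)
    return (loss_i + loss_t) / 2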

Training Data

The model was trained on 13B image-text pairs (multi-epoch) curated from the DataComp-XL dataset using DatologyAI's proprietary curation pipeline. The curation process selected high-quality, classification-relevant subsets from the 10B available pairs in DataComp-XL.

Evaluation Results

Zero-shot Classification Performance

Benchmark    DatologyAI   SigLIP2   MetaCLIP
ImageNet1k   76.91%       74.0%     67.7%
ImageNetv2   70.2%        67.1%     60.4%

Training Efficiency

  • Matches SigLIP2 performance with only 5B samples (87.5% compute reduction)
  • Matches MetaCLIP performance with only 1B samples (92% compute reduction)

For full details, see the blog post.

Model Details

  • Developed by: DatologyAI
  • Model type: CLIP (Contrastive Language-Image Pre-training)
  • Architecture: Vision Transformer B/32
  • License: Apache 2.0
  • Training framework: OpenCLIP 2.24.0

Technical Specifications

Model Architecture

  • Vision Encoder: ViT-B/32 (86M parameters)
    • Patch size: 32×32
    • Image size: 224×224
    • Embedding dimension: 512
  • Text Encoder: 12-layer Transformer
    • Context length: 77 tokens
    • Vocabulary size: 49,408 (BPE tokenizer)
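The figures above can be checked directly against the released checkpoint. The snippet below is an illustrative sanity check (not part of the official card) that uses only standard open_clip calls.

import torch
import open_clip

model, _, _ = open_clip.create_model_and_transforms('hf-hub:DatologyAI/cls-opt-vit-b-32')
tokenizer = open_clip.get_tokenizer('hf-hub:DatologyAI/cls-opt-vit-b-32')
model.eval()

# Parameter counts for the vision tower and the full model
visual_params = sum(p.numel() for p in model.visual.parameters())
total_params = sum(p.numel() for p in model.parameters())
print(f"Vision encoder parameters: {visual_params / 1e6:.0f}M")
print(f"Total parameters: {total_params / 1e6:.0f}M")

# Context length (77 tokens) and shared embedding dimension (512)
text = tokenizer(["a photo of a dog"])
print(f"Tokenized text shape: {text.shape}")  # [1, 77]
with torch.no_grad():
    print(f"Text embedding shape: {model.encode_text(text).shape}")  # [1, 512]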

Training Configuration

  • Optimizer: AdamW (β1=0.9, β2=0.98, ε=1e-6)
  • Learning rate: 5.0e-04 with cosine schedule
  • Weight decay: 0.1
  • Batch size: 32,768
  • Training samples: 13B image-text pairs
  • Hardware: Distributed training on H100 GPUs
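As a rough sketch, this configuration maps onto a standard PyTorch optimizer setup as shown below; the step count and scheduler wiring are back-of-the-envelope assumptions for illustration, not the exact training script (OpenCLIP, for instance, typically excludes gains and biases from weight decay and adds a warmup phase).

import torch
import open_clip

model, _, _ = open_clip.create_model_and_transforms('hf-hub:DatologyAI/cls-opt-vit-b-32')

# Hyperparameters as listed above
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5.0e-4,
    betas=(0.9, 0.98),
    eps=1e-6,
    weight_decay=0.1,
)

# Cosine learning-rate schedule over the full run:
# ~13B samples / 32,768 per batch ≈ 400k optimizer steps
total_steps = 13_000_000_000 // 32_768
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)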

Citation

If you use this model, please cite:

@article{datologyai2025clip,
  title={CLIP Gets a Data Upgrade: Outperforming SoTA with Improved Data Curation Only},
  author={DatologyAI Team},
  journal={DatologyAI Blog},
  year={2025},
  url={https://datologyai.com/blog/clip-data-upgrade}
}

Additional Information

For more details on our data curation methodology and comprehensive benchmark results, please visit our blog post.

Contact: [email protected]

Model Card Contact

DatologyAI Team - [email protected]
