license: apple-amlr
library_name: mobileclip
MobileCLIP2: Improving Multi-Modal Reinforced Training
MobileCLIP2 was introduced in MobileCLIP2: Improving Multi-Modal Reinforced Training (TMLR August 2025 Featured), by Fartash Faghri, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, Alexander T Toshev, Oncel Tuzel, Hadi Pouransari.
This repository contains the CoCa checkpoint pretrained on DFN-2B dataset and fine-tuned on varying datasets.
Highlights
- MobileCLIP2-S4matches the accuracy of SigLIP-SO400M/14 with 2x fewer parameters and surpasses DFN ViT-L/14 at 2.5x lower latency measured on iPhone12 Pro Max.
- MobileCLIP-S3/S4are our new architectures trained on MobileCLIP’s training dataset, DataCompDR-1B (dashed lines).
- Our smallest variant MobileCLIP-S0obtains similar zero-shot performance as OpenAI's ViT-B/16 model while being 4.8x faster and 2.8x smaller.
- MobileCLIP-S2obtains better avg zero-shot performance than SigLIP's ViT-B/16 model while being 2.3x faster and 2.1x smaller, and trained with 3x less seen samples.
- MobileCLIP-B (LT)attains zero-shot ImageNet performance of 77.2% which is significantly better than recent works like DFN and SigLIP with similar architectures or even OpenAI's ViT-L/14@336.
Checkpoints
| Model | # Seen Samples (B) | # Params (M) (img + txt) | Latency (ms) (img + txt) | IN-1k Zero-Shot Top-1 Acc. (%) | Avg. Perf. (%) on 38 datasets | 
|---|---|---|---|---|---|
| MobileCLIP2-S0 | 13 | 11.4 + 63.4 | 1.5 + 3.3 | 71.5 | 59.7 | 
| MobileCLIP2-S2 | 13 | 35.7 + 63.4 | 3.6 + 3.3 | 77.2 | 64.1 | 
| MobileCLIP2-B | 13 | 86.3 + 63.4 | 10.4 + 3.3 | 79.4 | 65.8 | 
| MobileCLIP2-S3 | 13 | 125.1 + 123.6 | 8.0 + 6.6 | 80.7 | 66.8 | 
| MobileCLIP2-L/14 | 13 | 304.3 + 123.6 | 57.9 + 6.6 | 81.9 | 67.8 | 
| MobileCLIP2-S4 | 13 | 321.6 + 123.6 | 19.6 + 6.6 | 81.9 | 67.5 | 
| MobileCLIP-S0 | 13 | 11.4 + 42.4 | 1.5 + 1.6 | 67.8 | 58.1 | 
| MobileCLIP-S1 | 13 | 21.5 + 63.4 | 2.5 + 3.3 | 72.6 | 61.3 | 
| MobileCLIP-S2 | 13 | 35.7 + 63.4 | 3.6 + 3.3 | 74.4 | 63.7 | 
| MobileCLIP-B | 13 | 86.3 + 63.4 | 10.4 + 3.3 | 76.8 | 65.2 | 
| MobileCLIP-B (LT) | 36 | 86.3 + 63.4 | 10.4 + 3.3 | 77.2 | 65.8 | 
| MobileCLIP-S3 | 13 | 125.1 + 123.6 | 8.0 + 6.6 | 78.3 | 66.3 | 
| MobileCLIP-L/14 | 13 | 304.3 + 123.6 | 57.9 + 6.6 | 79.5 | 66.9 | 
| MobileCLIP-S4 | 13 | 321.6 + 123.6 | 19.6 + 6.6 | 79.4 | 68.1 | 
How to Use
First, download the desired checkpoint visiting one of the links in the table above, then click the Files and versions tab, and download the PyTorch checkpoint.
For programmatic downloading, if you have huggingface_hub installed, you can also run:
hf download apple/mobileclip2_coca_dfn2b_s13b_<finetune-dataset>_context<length>
For models length with context lengths 128/256, copy config.json to src/open_clip/model_configs/coca_ViT-L-14-context$len.json and change the model name in below example to coca_ViT-L-14-context$len.
import torch
import open_clip
from PIL import Image
model, _, preprocess = open_clip.create_model_and_transforms('coca_ViT-L-14', pretrained='/path/to/mobileclip2_coca.pt')
model.eval()
image = preprocess(Image.open("docs/fig_accuracy_latency.png").convert('RGB')).unsqueeze(0)
with torch.no_grad(), torch.cuda.amp.autocast():
    syn_text = model.generate(
        image,
        generation_type="top_p",
        top_p=0.9,
        fixed_output_length=True
    )[0]
    syn_text = open_clip.decode(syn_text).split("<end_of_text>")[0].split("<start_of_text>")[-1].split(".")[0].rstrip()
print("Caption:", syn_text)

