Web-SSL DINO ViT-3B: Heavily Filtered 2B MetaCLIP Data, 224 Resolution

A 3 billion parameter Vision Transformer (ViT) trained with DINOv2 self-supervised learning on heavily filtered web-scale image data without language supervision. Introduced in "Scaling Language-Free Visual Representation Learning" (Fan et al., 2025).

Model Details

  • Architecture: ViT (3072 width, 26 depth, 24 heads)
  • Parameters: 3B (2.95B exact; F32 safetensors weights)
  • Resolution: 224×224 pixels
  • Training: Self-supervised Web-DINO on heavily filtered MetaCLIP data
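The reported parameter count can be sanity-checked from the architecture bullets above. A minimal sketch using the standard per-block estimate for a pre-norm transformer (roughly 12 × width² parameters per block; embedding and head parameters are ignored here, so this is an approximation, not the exact count):

```python
# Rough ViT parameter estimate from the width/depth listed above.
# Each transformer block has ~12 * width^2 parameters:
#   attention (QKV + output projections): 4 * width^2
#   MLP with 4x hidden expansion:         8 * width^2
width, depth = 3072, 26
approx_params = 12 * width**2 * depth
print(f"{approx_params / 1e9:.2f}B")  # ~2.94B, consistent with the reported 2.95B
```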

Model Description

Web-SSL DINO 3B is a 3-billion-parameter Vision Transformer trained with self-supervised learning on heavily filtered web images, without language supervision. The "heavy2b" designation indicates training on a subset of images containing charts, tables, and documents with readable text (only 1.3% of the original MetaCLIP dataset). This focused filtering significantly improves OCR and chart-understanding capabilities while maintaining strong performance on other vision tasks. The model demonstrates that pure visual learning, when scaled appropriately, can match or exceed the performance of language-supervised models such as CLIP across a range of vision tasks.

(Figure: WebSSL model overview)

Usage

from transformers import AutoImageProcessor, Dinov2Model
import torch
from PIL import Image

processor = AutoImageProcessor.from_pretrained('facebook/webssl-dino3b-heavy2b-224')
model = Dinov2Model.from_pretrained('facebook/webssl-dino3b-heavy2b-224')

# Process an image
image = Image.open('path/to/image.jpg')
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

cls_features = outputs.last_hidden_state[:, 0]    # CLS token features
patch_features = outputs.last_hidden_state[:, 1:] # patch-wise token features
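For retrieval or image-similarity use cases, the token features above are typically pooled into a single embedding per image. A minimal sketch of one common choice, mean-pooling the patch tokens (the `embed_image` helper is illustrative and not part of the model card; using the CLS token instead is an equally valid alternative):

```python
import torch

def embed_image(model, processor, image):
    """Mean-pool patch tokens into one embedding per image."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # Drop the CLS token (index 0) and average the patch tokens.
    return out.last_hidden_state[:, 1:].mean(dim=1)  # shape: (1, hidden_dim)

# Hypothetical usage with the model and processor loaded above:
# emb_a = embed_image(model, processor, Image.open('a.jpg'))
# emb_b = embed_image(model, processor, Image.open('b.jpg'))
# similarity = torch.nn.functional.cosine_similarity(emb_a, emb_b).item()
```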

Citation

@article{fan2025scaling,
  title={Scaling Language-Free Visual Representation Learning}, 
  author={David Fan and Shengbang Tong and Jiachen Zhu and Koustuv Sinha and Zhuang Liu and Xinlei Chen and Michael Rabbat and Nicolas Ballas and Yann LeCun and Amir Bar and Saining Xie},
  year={2025},
  eprint={2504.01017},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}