---
library_name: transformers
license: cc-by-nc-4.0
inference: false
---
# Web-SSL MAE ViT-3B: 2B MetaCLIP data, 224 Resolution
A 3 billion parameter Vision Transformer (ViT) trained with Masked Autoencoder (MAE) self-supervised learning on web-scale image data without language supervision. Introduced in ["Scaling Language-Free Visual Representation Learning"](https://arxiv.org/abs/2504.01017) (Fan et al., 2025).
## Model Details
- **Architecture**: ViT (3072 width, 26 depth, 24 heads)
- **Parameters**: 3B
- **Resolution**: 224×224 pixels
- **Training**: Self-supervised Web-MAE on 2B image samples from MetaCLIP web data
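A quick way to sanity-check these hyperparameters is to load only the model config (using the repository id from the Usage section below) and inspect the width, depth, and head count; this is a minimal sketch, not part of the released example code:

```python
from transformers import AutoConfig

# Load the config without downloading weights and compare against the numbers above
config = AutoConfig.from_pretrained('facebook/webssl-mae3b-full2b-224')
print(config.hidden_size, config.num_hidden_layers, config.num_attention_heads)
# Expected to print: 3072 26 24
```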
## Model Description
Web-SSL MAE 3B is a 3 billion parameter Vision Transformer trained with masked autoencoder self-supervised learning on 2 billion web images, without language supervision. The model demonstrates that purely visual learning, when scaled appropriately, can match or exceed language-supervised models such as CLIP across a range of vision tasks. Web-MAE is particularly strong on OCR and chart understanding while remaining competitive on traditional vision benchmarks and multimodal tasks.
<img src="webssl_teaser.png" alt="WebSSL Model Overview" width="600">
## Usage
```python
from transformers import AutoImageProcessor, ViTModel
import torch
from PIL import Image
# Adjust the size, crop_size, etc. fields to your liking
processor = AutoImageProcessor.from_pretrained('facebook/webssl-mae3b-full2b-224')
model = ViTModel.from_pretrained('facebook/webssl-mae3b-full2b-224').cuda().eval()
# Process an image
image = Image.open('path/to/image.jpg')
inputs = processor(images=image, return_tensors="pt").to('cuda')
with torch.no_grad():
    outputs = model(**inputs)
# Extract features from the encoder
encoder_hidden_states = outputs.last_hidden_state
```
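The encoder states can be reduced to a single image-level embedding in the usual ways; a minimal sketch, assuming the standard `ViTModel` token layout with a [CLS] token at index 0:

```python
# Image-level features from the encoder output (sketch, not prescribed by the paper)
cls_embedding = encoder_hidden_states[:, 0]                  # [CLS] token, shape (1, hidden_size)
patch_embedding = encoder_hidden_states[:, 1:].mean(dim=1)   # mean-pooled patch tokens
```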
## Citation
```bibtex
@article{fan2025scaling,
  title={Scaling Language-Free Visual Representation Learning},
  author={David Fan and Shengbang Tong and Jiachen Zhu and Koustuv Sinha and Zhuang Liu and Xinlei Chen and Michael Rabbat and Nicolas Ballas and Yann LeCun and Amir Bar and Saining Xie},
  year={2025},
  eprint={2504.01017},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```