A CLIP (Contrastive Language-Image Pre-training) model finetuned on EntityNet-33M. The base model is ViT-B-32/datacomp_xl_s13b_b90k.

See the project page for the paper, code, usage examples, metrics, etc.

The model has seen ~13B images at a batch size of 90k during pretraining, and ~0.6B images at a batch size of 32k during finetuning.
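
Since the base checkpoint is an OpenCLIP ViT-B-32, the finetuned weights can presumably be loaded through the `open_clip` library. The sketch below is a minimal, unofficial example that assumes the checkpoint is published in OpenCLIP format on the Hugging Face Hub; the repo id is a placeholder, so refer to the project page for the actual identifier and the authors' own usage examples.

```python
# Minimal usage sketch -- the repo id below is a PLACEHOLDER, not the real one.
import torch
import open_clip
from PIL import Image

repo = "hf-hub:ORG/entitynet-clip-vit-b-32"  # placeholder; see the project page
model, _, preprocess = open_clip.create_model_and_transforms(repo)
tokenizer = open_clip.get_tokenizer(repo)
model.eval()

# Encode one image and a few candidate captions.
image = preprocess(Image.open("example.jpg")).unsqueeze(0)
text = tokenizer(["a photo of a dog", "a photo of a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize embeddings and compute image-to-text similarity.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # probability of each caption matching the image
```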
