Tasks

Zero-Shot Image Classification

Zero-shot image classification is the task of classifying images into classes that the model did not see during training.

For example, given an input image and the candidate classes cat, dog, and bird, a zero-shot image classification model might output:

cat: 0.664
dog: 0.329
bird: 0.008

About Zero-Shot Image Classification

About the Task

Zero-shot image classification is a computer vision task in which images are classified into one of several classes, without the model having been trained on labeled examples of those classes.

Zero-shot image classification works by transferring knowledge learned while training one model to classify novel classes that were not present in the training data, making it a variation of transfer learning. For instance, a model trained to differentiate cars from airplanes can be used to classify images of ships.

The data in this learning paradigm consists of:

  • Seen data - images and their corresponding labels
  • Unseen data - only labels and no images
  • Auxiliary information - additional information given to the model during training that connects the unseen and seen data, such as textual descriptions or word embeddings (a minimal sketch of this idea follows the list).
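
To make the role of auxiliary information concrete, here is a minimal sketch of how a CLIP-style model compares an image embedding against the text embeddings of candidate labels. The checkpoint name, image path, and label phrasing are illustrative assumptions, not the only way to do this:

from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model_name = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

# The candidate labels act as the auxiliary textual information:
# the model has no classification head trained on these classes.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a bird"]
image = Image.open("path_to_cat_and_dog_image.jpeg")  # hypothetical image path

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores turned into probabilities over the labels
probs = outputs.logits_per_image.softmax(dim=1)[0]
for label, prob in zip(labels, probs.tolist()):
    print(f"{label}: {prob:.3f}")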

Use Cases

Image Retrieval

Zero-shot learning resolves several challenges in image retrieval systems. For example, with the rapid growth of categories on the web, it is challenging to index images based on unseen categories. With zero-shot learning we can associate unseen categories with images by exploiting attributes that model the relationships between visual features and labels.
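
As a rough sketch of this idea (the checkpoint, image paths, and query below are illustrative assumptions, not a complete retrieval system), a CLIP-style text embedding for an unseen category can be used to rank a collection of images by cosine similarity:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_name = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

# Hypothetical image collection to index
image_paths = ["img_0001.jpeg", "img_0002.jpeg", "img_0003.jpeg"]
images = [Image.open(path) for path in image_paths]

with torch.no_grad():
    image_inputs = processor(images=images, return_tensors="pt")
    image_embeds = model.get_image_features(**image_inputs)
    text_inputs = processor(text=["a photo of a ship"], return_tensors="pt", padding=True)
    text_embeds = model.get_text_features(**text_inputs)

# Rank images by cosine similarity to the unseen-category query
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
similarity = (image_embeds @ text_embeds.T).squeeze(-1)
for idx in similarity.argsort(descending=True).tolist():
    print(image_paths[idx], f"{similarity[idx].item():.3f}")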

Action Recognition

Action recognition is the task of identifying when a person in an image or video is performing a given action from a set of actions. Conventional deep learning models fail if not all possible actions are known beforehand. With zero-shot learning, for a given domain of actions, we can create a mapping that connects low-level visual features to a semantic description of the auxiliary data, and use it to classify unknown classes of actions.
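
In the simplest case, the same zero-shot-image-classification pipeline described in the Inference section below can be applied to individual video frames. A minimal sketch, where the frame path and action labels are illustrative assumptions:

from transformers import pipeline

classifier = pipeline("zero-shot-image-classification",
                      model="openai/clip-vit-large-patch14-336")

# A single frame extracted from a video (path is hypothetical)
frame = "frame_from_video.jpeg"
action_labels = ["riding a bicycle", "playing the guitar", "cooking a meal"]

predictions = classifier(frame, candidate_labels=action_labels)
print(predictions[0]["label"], predictions[0]["score"])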

Task Variants

You can contribute variants of this task here.

Inference

The model can be loaded with the zero-shot-image-classification pipeline like so:

from transformers import pipeline

# More models in the model hub.
model_name = "openai/clip-vit-large-patch14-336"
classifier = pipeline("zero-shot-image-classification", model=model_name)

You can then use this pipeline to classify images into any of the class names you specify; you are not limited to two labels and can pass as many candidate labels as you like.

image_to_classify = "path_to_cat_and_dog_image.jpeg"
labels_for_classification = ["cat and dog",
                             "lion and cheetah",
                             "rabbit and lion"]
scores = classifier(image_to_classify,
                    candidate_labels=labels_for_classification)

After inference, the classifier returns a list of dictionaries, which is stored in the variable scores in the code snippet above. scores looks as follows:

[{'score': 0.9950482249259949, 'label': 'cat and dog'},
{'score': 0.004863627254962921, 'label': 'rabbit and lion'},
{'score': 8.816882473183796e-05, 'label': 'lion and cheetah'}]

The dictionary at the zeroth index of the list will contain the label with the highest score.

print(f"The highest score is {scores[0]['score']:.3f} for the label {scores[0]['label']}")

The output from the print statement above would look as follows:

The highest score is 0.995 for the label cat and dog

Useful Resources

This page was made possible thanks to the efforts of Shamima Hossain, Haider Zaidi and Paarth Bhatnagar.

Compatible libraries

Models for Zero-Shot Image Classification

Datasets for Zero-Shot Image Classification

No example dataset is defined for this task.

Note: Contribute by proposing a dataset for this task!

Spaces using Zero-Shot Image Classification

Note: An application to compare different zero-shot image classification models.

Metrics for Zero-Shot Image Classification
top-K accuracy
Measures how often the correct label appears among the K highest-scoring predicted labels.
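
As a minimal sketch of how this metric could be computed (the function name and data layout are illustrative, not a specific library's API):

def top_k_accuracy(ranked_predictions, true_labels, k=5):
    # ranked_predictions: one list of labels per example, sorted from most to least likely
    # true_labels: the ground-truth label for each example
    hits = sum(truth in ranked[:k] for ranked, truth in zip(ranked_predictions, true_labels))
    return hits / len(true_labels)

# Example: the true label appears in the top 2 predictions for 2 of 3 examples -> 0.667
print(top_k_accuracy([["cat", "dog"], ["dog", "bird"], ["bird", "cat"]],
                     ["cat", "bird", "dog"], k=2))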