OpenCLIP ViT-L/14 with Test-Time Register

Register tokens were introduced in Vision Transformers Need Registers as learnable tokens that mitigate artifacts in the intermediate feature maps of ViTs. In Vision Transformers Don't Need Trained Registers, we introduced a training-free method for creating registers. These test-time registers serve a similar purpose to the original trained registers, but can be added post hoc to any ViT to mitigate artifacts, enhance model interpretability, and modestly improve downstream performance on tasks such as segmentation and depth estimation.

Model description

The base model is OpenCLIP-ViT-L-14-laion2B-s32B-b82K. With test-time registers, the model's internal representations are cleaner (see below). Using the environment from here and evaluating in bfloat16, ImageNet-1k zero-shot accuracy is 76.4% for both the original model and the variant with test-time registers. This model is intended to be used with this repo and requires transformers==4.45.1. It can also be used for fine-tuning or other downstream tasks.

Quick Start

from transformers import AutoModel
from PIL import Image
import torch

# Load the complete model with all components
model = AutoModel.from_pretrained(
    "amildravid4292/clip-vitl14-test-time-registers", 
    trust_remote_code=True
)

# Check what was loaded
print(f"Register tokens: {model.num_register_tokens}")
print(f"Neuron dict: {model.neuron_dict}")
print(f"Tokenizer available: {model.tokenizer is not None}")
print(f"Preprocessor available: {model.preprocessor is not None}")
print(f"Zero-shot classifier available: {model.zeroshot_classifier is not None}")

Usage Examples

Image Processing

from PIL import Image

# Load and preprocess image
image = Image.open("your_image.jpg")
image_tensor = model.preprocess_image(image).unsqueeze(0)

image_features = model.encode_image(
    image_tensor
)

# To run inference with the original model (no test-time registers),
# disable both the neuron intervention and the extra tokens
baseline_features = model.encode_image(
    image_tensor,
    neuron_dict=None,
    num_register_tokens=0
)

Text Processing

# Tokenize text
text = ["a photo of a cat", "a photo of a dog"]
text_tokens = model.tokenize(text)

# Encode text
text_features = model.encode_text(text_tokens)
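
Once image and text features are available, zero-shot scores follow from cosine similarity between the two. The sketch below is illustrative only: it uses random tensors as stand-ins for the outputs of `encode_image` and `encode_text`, the embedding size of 768 is an assumption based on the usual ViT-L/14 CLIP configuration, and `zero_shot_probs` is a hypothetical helper rather than part of the model's API. The scale of 100 mirrors the logit scaling used in the Complete Pipeline section.

```python
import torch

torch.manual_seed(0)

# Stand-ins for model.encode_image(...) and model.encode_text(...) outputs.
# The embedding size (768) is an assumption, not read from the model.
image_features = torch.randn(1, 768)
text_features = torch.randn(2, 768)


def zero_shot_probs(image_features, text_features, scale=100.0):
    """Return, for each image, softmax probabilities over the text prompts."""
    # L2-normalize so the dot product below is a cosine similarity
    img = image_features / image_features.norm(dim=-1, keepdim=True)
    txt = text_features / text_features.norm(dim=-1, keepdim=True)
    logits = scale * img @ txt.T  # shape: [num_images, num_prompts]
    return logits.softmax(dim=-1)


probs = zero_shot_probs(image_features, text_features)
print(probs)
```

With real features, each row of `probs` gives the model's relative preference among the prompts for one image.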

Complete Pipeline

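import torch
from torchvision.datasets import ImageNet
from tqdm import tqdm
from transformers import AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# load model
model = AutoModel.from_pretrained('amildravid4292/clip-vitl14-test-time-registers', trust_remote_code=True)
model = model.to(device).bfloat16()
classifier = model.zeroshot_classifier.to(device).bfloat16()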

# load data
imagenet_dataset = ImageNet(root='/datasets/ilsvrc/current', split='val', transform=model.preprocessor)
ground_truth_labels = [imagenet_dataset.targets[i] for i in range(len(imagenet_dataset))]
loader = torch.utils.data.DataLoader(imagenet_dataset, batch_size=100, num_workers=4, pin_memory=True, shuffle=False)

# run zero-shot classification
with torch.no_grad():
    correct = [0, 0]  # [number correct, number seen]
    for images, target in tqdm(loader):
        images = images.to(device).bfloat16()
        # keep labels as integers: bfloat16 cannot represent every class index up to 999 exactly
        target = target.to(device)

        # predict
        image_features = model.encode_image(images)
        image_features /= image_features.norm(dim=-1, keepdim=True)
        logits = 100. * image_features @ classifier

        pred = logits.argmax(dim=-1)
        correct[0] += (pred == target).sum().item()
        correct[1] += target.size(0)

# top-1 accuracy as a fraction
print(correct[0]/correct[1])

Advanced Usage

Custom Neuron Modifications

# Override the saved neuron configuration
custom_neuron_dict = {0: [10, 20, 30]}  # Modify neurons 10,20,30 in layer 0

image_features = model.encode_image(
    image_tensor,
    num_register_tokens=4,
    neuron_dict=custom_neuron_dict
)
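
Going by the comment in the example above, `neuron_dict` maps a layer index to a list of neuron indices in that layer. Before passing a hand-written dictionary to the model, it may be worth checking it for out-of-range entries. The helper below is a hypothetical sketch: `validate_neuron_dict` is not part of the model's API, and the default bounds (24 layers, 4096 neurons per layer) are assumptions based on the standard ViT-L configuration, so adjust them if the model's code indexes neurons differently.

```python
def validate_neuron_dict(neuron_dict, num_layers=24, neurons_per_layer=4096):
    """Check a {layer: [neuron indices]} mapping and return a cleaned copy.

    The bounds are assumptions (standard ViT-L sizes), not values read
    from the model.
    """
    cleaned = {}
    for layer, neurons in neuron_dict.items():
        if not isinstance(layer, int) or not 0 <= layer < num_layers:
            raise ValueError(f"layer index {layer!r} outside [0, {num_layers})")
        for n in neurons:
            if not isinstance(n, int) or not 0 <= n < neurons_per_layer:
                raise ValueError(
                    f"neuron index {n!r} in layer {layer} outside "
                    f"[0, {neurons_per_layer})"
                )
        # Drop duplicates and sort for a stable, readable configuration
        cleaned[layer] = sorted(set(neurons))
    return cleaned


custom_neuron_dict = validate_neuron_dict({0: [30, 10, 20, 10]})
print(custom_neuron_dict)
```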

Different Register Token Counts

# Use different number of register tokens
image_features = model.encode_image(
    image_tensor,
    num_register_tokens=8  # Override the default
)
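
Changing the register count changes the length of the token sequence the transformer processes. The arithmetic below is a rough sketch under stated assumptions: a 224x224 input, 14x14 patches, one class token, and registers added as extra tokens alongside the patch tokens. `sequence_length` is a hypothetical helper, not a function exposed by the model.

```python
def sequence_length(image_size=224, patch_size=14, num_register_tokens=0):
    """Token count: 1 class token + patch tokens + register tokens."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return 1 + patches_per_side ** 2 + num_register_tokens


for n in (0, 4, 8):
    print(n, sequence_length(num_register_tokens=n))
```

Under these assumptions the added cost is small, since a handful of registers is a small fraction of the 256 patch tokens.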

Model Details

Base Architecture: ViT-L/14
Training Data: LAION-2B subset

BibTeX entry and citation info

@misc{jiang2025visiontransformersdontneed,
      title={Vision Transformers Don't Need Trained Registers}, 
      author={Nick Jiang and Amil Dravid and Alexei Efros and Yossi Gandelsman},
      year={2025},
      eprint={2506.08010},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.08010}, 
}
