---
library_name: transformers
license: mit
pipeline_tag: image-feature-extraction
tags:
- clip
---
# OpenCLIP ViT-L/14 with Test-Time Register
Register tokens in ViTs were introduced as learnable tokens in [Vision Transformers Need Registers](https://arxiv.org/abs/2309.16588) to mitigate artifacts in intermediate feature maps.
In [Vision Transformers Don't Need *Trained* Registers](https://arxiv.org/abs/2506.08010), we introduced a training-free method for creating registers. These *test-time registers* serve a similar purpose to the original trained registers, but can be added post hoc to any ViT to mitigate artifacts, improve model interpretability, and modestly boost downstream performance on tasks such as segmentation and depth estimation.
## Model description
The base model is [OpenCLIP-ViT-L-14-laion2B-s32B-b82K](https://huggingface.co/laion/CLIP-ViT-L-14-laion2B-s32B-b82K). With test-time registers, the model's internal representations
are cleaner (see the attention and patch-norm visualizations below). Using the environment from [here](https://github.com/nickjiang2378/test-time-registers/blob/main/environment.yml) and evaluating in bfloat16 yields an ImageNet-1k zero-shot accuracy of 76.4 for both the original model and the variant with test-time registers.
This model is intended to be used with this [repo](https://github.com/nickjiang2378/test-time-registers) and transformers==4.45.1. It can also be used for fine-tuning or other downstream tasks.
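A quick sanity check for the pinned dependency, as a minimal sketch (the version below is the one recommended above):
```python
# Verify the recommended transformers version (4.45.1, per this model card).
import transformers

assert transformers.__version__ == "4.45.1", f"found transformers=={transformers.__version__}"
```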
<img src="https://huggingface.co/amildravid4292/clip-vitl14-test-time-registers/resolve/main/vitl14_attention.png" alt="drawing" width="600"/>
<img src="https://huggingface.co/amildravid4292/clip-vitl14-test-time-registers/resolve/main/vitl14_patchnorms.png" alt="drawing" width="600"/>
## Quick Start
```python
from transformers import AutoModel
from PIL import Image
import torch
# Load the complete model with all components
model = AutoModel.from_pretrained(
    "amildravid4292/clip-vitl14-test-time-registers",
    trust_remote_code=True,
)
# Check what was loaded
print(f"Register tokens: {model.num_register_tokens}")
print(f"Neuron dict: {model.neuron_dict}")
print(f"Tokenizer available: {model.tokenizer is not None}")
print(f"Preprocessor available: {model.preprocessor is not None}")
print(f"Zero-shot classifier available: {model.zeroshot_classifier is not None}")
```
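If you plan to run the zero-shot evaluation below, the model and its bundled classifier can be moved to an accelerator and cast to bfloat16, the precision the reported numbers were measured in. A minimal sketch, assuming a CUDA device is available:
```python
import torch

# Move the model and its bundled zero-shot classifier to the GPU in bfloat16,
# matching the precision used for the reported IN-1k zero-shot accuracy.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).bfloat16()
classifier = model.zeroshot_classifier.to(device).bfloat16()
```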
## Usage Examples
### Image Processing
```python
from PIL import Image
# Load and preprocess image
image = Image.open("your_image.jpg")
image_tensor = model.preprocess_image(image).unsqueeze(0)

# Encode with test-time registers (the default behavior)
image_features = model.encode_image(image_tensor)

# To run inference with the original model, without test-time registers:
image_features = model.encode_image(
    image_tensor,
    neuron_dict=None,
    num_register_tokens=0,
)
```
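To gauge how much the test-time registers change the pooled image embedding, the two outputs above can be compared directly. A small sketch using cosine similarity (reusing `image_tensor` from the snippet above):
```python
import torch
import torch.nn.functional as F

with torch.no_grad():
    feats_registers = model.encode_image(image_tensor)
    feats_original = model.encode_image(image_tensor, neuron_dict=None, num_register_tokens=0)

# Cosine similarity between the embeddings with and without test-time registers;
# values close to 1.0 mean the global image representation is largely preserved.
print(F.cosine_similarity(feats_registers, feats_original))
```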
### Text Processing
```python
# Tokenize text
text = ["a photo of a cat", "a photo of a dog"]
text_tokens = model.tokenize(text)
# Encode text
text_features = model.encode_text(text_tokens)
```
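The image and text features can then be combined for CLIP-style zero-shot scoring. A minimal sketch, using the same `100.` logit scaling as the evaluation pipeline below:
```python
import torch

with torch.no_grad():
    image_features = model.encode_image(image_tensor)
    text_features = model.encode_text(model.tokenize(["a photo of a cat", "a photo of a dog"]))

# Normalize both modalities, then softmax over the prompts to get per-image probabilities.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(probs)  # e.g. tensor([[p_cat, p_dog]])
```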
### Complete Pipeline
```python
import torch
from torchvision.datasets import ImageNet
from tqdm import tqdm
from transformers import AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# load model
model = AutoModel.from_pretrained('amildravid4292/clip-vitl14-test-time-registers', trust_remote_code=True)
model = model.to(device).bfloat16()
classifier = model.zeroshot_classifier.to(device).bfloat16()

# load data
imagenet_dataset = ImageNet(root='/datasets/ilsvrc/current', split='val', transform=model.preprocessor)
loader = torch.utils.data.DataLoader(imagenet_dataset, batch_size=100, num_workers=4, pin_memory=True, shuffle=False)

# run zero-shot classification
with torch.no_grad():
    correct = [0, 0]  # [num_correct, num_total]
    for images, target in tqdm(loader):
        images = images.to(device).bfloat16()
        target = target.to(device)  # keep labels as integers

        # predict
        image_features = model.encode_image(images)
        image_features /= image_features.norm(dim=-1, keepdim=True)
        logits = 100. * image_features @ classifier
        pred = logits.argmax(dim=-1)

        correct[0] += (pred == target).sum().item()
        correct[1] += target.size(0)

print(correct[0] / correct[1])  # top-1 accuracy
```
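To reproduce the baseline score of the original model (no test-time registers) with the same loop, only the `encode_image` call needs to change, mirroring the Image Processing example above:
```python
# Inside the evaluation loop: disable test-time registers to score the unmodified model.
image_features = model.encode_image(images, neuron_dict=None, num_register_tokens=0)
```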
## Advanced Usage
### Custom Neuron Modifications
```python
# Override the saved neuron configuration
custom_neuron_dict = {0: [10, 20, 30]}  # modify neurons 10, 20, 30 in layer 0
image_features = model.encode_image(
    image_tensor,
    num_register_tokens=4,
    neuron_dict=custom_neuron_dict,
)
```
### Different Register Token Counts
```python
# Use different number of register tokens
image_features = model.encode_image(
    image_tensor,
    num_register_tokens=8,  # override the default
)
```
## Model Details
- **Base Architecture**: ViT-L/14
- **Training Data**: LAION-2B subset
### BibTeX entry and citation info
```bibtex
@misc{jiang2025visiontransformersdontneed,
      title={Vision Transformers Don't Need Trained Registers},
      author={Nick Jiang and Amil Dravid and Alexei Efros and Yossi Gandelsman},
      year={2025},
      eprint={2506.08010},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.08010},
}
```