---
library_name: transformers
license: mit
pipeline_tag: image-feature-extraction
tags:
- clip
---
# OpenCLIP ViT-L/14 with Test-Time Register
Register tokens in ViTs were introduced as learnable tokens in [Vision Transformers Need Registers](https://arxiv.org/abs/2309.16588) to mitigate artifacts in intermediate feature maps.
In [Vision Transformers Don't Need *Trained* Registers](https://arxiv.org/abs/2506.08010), we introduced a training-free method for creating registers. These *test-time registers* serve a similar purpose to the original trained registers, but can be added post hoc to any ViT to mitigate artifacts, improve model interpretability, and modestly boost downstream performance on tasks such as segmentation and depth estimation.
## Model description
The base model is [OpenCLIP-ViT-L-14-laion2B-s32B-b82K](https://huggingface.co/laion/CLIP-ViT-L-14-laion2B-s32B-b82K). With test-time registers, the model's internal representations
are cleaner (see the attention and patch-norm visualizations below). Using the environment from [here](https://github.com/nickjiang2378/test-time-registers/blob/main/environment.yml) and evaluating in bfloat16 yields an ImageNet-1k zero-shot accuracy of 76.4 for both the original model and the variant with test-time registers.
This model is intended to be used with this [repo](https://github.com/nickjiang2378/test-time-registers) and transformers==4.45.1. It can also be used for fine-tuning or other downstream tasks.
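A quick sanity check for the pinned dependency, as a minimal sketch (the version below is the one recommended above):
```python
# Verify the recommended transformers version (4.45.1, per this model card).
import transformers

assert transformers.__version__ == "4.45.1", f"found transformers=={transformers.__version__}"
```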
<img src="https://huggingface.co/amildravid4292/clip-vitl14-test-time-registers/resolve/main/vitl14_attention.png" alt="drawing" width="600"/>
<img src="https://huggingface.co/amildravid4292/clip-vitl14-test-time-registers/resolve/main/vitl14_patchnorms.png" alt="drawing" width="600"/>
## Quick Start
```python
from transformers import AutoModel
from PIL import Image
import torch
# Load the complete model with all components
model = AutoModel.from_pretrained(
    "amildravid4292/clip-vitl14-test-time-registers",
    trust_remote_code=True,
)
# Check what was loaded
print(f"Register tokens: {model.num_register_tokens}")
print(f"Neuron dict: {model.neuron_dict}")
print(f"Tokenizer available: {model.tokenizer is not None}")
print(f"Preprocessor available: {model.preprocessor is not None}")
print(f"Zero-shot classifier available: {model.zeroshot_classifier is not None}")
```
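If you plan to run the zero-shot evaluation below, the model and its bundled classifier can be moved to an accelerator and cast to bfloat16, the precision the reported numbers were measured in. A minimal sketch, assuming a CUDA device is available:
```python
import torch

# Move the model and its bundled zero-shot classifier to the GPU in bfloat16,
# matching the precision used for the reported IN-1k zero-shot accuracy.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).bfloat16()
classifier = model.zeroshot_classifier.to(device).bfloat16()
```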
## Usage Examples
### Image Processing
```python
from PIL import Image
# Load and preprocess image
image = Image.open("your_image.jpg")
image_tensor = model.preprocess_image(image).unsqueeze(0)

# Encode with test-time registers (the default behavior)
image_features = model.encode_image(image_tensor)

# To run inference with the original model, without test-time registers:
image_features = model.encode_image(
    image_tensor,
    neuron_dict=None,
    num_register_tokens=0,
)
```
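To gauge how much the test-time registers change the pooled image embedding, the two outputs above can be compared directly. A small sketch using cosine similarity (reusing `image_tensor` from the snippet above):
```python
import torch
import torch.nn.functional as F

with torch.no_grad():
    feats_registers = model.encode_image(image_tensor)
    feats_original = model.encode_image(image_tensor, neuron_dict=None, num_register_tokens=0)

# Cosine similarity between the embeddings with and without test-time registers;
# values close to 1.0 mean the global image representation is largely preserved.
print(F.cosine_similarity(feats_registers, feats_original))
```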
### Text Processing
```python
# Tokenize text
text = ["a photo of a cat", "a photo of a dog"]
text_tokens = model.tokenize(text)
# Encode text
text_features = model.encode_text(text_tokens)
```
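The image and text features can then be combined for CLIP-style zero-shot scoring. A minimal sketch, using the same `100.` logit scaling as the evaluation pipeline below:
```python
import torch

with torch.no_grad():
    image_features = model.encode_image(image_tensor)
    text_features = model.encode_text(model.tokenize(["a photo of a cat", "a photo of a dog"]))

# Normalize both modalities, then softmax over the prompts to get per-image probabilities.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(probs)  # e.g. tensor([[p_cat, p_dog]])
```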
### Complete Pipeline
```python
import torch
from torchvision.datasets import ImageNet
from tqdm import tqdm
from transformers import AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# load model
model = AutoModel.from_pretrained('amildravid4292/clip-vitl14-test-time-registers', trust_remote_code=True)
model = model.to(device).bfloat16()
classifier = model.zeroshot_classifier.to(device).bfloat16()

# load data
imagenet_dataset = ImageNet(root='/datasets/ilsvrc/current', split='val', transform=model.preprocessor)
loader = torch.utils.data.DataLoader(imagenet_dataset, batch_size=100, num_workers=4, pin_memory=True, shuffle=False)

# run zero-shot classification
with torch.no_grad():
    correct = [0, 0]  # [num_correct, num_total]
    for images, target in tqdm(loader):
        images = images.to(device).bfloat16()
        target = target.to(device)  # keep labels as integers

        # predict
        image_features = model.encode_image(images)
        image_features /= image_features.norm(dim=-1, keepdim=True)
        logits = 100. * image_features @ classifier
        pred = logits.argmax(dim=-1)

        correct[0] += (pred == target).sum().item()
        correct[1] += target.size(0)

print(correct[0] / correct[1])  # top-1 accuracy
```
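To reproduce the baseline score of the original model (no test-time registers) with the same loop, only the `encode_image` call needs to change, mirroring the Image Processing example above:
```python
# Inside the evaluation loop: disable test-time registers to score the unmodified model.
image_features = model.encode_image(images, neuron_dict=None, num_register_tokens=0)
```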
## Advanced Usage
### Custom Neuron Modifications
```python
# Override the saved neuron configuration
custom_neuron_dict = {0: [10, 20, 30]}  # modify neurons 10, 20, 30 in layer 0
image_features = model.encode_image(
    image_tensor,
    num_register_tokens=4,
    neuron_dict=custom_neuron_dict,
)
```
### Different Register Token Counts
```python
# Use different number of register tokens
image_features = model.encode_image(
    image_tensor,
    num_register_tokens=8,  # override the default
)
```
## Model Details
- **Base Architecture**: ViT-L/14
- **Training Data**: LAION-2B subset
### BibTeX entry and citation info
```bibtex
@misc{jiang2025visiontransformersdontneed,
      title={Vision Transformers Don't Need Trained Registers},
      author={Nick Jiang and Amil Dravid and Alexei Efros and Yossi Gandelsman},
      year={2025},
      eprint={2506.08010},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.08010},
}
```