---
library_name: transformers
license: mit
pipeline_tag: image-feature-extraction
tags:
- clip
---

# OpenCLIP ViT-L/14 with Test-Time Register

Register tokens were introduced as learnable tokens in [Vision Transformers Need Registers](https://arxiv.org/abs/2309.16588) to mitigate artifacts in the intermediate feature maps of ViTs.
In [Vision Transformers Don't Need *Trained* Registers](https://arxiv.org/abs/2506.08010), we introduced a training-free method for creating registers. These *test-time registers* serve a similar purpose
to the original trained registers, but can be added post hoc to any ViT to mitigate artifacts, enhance model interpretability, and modestly improve downstream performance on tasks such as segmentation and depth estimation.

## Model description

The base model is [OpenCLIP-ViT-L-14-laion2B-s32B-b82K](https://huggingface.co/laion/CLIP-ViT-L-14-laion2B-s32B-b82K). With test-time registers, the model's internal representations
are cleaner (see the attention maps and patch norms below). Using the environment from [here](https://github.com/nickjiang2378/test-time-registers/blob/main/environment.yml) and evaluating in bfloat16 yields an ImageNet-1k zero-shot accuracy of 76.4 for both the original model and the variant with test-time registers.
This model is intended to be used with this [repo](https://github.com/nickjiang2378/test-time-registers) and requires `transformers==4.45.1`. The model can also be used for fine-tuning or other downstream tasks.

<img src="https://huggingface.co/amildravid4292/clip-vitl14-test-time-registers/resolve/main/vitl14_attention.png" alt="drawing" width="600"/>
<img src="https://huggingface.co/amildravid4292/clip-vitl14-test-time-registers/resolve/main/vitl14_patchnorms.png" alt="drawing" width="600"/>
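Since the model relies on custom remote code, it can help to confirm the pinned library version before loading. A minimal, optional sanity check in Python (the version is the one noted above):

```python
import transformers

# This model card was tested with transformers==4.45.1;
# other versions may not load the custom remote code as expected.
assert transformers.__version__ == "4.45.1", (
    f"Expected transformers==4.45.1, found {transformers.__version__}"
)
```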




## Quick Start

```python
from transformers import AutoModel
from PIL import Image
import torch

# Load the complete model with all components
model = AutoModel.from_pretrained(
    "amildravid4292/clip-vitl14-test-time-registers", 
    trust_remote_code=True
)

# Check what was loaded
print(f"Register tokens: {model.num_register_tokens}")
print(f"Neuron dict: {model.neuron_dict}")
print(f"Tokenizer available: {model.tokenizer is not None}")
print(f"Preprocessor available: {model.preprocessor is not None}")
print(f"Zero-shot classifier available: {model.zeroshot_classifier is not None}")
```
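For GPU inference, the model can be moved to a device and cast to bfloat16, matching the evaluation setup described above (a sketch; adjust the device to your hardware):

```python
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).bfloat16()
```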

## Usage Examples



### Image Processing
```python
from PIL import Image

# Load and preprocess an image
image = Image.open("your_image.jpg")
image_tensor = model.preprocess_image(image).unsqueeze(0)

# Encode with test-time registers (default configuration)
image_features = model.encode_image(image_tensor)

# Run inference with the original model, i.e. without test-time registers
original_features = model.encode_image(
    image_tensor,
    neuron_dict=None,
    num_register_tokens=0
)
```
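To gauge how much the test-time registers shift the pooled image embedding, the two variants can be compared directly. A minimal sketch, reusing `image_tensor` from the block above:

```python
import torch.nn.functional as F

# Features with the default test-time registers vs. the unmodified baseline
feats_registers = model.encode_image(image_tensor)
feats_baseline = model.encode_image(image_tensor, neuron_dict=None, num_register_tokens=0)

# Cosine similarity of the two pooled embeddings
cos = F.cosine_similarity(feats_registers, feats_baseline, dim=-1)
print(f"Cosine similarity with vs. without registers: {cos.item():.4f}")
```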

### Text Processing
```python
# Tokenize text
text = ["a photo of a cat", "a photo of a dog"]
text_tokens = model.tokenize(text)

# Encode text
text_features = model.encode_text(text_tokens)
```
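
The image and text features can also be combined for zero-shot classification without the prebuilt classifier, following standard CLIP usage. A minimal sketch, reusing `image_tensor` from the image-processing example:

```python
import torch

with torch.no_grad():
    image_features = model.encode_image(image_tensor)
    text_features = model.encode_text(model.tokenize(["a photo of a cat", "a photo of a dog"]))

# Normalize and score: softmax over scaled cosine similarities
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(probs)  # per-prompt probabilities for the image
```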



### Complete Pipeline
```python
import torch
from tqdm import tqdm
from torchvision.datasets import ImageNet
from transformers import AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model and the bundled zero-shot classifier
model = AutoModel.from_pretrained('amildravid4292/clip-vitl14-test-time-registers', trust_remote_code=True)
model = model.to(device).bfloat16()
classifier = model.zeroshot_classifier.to(device).bfloat16()

# Load data (adjust the ImageNet root to your local path)
imagenet_dataset = ImageNet(root='/datasets/ilsvrc/current', split='val', transform=model.preprocessor)
loader = torch.utils.data.DataLoader(imagenet_dataset, batch_size=100, num_workers=4, pin_memory=True, shuffle=False)

# Run zero-shot classification
with torch.no_grad():
    correct, total = 0, 0
    for images, target in tqdm(loader):
        images = images.to(device).bfloat16()
        target = target.to(device)

        # Predict: encode, normalize, and score against the classifier weights
        image_features = model.encode_image(images)
        image_features /= image_features.norm(dim=-1, keepdim=True)
        logits = 100. * image_features @ classifier

        pred = logits.argmax(dim=-1)
        correct += (pred == target).sum().item()
        total += target.size(0)

print(f"Top-1 accuracy: {correct / total:.4f}")
```

## Advanced Usage

### Custom Neuron Modifications
```python
# Override the saved neuron configuration
custom_neuron_dict = {0: [10, 20, 30]}  # Modify neurons 10,20,30 in layer 0

image_features = model.encode_image(
    image_tensor,
    num_register_tokens=4,
    neuron_dict=custom_neuron_dict
)
```

### Different Register Token Counts
```python
# Use different number of register tokens
image_features = model.encode_image(
    image_tensor,
    num_register_tokens=8  # Override the default
)
```

## Model Details

- **Base Architecture**: ViT-L/14
- **Training Data**: LAION-2B subset


### BibTeX entry and citation info

```bibtex
@misc{jiang2025visiontransformersdontneed,
      title={Vision Transformers Don't Need Trained Registers}, 
      author={Nick Jiang and Amil Dravid and Alexei Efros and Yossi Gandelsman},
      year={2025},
      eprint={2506.08010},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.08010}, 
}
```