---
library_name: transformers
license: mit
pipeline_tag: image-feature-extraction
tags:
- clip
---

# OpenCLIP ViT-L/14 with Test-Time Register

Register tokens were introduced as learnable tokens in [Vision Transformers Need Registers](https://arxiv.org/abs/2309.16588) to mitigate artifacts in the intermediate feature maps of ViTs.

In [Vision Transformers Don't Need *Trained* Registers](https://arxiv.org/abs/2506.08010), we introduced a training-free method for creating registers. These *test-time registers* serve a similar purpose to the original trained registers, but can be added post hoc to any ViT to mitigate artifacts, enhance model interpretability, and modestly improve downstream performance on tasks such as segmentation and depth estimation.
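For intuition, here is a rough, hypothetical sketch of the general idea: append an extra token to the patch sequence and, at selected layers, redirect the activations of a few chosen neurons into it so that high-norm artifacts collect in the register rather than in image patches. This is an illustration only, not the implementation used here (see the [repo](https://github.com/nickjiang2378/test-time-registers) for the real mechanism).

```python
import torch

# Rough, hypothetical sketch of a test-time register (illustration only).
def append_register(tokens: torch.Tensor) -> torch.Tensor:
    """Append one zero-initialized register token to a (batch, seq, dim) sequence."""
    b, _, d = tokens.shape
    register = torch.zeros(b, 1, d, dtype=tokens.dtype, device=tokens.device)
    return torch.cat([tokens, register], dim=1)

def redirect_neurons(hidden: torch.Tensor, neuron_indices: list) -> torch.Tensor:
    """At a chosen layer, move the selected neurons' activations from the patch
    tokens onto the register token (assumed to be the last token)."""
    hidden = hidden.clone()
    hidden[:, -1, neuron_indices] = hidden[:, :-1, neuron_indices].amax(dim=1)
    hidden[:, :-1, neuron_indices] = 0.0
    return hidden
```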
## Model description

The base model is [OpenCLIP-ViT-L-14-laion2B-s32B-b82K](https://huggingface.co/laion/CLIP-ViT-L-14-laion2B-s32B-b82K). With test-time registers, the model's internal representations are cleaner (see the figures below). Evaluating in bfloat16 with the environment from [here](https://github.com/nickjiang2378/test-time-registers/blob/main/environment.yml) gives an ImageNet-1k zero-shot accuracy of 76.4 for both the original model and the variant with test-time registers.

This model is intended to be used with this [repo](https://github.com/nickjiang2378/test-time-registers); use transformers==4.45.1. The model can also be used for fine-tuning or other downstream tasks.

<img src="https://huggingface.co/amildravid4292/clip-vitl14-test-time-registers/resolve/main/vitl14_attention.png" alt="Attention maps with and without test-time registers" width="600"/>
<img src="https://huggingface.co/amildravid4292/clip-vitl14-test-time-registers/resolve/main/vitl14_patchnorms.png" alt="Patch norms with and without test-time registers" width="600"/>

## Quick Start

```python
from transformers import AutoModel
from PIL import Image
import torch

# Load the complete model with all components
model = AutoModel.from_pretrained(
    "amildravid4292/clip-vitl14-test-time-registers",
    trust_remote_code=True
)

# Check what was loaded
print(f"Register tokens: {model.num_register_tokens}")
print(f"Neuron dict: {model.neuron_dict}")
print(f"Tokenizer available: {model.tokenizer is not None}")
print(f"Preprocessor available: {model.preprocessor is not None}")
print(f"Zero-shot classifier available: {model.zeroshot_classifier is not None}")
```

## Usage Examples

### Image Processing

```python
from PIL import Image

# Load and preprocess the image
image = Image.open("your_image.jpg")
image_tensor = model.preprocess_image(image).unsqueeze(0)

# Encode with test-time registers (the default behavior)
image_features = model.encode_image(image_tensor)

# To run inference with the original model, without test-time registers:
image_features = model.encode_image(
    image_tensor,
    neuron_dict=None,
    num_register_tokens=0
)
```
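
To process several images at once, you can stack the preprocessed tensors into a batch. A small sketch, assuming `preprocess_image` returns a single `(C, H, W)` tensor per image (as the `unsqueeze(0)` above suggests); the file paths are placeholders:

```python
import torch
from PIL import Image

# Hypothetical file paths; replace with your own images.
paths = ["image1.jpg", "image2.jpg", "image3.jpg"]

# Stack per-image tensors into a (batch, C, H, W) tensor and encode them together.
batch = torch.stack([model.preprocess_image(Image.open(p)) for p in paths])
batch_features = model.encode_image(batch)
```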

### Text Processing

```python
# Tokenize text
text = ["a photo of a cat", "a photo of a dog"]
text_tokens = model.tokenize(text)

# Encode text
text_features = model.encode_text(text_tokens)
```
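
The image and text features can be combined for zero-shot prediction on a single image. A minimal sketch, reusing `image_features` and `text_features` from the snippets above and assuming both are `(batch, dim)` tensors:

```python
import torch

# Normalize the features and compare them with cosine similarity.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# Higher similarity = better match between the image and each text prompt.
similarity = (image_features @ text_features.T).softmax(dim=-1)
print(similarity)  # probabilities over ["a photo of a cat", "a photo of a dog"]
```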

### Complete Pipeline

```python
import torch
from torchvision.datasets import ImageNet
from tqdm import tqdm
from transformers import AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# load model
model = AutoModel.from_pretrained('amildravid4292/clip-vitl14-test-time-registers', trust_remote_code=True)
model = model.to(device).bfloat16()
classifier = model.zeroshot_classifier.to(device).bfloat16()

# load data
imagenet_dataset = ImageNet(root='/datasets/ilsvrc/current', split='val', transform=model.preprocessor)
loader = torch.utils.data.DataLoader(imagenet_dataset, batch_size=100, num_workers=4, pin_memory=True, shuffle=False)

# run zero-shot classification
with torch.no_grad():
    correct, total = 0, 0
    for images, target in tqdm(loader):
        images = images.to(device).bfloat16()
        target = target.to(device)  # keep the class indices as integers

        # predict
        image_features = model.encode_image(images)
        image_features /= image_features.norm(dim=-1, keepdim=True)
        logits = 100. * image_features @ classifier

        pred = logits.argmax(dim=-1)
        correct += (pred == target).sum().item()
        total += target.size(0)

# top-1 accuracy
print(correct / total)
```
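
To reproduce the comparison against the original model reported above, the same loop can be run with test-time registers disabled, as in the Image Processing section. A sketch of the only change, wrapped as a hypothetical helper:

```python
# Score the original model (no test-time registers) with the same pipeline:
# reuse the loop above, but encode through this helper instead.
def encode_without_registers(model, images):
    # neuron_dict=None and num_register_tokens=0 recover the unmodified model.
    return model.encode_image(images, neuron_dict=None, num_register_tokens=0)
```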

## Advanced Usage

### Custom Neuron Modifications

```python
# Override the saved neuron configuration
custom_neuron_dict = {0: [10, 20, 30]}  # modify neurons 10, 20, 30 in layer 0

image_features = model.encode_image(
    image_tensor,
    num_register_tokens=4,
    neuron_dict=custom_neuron_dict
)
```
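
Rather than writing a dictionary from scratch, you can also start from the configuration saved with the checkpoint. A sketch, assuming `model.neuron_dict` uses the same `{layer: [neuron indices]}` format as `custom_neuron_dict` above:

```python
# Copy the saved configuration and add a few extra neurons to layer 0.
extended_neuron_dict = {layer: list(neurons) for layer, neurons in model.neuron_dict.items()}
extended_neuron_dict[0] = sorted(set(extended_neuron_dict.get(0, [])) | {10, 20, 30})

image_features = model.encode_image(
    image_tensor,
    neuron_dict=extended_neuron_dict
)
```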

### Different Register Token Counts

```python
# Use a different number of register tokens
image_features = model.encode_image(
    image_tensor,
    num_register_tokens=8  # override the default
)
```

## Model Details

- **Base Architecture**: ViT-L/14
- **Training Data**: LAION-2B subset

### BibTeX entry and citation info

```bibtex
@misc{jiang2025visiontransformersdontneed,
      title={Vision Transformers Don't Need Trained Registers},
      author={Nick Jiang and Amil Dravid and Alexei Efros and Yossi Gandelsman},
      year={2025},
      eprint={2506.08010},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.08010},
}
```