---
library_name: transformers
license: mit
pipeline_tag: image-feature-extraction
tags:
- clip
---
# OpenCLIP ViT-L/14 with Test-Time Registers
Register tokens were introduced as learnable tokens in [Vision Transformers Need Registers](https://arxiv.org/abs/2309.16588) to mitigate artifacts in the intermediate feature maps of ViTs.
In [Vision Transformers Don't Need *Trained* Registers](https://arxiv.org/abs/2506.08010), we introduced a training-free method to create registers. These *test-time registers* serve the same purpose
as the original trained registers, but can be added post hoc to any ViT to mitigate artifacts, enhance model interpretability, and modestly improve performance on downstream tasks such as segmentation and depth estimation.
## Model description
The base model is [OpenCLIP-ViT-L-14-laion2B-s32B-b82K](https://huggingface.co/laion/CLIP-ViT-L-14-laion2B-s32B-b82K). With test-time registers, the model's internal representations
are cleaner (see below). Using the environment from [here](https://github.com/nickjiang2378/test-time-registers/blob/main/environment.yml) and evaluating in bfloat16 yields 76.4% ImageNet-1k zero-shot accuracy for both the original model and the variant with test-time registers.
This model is intended to be used with this [repo](https://github.com/nickjiang2378/test-time-registers) and requires `transformers==4.45.1`. It can also be used for fine-tuning or other downstream tasks.
<img src="https://huggingface.co/amildravid4292/clip-vitl14-test-time-registers/resolve/main/vitl14_attention.png" alt="drawing" width="600"/>
<img src="https://huggingface.co/amildravid4292/clip-vitl14-test-time-registers/resolve/main/vitl14_patchnorms.png" alt="drawing" width="600"/>
## Quick Start
```python
from transformers import AutoModel
from PIL import Image
import torch
# Load the complete model with all components
model = AutoModel.from_pretrained(
    "amildravid4292/clip-vitl14-test-time-registers",
    trust_remote_code=True,
)
# Check what was loaded
print(f"Register tokens: {model.num_register_tokens}")
print(f"Neuron dict: {model.neuron_dict}")
print(f"Tokenizer available: {model.tokenizer is not None}")
print(f"Preprocessor available: {model.preprocessor is not None}")
print(f"Zero-shot classifier available: {model.zeroshot_classifier is not None}")
```
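For GPU inference, the model can optionally be moved to a device and cast to bfloat16 to match the evaluation setup described above. A minimal sketch; any preprocessed inputs must be moved and cast the same way:

```python
import torch

# Optional: move the model to GPU and, to match the bfloat16 evaluation above,
# cast it to bfloat16 (inputs must then be moved and cast the same way)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).bfloat16()
```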
## Usage Examples
### Image Processing
```python
from PIL import Image
# Load and preprocess the image
image = Image.open("your_image.jpg")
image_tensor = model.preprocess_image(image).unsqueeze(0)

# Encode with test-time registers (the default)
image_features = model.encode_image(image_tensor)

# To run inference with the original model, without test-time registers:
image_features = model.encode_image(
    image_tensor,
    neuron_dict=None,
    num_register_tokens=0,
)
```
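Multiple images can be batched by stacking the preprocessed tensors. A minimal sketch, assuming `preprocess_image` returns a `(C, H, W)` tensor as the `unsqueeze(0)` above suggests (the file paths are placeholders):

```python
import torch
from PIL import Image

# Hedged sketch: batch several images through the encoder at once
images = [Image.open(p) for p in ["img1.jpg", "img2.jpg"]]  # placeholder paths
batch = torch.stack([model.preprocess_image(im) for im in images])
batch_features = model.encode_image(batch)  # one row of features per image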
### Text Processing
```python
# Tokenize text
text = ["a photo of a cat", "a photo of a dog"]
text_tokens = model.tokenize(text)
# Encode text
text_features = model.encode_text(text_tokens)
```
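The two encoders can be combined for CLIP-style zero-shot scoring. A minimal sketch, reusing `image_tensor` and `text_tokens` from the examples above and the same normalization and 100x scaling as the pipeline below:

```python
import torch

# Hedged sketch: CLIP-style image-text matching with the two encoders above
with torch.no_grad():
    image_features = model.encode_image(image_tensor)
    text_features = model.encode_text(text_tokens)

# Normalize, then softmax over scaled cosine similarities
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(probs)  # probability of each caption matching the image
```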
### Complete Pipeline
```python
# Self-contained ImageNet zero-shot evaluation
import torch
from torchvision.datasets import ImageNet
from tqdm import tqdm
from transformers import AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model and the bundled zero-shot classifier
model = AutoModel.from_pretrained(
    "amildravid4292/clip-vitl14-test-time-registers",
    trust_remote_code=True,
)
model = model.to(device).bfloat16()
classifier = model.zeroshot_classifier.to(device).bfloat16()

# Load the data
imagenet_dataset = ImageNet(root="/datasets/ilsvrc/current", split="val", transform=model.preprocessor)
loader = torch.utils.data.DataLoader(imagenet_dataset, batch_size=100, num_workers=4, pin_memory=True, shuffle=False)

# Run zero-shot classification
with torch.no_grad():
    correct, total = 0, 0
    for images, target in tqdm(loader):
        images = images.to(device).bfloat16()
        target = target.to(device)  # keep integer labels for the comparison below

        # Predict
        image_features = model.encode_image(images)
        image_features /= image_features.norm(dim=-1, keepdim=True)
        logits = 100.0 * image_features @ classifier
        pred = logits.argmax(dim=-1)
        correct += (pred == target).sum().item()
        total += target.size(0)

print(correct / total)
```
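The same loop doubles as an ablation: swapping the `encode_image` call as below disables the test-time registers (as in the image-processing example above), which should land at essentially the same 76.4% reported in the model description.

```python
# Inside the evaluation loop, disable the test-time registers to score the original model
image_features = model.encode_image(images, neuron_dict=None, num_register_tokens=0)
```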
## Advanced Usage
### Custom Neuron Modifications
```python
# Override the saved neuron configuration
custom_neuron_dict = {0: [10, 20, 30]}  # modify neurons 10, 20, 30 in layer 0
image_features = model.encode_image(
    image_tensor,
    num_register_tokens=4,
    neuron_dict=custom_neuron_dict,
)
```
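Rather than replacing the saved configuration outright, you can also extend it. A hedged sketch, assuming `model.neuron_dict` is a plain `{layer: [neuron indices]}` mapping as in the example above:

```python
# Hedged sketch: merge custom entries into the saved configuration
# (assumes model.neuron_dict is a plain {layer: [neuron indices]} mapping)
merged_dict = dict(model.neuron_dict)
merged_dict[0] = sorted(set(merged_dict.get(0, [])) | {10, 20, 30})
image_features = model.encode_image(image_tensor, neuron_dict=merged_dict)
```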
### Different Register Token Counts
```python
# Use a different number of register tokens
image_features = model.encode_image(
    image_tensor,
    num_register_tokens=8,  # override the default
)
```
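To get a feel for how much the register count matters, a short sweep can compare each setting against the original model. A minimal sketch using only the calls shown above:

```python
import torch.nn.functional as F

# Hedged sketch: sweep register counts and track drift from the original features
base = model.encode_image(image_tensor, neuron_dict=None, num_register_tokens=0)
for n in (1, 2, 4, 8):
    feats = model.encode_image(image_tensor, num_register_tokens=n)
    sim = F.cosine_similarity(base, feats).item()
    print(f"{n} register token(s): cosine similarity to original = {sim:.4f}")
```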
## Model Details
- **Base Architecture**: ViT-L/14
- **Training Data**: LAION-2B subset
### BibTeX entry and citation info
```bibtex
@misc{jiang2025visiontransformersdontneed,
      title={Vision Transformers Don't Need Trained Registers},
      author={Nick Jiang and Amil Dravid and Alexei Efros and Yossi Gandelsman},
      year={2025},
      eprint={2506.08010},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.08010},
}
```