---
library_name: transformers
license: mit
pipeline_tag: image-feature-extraction
tags:
- clip
---

# OpenCLIP ViT-L/14 with Test-Time Registers

Register tokens were introduced as learnable tokens in [Vision Transformers Need Registers](https://arxiv.org/abs/2309.16588) to mitigate artifacts in the intermediate feature maps of ViTs. In [Vision Transformers Don't Need *Trained* Registers](https://arxiv.org/abs/2506.08010), we introduced a training-free method to create registers. These *test-time registers* serve a similar purpose to the original trained registers, but can be added post hoc to any ViT to mitigate artifacts, enhance model interpretability, and modestly improve downstream performance on tasks such as segmentation and depth estimation.

## Model description

The base model is [OpenCLIP-ViT-L-14-laion2B-s32B-b82K](https://huggingface.co/laion/CLIP-ViT-L-14-laion2B-s32B-b82K). With test-time registers, the model's internal representations are cleaner (see the visualizations below). Using the environment from [here](https://github.com/nickjiang2378/test-time-registers/blob/main/environment.yml) and evaluating in bfloat16, both the original model and the variant with test-time registers reach 76.4% ImageNet-1k zero-shot accuracy.

This model is intended to be used with this [repo](https://github.com/nickjiang2378/test-time-registers). Use `transformers==4.45.1`. The model can also be used for fine-tuning or other downstream tasks.

*Feature-map visualizations: original model vs. model with test-time registers.*

## Quick Start

```python
from transformers import AutoModel
from PIL import Image
import torch

# Load the complete model with all components
model = AutoModel.from_pretrained(
    "amildravid4292/clip-vitl14-test-time-registers",
    trust_remote_code=True
)

# Check what was loaded
print(f"Register tokens: {model.num_register_tokens}")
print(f"Neuron dict: {model.neuron_dict}")
print(f"Tokenizer available: {model.tokenizer is not None}")
print(f"Preprocessor available: {model.preprocessor is not None}")
print(f"Zero-shot classifier available: {model.zeroshot_classifier is not None}")
```

## Usage Examples

### Image Processing

```python
from PIL import Image

# Load and preprocess an image
image = Image.open("your_image.jpg")
image_tensor = model.preprocess_image(image).unsqueeze(0)

# Encode with test-time registers (the default configuration)
image_features = model.encode_image(image_tensor)

# To run inference with the original model, without test-time registers
image_features = model.encode_image(
    image_tensor,
    neuron_dict=None,
    num_register_tokens=0
)
```

### Text Processing

```python
# Tokenize text
text = ["a photo of a cat", "a photo of a dog"]
text_tokens = model.tokenize(text)

# Encode text
text_features = model.encode_text(text_tokens)
```

### Complete Pipeline

```python
import torch
from torchvision.datasets import ImageNet
from tqdm import tqdm
from transformers import AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model
model = AutoModel.from_pretrained(
    "amildravid4292/clip-vitl14-test-time-registers",
    trust_remote_code=True
)
model = model.to(device).bfloat16()
classifier = model.zeroshot_classifier.to(device).bfloat16()

# Load data
imagenet_dataset = ImageNet(root='/datasets/ilsvrc/current', split='val', transform=model.preprocessor)
loader = torch.utils.data.DataLoader(
    imagenet_dataset, batch_size=100, num_workers=4, pin_memory=True, shuffle=False
)

# Run zero-shot classification
with torch.no_grad():
    correct = [0, 0]
    for images, target in tqdm(loader):
        images = images.to(device).bfloat16()
        target = target.to(device)  # keep labels as integer class indices

        # Predict
        image_features = model.encode_image(images)
        image_features /= image_features.norm(dim=-1, keepdim=True)
        logits = 100. * image_features @ classifier
        pred = logits.argmax(dim=-1)
        correct[0] += (pred == target).sum().item()
        correct[1] += target.size(0)

# Top-1 accuracy
print(correct[0] / correct[1])
```
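As noted in the Image Processing example, passing `neuron_dict=None` and `num_register_tokens=0` to `encode_image` runs the unmodified model. The minimal sketch below reuses `model`, `classifier`, `loader`, and `device` from the Complete Pipeline to evaluate that register-free baseline for comparison; both settings reach roughly the 76.4% zero-shot accuracy reported above.

```python
# Minimal comparison sketch: evaluate the original model with test-time registers disabled,
# reusing `model`, `classifier`, `loader`, and `device` from the Complete Pipeline above.
with torch.no_grad():
    correct = [0, 0]
    for images, target in tqdm(loader):
        images = images.to(device).bfloat16()
        target = target.to(device)

        # neuron_dict=None and num_register_tokens=0 disable the test-time registers
        image_features = model.encode_image(images, neuron_dict=None, num_register_tokens=0)
        image_features /= image_features.norm(dim=-1, keepdim=True)

        pred = (100. * image_features @ classifier).argmax(dim=-1)
        correct[0] += (pred == target).sum().item()
        correct[1] += target.size(0)

# Top-1 accuracy of the register-free baseline
print(correct[0] / correct[1])
```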
## Advanced Usage

### Custom Neuron Modifications

```python
# Override the saved neuron configuration
custom_neuron_dict = {0: [10, 20, 30]}  # modify neurons 10, 20, 30 in layer 0

image_features = model.encode_image(
    image_tensor,
    num_register_tokens=4,
    neuron_dict=custom_neuron_dict
)
```

### Different Register Token Counts

```python
# Use a different number of register tokens
image_features = model.encode_image(
    image_tensor,
    num_register_tokens=8  # override the default
)
```

## Model Details

- **Base Architecture**: ViT-L/14
- **Training Data**: LAION-2B subset

### BibTeX entry and citation info

```bibtex
@misc{jiang2025visiontransformersdontneed,
      title={Vision Transformers Don't Need Trained Registers},
      author={Nick Jiang and Amil Dravid and Alexei Efros and Yossi Gandelsman},
      year={2025},
      eprint={2506.08010},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.08010},
}
```