---
library_name: transformers
license: mit
pipeline_tag: image-feature-extraction
tags:
- clip
---

# OpenCLIP ViT-L/14 with Test-Time Register

Register tokens were introduced as learnable tokens in [Vision Transformers Need Registers](https://arxiv.org/abs/2309.16588) to mitigate artifacts in the intermediate feature maps of ViTs.

In [Vision Transformers Don't Need *Trained* Registers](https://arxiv.org/abs/2506.08010), we introduced a training-free method for creating registers. These *test-time registers* serve a similar purpose to the original trained registers, but can be added post hoc to any ViT to mitigate artifacts, enhance model interpretability, and modestly improve downstream performance on tasks such as segmentation and depth estimation.
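For intuition, here is a rough, hypothetical sketch of the general idea: append an extra token to the patch sequence and, at selected layers, redirect the activations of a few chosen neurons into it so that high-norm artifacts collect in the register rather than in image patches. This is an illustration only, not the implementation used here (see the [repo](https://github.com/nickjiang2378/test-time-registers) for the real mechanism).

```python
import torch

# Rough, hypothetical sketch of a test-time register (illustration only).
def append_register(tokens: torch.Tensor) -> torch.Tensor:
    """Append one zero-initialized register token to a (batch, seq, dim) sequence."""
    b, _, d = tokens.shape
    register = torch.zeros(b, 1, d, dtype=tokens.dtype, device=tokens.device)
    return torch.cat([tokens, register], dim=1)

def redirect_neurons(hidden: torch.Tensor, neuron_indices: list) -> torch.Tensor:
    """At a chosen layer, move the selected neurons' activations from the patch
    tokens onto the register token (assumed to be the last token)."""
    hidden = hidden.clone()
    hidden[:, -1, neuron_indices] = hidden[:, :-1, neuron_indices].amax(dim=1)
    hidden[:, :-1, neuron_indices] = 0.0
    return hidden
```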
## Model description

The base model is [OpenCLIP-ViT-L-14-laion2B-s32B-b82K](https://huggingface.co/laion/CLIP-ViT-L-14-laion2B-s32B-b82K). With test-time registers, the model's internal representations are cleaner (see the figures below). Evaluating in bfloat16 with the environment from [here](https://github.com/nickjiang2378/test-time-registers/blob/main/environment.yml) gives an ImageNet-1k zero-shot accuracy of 76.4 for both the original model and the variant with test-time registers.

This model is intended to be used with this [repo](https://github.com/nickjiang2378/test-time-registers); use transformers==4.45.1. The model can also be used for fine-tuning or other downstream tasks.

<img src="https://huggingface.co/amildravid4292/clip-vitl14-test-time-registers/resolve/main/vitl14_attention.png" alt="Attention maps with and without test-time registers" width="600"/>
<img src="https://huggingface.co/amildravid4292/clip-vitl14-test-time-registers/resolve/main/vitl14_patchnorms.png" alt="Patch norms with and without test-time registers" width="600"/>

## Quick Start

```python
from transformers import AutoModel
from PIL import Image
import torch

# Load the complete model with all components
model = AutoModel.from_pretrained(
    "amildravid4292/clip-vitl14-test-time-registers",
    trust_remote_code=True
)

# Check what was loaded
print(f"Register tokens: {model.num_register_tokens}")
print(f"Neuron dict: {model.neuron_dict}")
print(f"Tokenizer available: {model.tokenizer is not None}")
print(f"Preprocessor available: {model.preprocessor is not None}")
print(f"Zero-shot classifier available: {model.zeroshot_classifier is not None}")
```

## Usage Examples

### Image Processing

```python
from PIL import Image

# Load and preprocess the image
image = Image.open("your_image.jpg")
image_tensor = model.preprocess_image(image).unsqueeze(0)

# Encode with test-time registers (the default behavior)
image_features = model.encode_image(image_tensor)

# To run inference with the original model, without test-time registers:
image_features = model.encode_image(
    image_tensor,
    neuron_dict=None,
    num_register_tokens=0
)
```
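
To process several images at once, you can stack the preprocessed tensors into a batch. A small sketch, assuming `preprocess_image` returns a single `(C, H, W)` tensor per image (as the `unsqueeze(0)` above suggests); the file paths are placeholders:

```python
import torch
from PIL import Image

# Hypothetical file paths; replace with your own images.
paths = ["image1.jpg", "image2.jpg", "image3.jpg"]

# Stack per-image tensors into a (batch, C, H, W) tensor and encode them together.
batch = torch.stack([model.preprocess_image(Image.open(p)) for p in paths])
batch_features = model.encode_image(batch)
```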

### Text Processing

```python
# Tokenize text
text = ["a photo of a cat", "a photo of a dog"]
text_tokens = model.tokenize(text)

# Encode text
text_features = model.encode_text(text_tokens)
```
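
The image and text features can be combined for zero-shot prediction on a single image. A minimal sketch, reusing `image_features` and `text_features` from the snippets above and assuming both are `(batch, dim)` tensors:

```python
import torch

# Normalize the features and compare them with cosine similarity.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# Higher similarity = better match between the image and each text prompt.
similarity = (image_features @ text_features.T).softmax(dim=-1)
print(similarity)  # probabilities over ["a photo of a cat", "a photo of a dog"]
```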

### Complete Pipeline

```python
import torch
from torchvision.datasets import ImageNet
from tqdm import tqdm
from transformers import AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# load model
model = AutoModel.from_pretrained('amildravid4292/clip-vitl14-test-time-registers', trust_remote_code=True)
model = model.to(device).bfloat16()
classifier = model.zeroshot_classifier.to(device).bfloat16()

# load data
imagenet_dataset = ImageNet(root='/datasets/ilsvrc/current', split='val', transform=model.preprocessor)
loader = torch.utils.data.DataLoader(imagenet_dataset, batch_size=100, num_workers=4, pin_memory=True, shuffle=False)

# run zero-shot classification
with torch.no_grad():
    correct, total = 0, 0
    for images, target in tqdm(loader):
        images = images.to(device).bfloat16()
        target = target.to(device)  # keep the class indices as integers

        # predict
        image_features = model.encode_image(images)
        image_features /= image_features.norm(dim=-1, keepdim=True)
        logits = 100. * image_features @ classifier

        pred = logits.argmax(dim=-1)
        correct += (pred == target).sum().item()
        total += target.size(0)

# top-1 accuracy
print(correct / total)
```
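
To reproduce the comparison against the original model reported above, the same loop can be run with test-time registers disabled, as in the Image Processing section. A sketch of the only change, wrapped as a hypothetical helper:

```python
# Score the original model (no test-time registers) with the same pipeline:
# reuse the loop above, but encode through this helper instead.
def encode_without_registers(model, images):
    # neuron_dict=None and num_register_tokens=0 recover the unmodified model.
    return model.encode_image(images, neuron_dict=None, num_register_tokens=0)
```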

## Advanced Usage

### Custom Neuron Modifications

```python
# Override the saved neuron configuration
custom_neuron_dict = {0: [10, 20, 30]}  # modify neurons 10, 20, 30 in layer 0

image_features = model.encode_image(
    image_tensor,
    num_register_tokens=4,
    neuron_dict=custom_neuron_dict
)
```
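
Rather than writing a dictionary from scratch, you can also start from the configuration saved with the checkpoint. A sketch, assuming `model.neuron_dict` uses the same `{layer: [neuron indices]}` format as `custom_neuron_dict` above:

```python
# Copy the saved configuration and add a few extra neurons to layer 0.
extended_neuron_dict = {layer: list(neurons) for layer, neurons in model.neuron_dict.items()}
extended_neuron_dict[0] = sorted(set(extended_neuron_dict.get(0, [])) | {10, 20, 30})

image_features = model.encode_image(
    image_tensor,
    neuron_dict=extended_neuron_dict
)
```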

### Different Register Token Counts

```python
# Use a different number of register tokens
image_features = model.encode_image(
    image_tensor,
    num_register_tokens=8  # override the default
)
```

## Model Details

- **Base Architecture**: ViT-L/14
- **Training Data**: LAION-2B subset

### BibTeX entry and citation info

```bibtex
@misc{jiang2025visiontransformersdontneed,
      title={Vision Transformers Don't Need Trained Registers},
      author={Nick Jiang and Amil Dravid and Alexei Efros and Yossi Gandelsman},
      year={2025},
      eprint={2506.08010},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.08010},
}
```