TARA: Text Adapted Retrieval Alignment for Nuanced Video Retrieval

This repository contains inference and evaluation code for the TARA model based on the paper: Adapting MLLMs for Nuanced Video Retrieval

Project Page     GitHub Code     arXiv     Dataset on Hugging Face

TARA architecture

TARA Architecture: We use an EOL prompt to embed videos with an MLLM (Tarsier2-7B). We train the LLM weights with a contrastive loss on carefully crafted hard negatives to instill (i) temporal, (ii) negation, and (iii) multimodal nuances into the embedding space.
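The contrastive objective over hard negatives can be illustrated with a standard InfoNCE-style loss. This is a simplified, stdlib-only sketch for intuition, not the paper's exact training code; the temperature value and function names here are assumptions:

```python
import math

def info_nce(pos_sim, neg_sims, temperature=0.07):
    """Simplified InfoNCE loss: the positive similarity competes
    against hard-negative similarities in a softmax."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[0] / sum(exps))

# A confusable hard negative (0.85, close to the positive 0.9)
# produces a larger loss than an easy negative (0.2).
easy = info_nce(0.9, [0.2, 0.1])
hard = info_nce(0.9, [0.85, 0.1])
print(easy, hard)
```

Training on such hard negatives is what pushes temporally or semantically confusable pairs apart in the embedding space.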


Installation & Setup

First, clone the repository:

git clone https://github.com/bpiyush/tara.git
cd tara

1. Install Git LFS (if not already installed)

Git LFS is required to download the model weights.

Please install Git LFS from https://git-lfs.github.com/. For installation without sudo access, you can refer to this guide (I have not tested it myself, but it should work).

Check the installation:

git lfs --version
git lfs install

The output should be:

git-lfs/3.4.1 (GitHub; linux amd64; go 1.20.11; git 0898dcbc)
Updated Git hooks.
Git LFS initialized.

2. Download the Model Weights

git clone https://huggingface.co/bpiyush/TARA /path/to/download/tara
cd /path/to/download/tara

This will download all model weights (may take a few minutes depending on your connection).

3. Install Dependencies

  • Create/activate the conda env (skip if you already have it):
    conda create -n tara python=3.10 -y
    conda activate tara
    
  • Install CUDA 12.1 PyTorch wheels (adjust the index URL if you need a different CUDA/CPU build):
    pip install --index-url https://download.pytorch.org/whl/cu121 \
      torch==2.5.1+cu121 torchvision==0.20.1+cu121 torchaudio==2.5.1+cu121
    
  • Install the remaining model dependencies:
    pip install -r requirements.txt
    
  • (Optional) Verify the install:
    python -c "import torch, transformers; print(torch.cuda.is_available(), transformers.__version__)"
    

Quick Start

TARA is primarily designed to encode videos and texts into a joint embedding space using an MLLM.

import torch
from modeling_tara import TARA

model = TARA.from_pretrained(
    "/path/to/download/tara",  # Path where you downloaded the weights
    device_map='auto',
    torch_dtype=torch.bfloat16,
)
n_params = sum(p.numel() for p in model.model.parameters())
print(f"Number of parameters: {round(n_params/1e9, 3)}B")

# Embed a video
video_path = "./assets/folding_paper.mp4"
with torch.no_grad():
    video_emb = model.encode_vision(video_path).cpu().squeeze(0).float()
print(f"Video embedding shape: {video_emb.shape}")  # torch.Size([3584])

# Embed a text
text = ['someone is folding a paper', 'cutting a paper', 'someone is unfolding a paper']
with torch.no_grad():
    text_emb = model.encode_text(text).cpu().float()
print(f"Text embedding shape: {text_emb.shape}")  # torch.Size([3, 3584])

For a more detailed demo, see the script demo_usage.py. You can run it with:

python demo_usage.py --model_path /path/to/download/tara

The output should look something like this:

============================================================
TARA Model Demo
============================================================

[1/5] Loading model...
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00,  1.07it/s]
✓ Model loaded successfully!
Number of parameters: 8.291B
----------------------------------------------------------------------------------------------------

[2/5] Testing video encoding ...
From v4.47 onwards, when a model cache is to be returned, `generate` will return a `Cache` instance instead by default (as opposed to the legacy tuple of tuples format). If you want to keep returning the legacy format, please set `return_legacy_cache=True`.
✓ Video encoded successfully!
Video embedding shape: torch.Size([3584])
----------------------------------------------------------------------------------------------------

[3/5] Testing text encoding...
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
✓ Text encoded successfully!
Text: ['someone is folding a paper', 'cutting a paper', 'someone is unfolding a paper']
Text embedding shape: torch.Size([3, 3584])

[4/5] Computing video-text similarities...
✓ Similarities computed!
  'someone is folding a paper': 0.6488
  'cutting a paper': 0.3952
  'someone is unfolding a paper': 0.3009
----------------------------------------------------------------------------------------------------

[5/5] Testing negation example...
Image embedding shape: torch.Size([2, 3584])
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Text query:  ['an image of a cat but there is no dog in it']
Text-Image similarity: tensor([[0.5169, 0.3659]])
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Text query:  ['an image of a cat and a dog together']
Text-Image similarity: tensor([[0.4364, 0.6004]])
----------------------------------------------------------------------------------------------------

[Bonus] Testing composed video retrieval...
Source-Target similarity with edit: 0.757888674736023

============================================================
Demo completed successfully! 🎉
============================================================

Evaluation

Data Preparation

We release the nuanced video retrieval splits in the data/ folder. For ease of use, we have combined all the data for the (i) temporal, (ii) negation, and (iii) multimodal nuances into a single file, where each entry is a video, text, composed video-text, image, etc.

data
├── nuanced_retrieval_inputs-test.csv # List of examples to embed (video, text, composed video-text, etc.) for test set
├── nuanced_retrieval_inputs-val.csv # List of examples to embed (video, text, composed video-text, etc.) for validation set
├── nuanced_retrieval_labels-test.json # Labels for test set
└── nuanced_retrieval_labels-val.json # Labels for validation set

An example input row looks like this:

{
  'id': '138629', 
  'value': '138629',
  'nuance': 'time',
  'source': 'cia-ssv2',
  'modality': 'video',
}

where id is the unique identifier, value is the actual value (e.g., for a text caption, the id can differ while value stores the actual caption), nuance is the type of nuance, source is the source of the example (e.g., cia-ssv2 for SSv2), and modality is the modality of the example (e.g., video or text).

The corresponding label looks like this:

['12055391_1.0']

which denotes the id of the text associated with the video.
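To make the pairing concrete, here is a hedged stdlib sketch of how an input row maps to its label list (the in-memory dicts stand in for rows of the CSV and entries of the JSON label file; the real evaluation code may organize this differently):

```python
# Hypothetical in-memory stand-ins for one CSV input row and the label file
inputs = [
    {"id": "138629", "value": "138629", "nuance": "time",
     "source": "cia-ssv2", "modality": "video"},
]
labels = {"138629": ["12055391_1.0"]}  # video id -> ids of matching texts

# Each query's positives are looked up by its id
for row in inputs:
    positives = labels.get(row["id"], [])
    print(row["id"], row["modality"], "->", positives)
```

In practice the CSV would be read with `csv.DictReader` and the labels with `json.load`.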

Finally, set the right paths to the data directories in evals/compute_embeddings.py based on your local setup.

Embedding Computation

First, you need to compute the embeddings for the entire dataset. You can do this by running the following script:

python evals/compute_embeddings.py \
--model_path /path/to/download/tara \
--csv_path ./data/nuanced_retrieval_inputs-val.csv \
--model_name tara_7b

Then, run the script to compute retrieval metrics.

python evals/compute_metrics.py \
--model_path /path/to/download/tara \
--csv_path ./data/nuanced_retrieval_inputs-val.csv \
--lab_path ./data/nuanced_retrieval_labels-val.json \
--model_name tara_7b
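Retrieval metrics such as Recall@K can be computed from the similarity matrix between query and gallery embeddings. A simplified stdlib sketch of that computation (the actual evals/compute_metrics.py implementation may differ):

```python
def recall_at_k(sim_rows, gt_indices, k=1):
    """sim_rows: per-query lists of similarity scores over the gallery;
    gt_indices: index of the correct gallery item for each query."""
    hits = 0
    for scores, gt in zip(sim_rows, gt_indices):
        # Rank gallery items by descending similarity
        ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
        if gt in ranked[:k]:
            hits += 1
    return hits / len(sim_rows)

# Two queries, both ranking their ground-truth item first -> Recall@1 = 1.0
sims = [[0.9, 0.2], [0.1, 0.8]]
print(recall_at_k(sims, [0, 1], k=1))  # 1.0
```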

General evaluation: MMEB-V2 (Meng et al.)

We evaluate on the video classification and video retrieval tasks in MMEB-V2 to demonstrate the generalizability of TARA.

TODO

Citation

If you use this model, please cite:

@article{tara2025,
  title={Adapting MLLMs for Nuanced Video Retrieval},
  author={Bagad, Piyush and Zisserman, Andrew},
  journal={arXiv preprint arXiv:2512.13511},
  year={2025}
}
@article{bagad2025chirality,
  title={Chirality in Action: Time-Aware Video Representation Learning by Latent Straightening},
  author={Bagad, Piyush and Zisserman, Andrew},
  journal={arXiv preprint arXiv:2509.08502},
  year={2025}
}

License

Apache 2.0
