# Use V-JEPA 2

V-JEPA 2 is a new open 1.2B-parameter video embedding model from Meta that aims to learn a model of the physical world from video ⏯️

The model can be used for a range of video tasks: it can be fine-tuned for downstream tasks such as video classification, or used directly for anything that relies on embeddings (similarity, retrieval and more!).
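
To make the embedding use case concrete, here is a minimal sketch of similarity-based retrieval. It uses random tensors as stand-ins for model outputs, and it assumes those outputs have shape `(num_videos, num_tokens, hidden_dim)`. Mean pooling over tokens is an illustrative choice made here, not something the release prescribes, and the helper names `pool_embeddings` and `rank_by_similarity` are hypothetical.

```python
import torch
import torch.nn.functional as F


def pool_embeddings(token_embeddings: torch.Tensor) -> torch.Tensor:
    """Mean-pool (batch, tokens, dim) to (batch, dim), then L2-normalize.

    Mean pooling is an assumption made for this sketch.
    """
    pooled = token_embeddings.mean(dim=1)
    return F.normalize(pooled, dim=-1)


def rank_by_similarity(query: torch.Tensor, gallery: torch.Tensor) -> torch.Tensor:
    """Return gallery indices ordered from most to least similar to the query."""
    # Both inputs are unit vectors, so the dot product equals cosine similarity.
    scores = gallery @ query
    return torch.argsort(scores, descending=True)


torch.manual_seed(0)
# Random stand-ins for the token embeddings of 4 videos (16 tokens, 32 dims each).
gallery_tokens = torch.randn(4, 16, 32)
gallery = pool_embeddings(gallery_tokens)

# Use video 2 as the query; it should rank itself first.
query = gallery[2]
ranking = rank_by_similarity(query, gallery)
print(ranking.tolist())
```

With real data, the random tensors would be replaced by embeddings extracted from the model, moved to a common device first.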

You can check all V-JEPA 2 checkpoints and the datasets that come with this release [in this collection](https://huggingface.co/collections/facebook/v-jepa-2-6841bad8413014e185b497a6).

We need to install the release-specific branch of transformers.

In [None]:
!pip install -q git+https://github.com/huggingface/transformers@v4.52.4-VJEPA-2-preview

In [None]:
from huggingface_hub import login # to later push the model

login()

As of now, Colab supports torchcodec==0.2.1, which is compatible with torch==2.6.0, so we pin both.

In [None]:
!pip install -q torch==2.6.0 torchvision==0.21.0
!pip install -q torchcodec==0.2.1

import torch
print("Torch:", torch.__version__)
from torchcodec.decoders import VideoDecoder  # verify that torchcodec imports correctly

## Initialize the model and the processor

In [None]:
from transformers import AutoVideoProcessor, AutoModel

hf_repo = "facebook/vjepa2-vitg-fpc64-384"

model = AutoModel.from_pretrained(hf_repo).to("cuda")
processor = AutoVideoProcessor.from_pretrained(hf_repo)

## Extract video embeddings from the model

In [None]:
import torch
from torchcodec.decoders import VideoDecoder
import numpy as np

video_url = "https://huggingface.co/datasets/nateraw/kinetics-mini/resolve/main/val/archery/-Qz25rXdMjE_000014_000024.mp4"
vr = VideoDecoder(video_url)
frame_idx = np.arange(0, 64)  # take the first 64 frames; a more elaborate sampling strategy can be defined here
video = vr.get_frames_at(indices=frame_idx).data  # T x C x H x W
video = processor(video, return_tensors="pt").to(model.device)
with torch.no_grad():
    video_embeddings = model.get_vision_features(**video)

print(video_embeddings.shape)
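
The cell above takes the first 64 frames of the clip. As one example of a different sampling strategy, the sketch below spreads a fixed number of indices evenly over the whole video. The helper `uniform_frame_indices` is hypothetical and not part of the original notebook, torchcodec, or transformers; it only needs the total frame count.

```python
def uniform_frame_indices(num_frames_in_video: int, num_samples: int = 64) -> list[int]:
    """Return `num_samples` frame indices spread evenly across the whole clip."""
    if num_frames_in_video <= 0:
        raise ValueError("the video must contain at least one frame")
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    if num_samples == 1:
        return [0]
    last = num_frames_in_video - 1
    # Evenly spaced positions from the first to the last frame, rounded to
    # valid integer indices. Clips shorter than `num_samples` repeat frames,
    # so the output length stays fixed.
    return [round(i * last / (num_samples - 1)) for i in range(num_samples)]


# Example: a 300-frame clip sampled down to 64 frames.
indices = uniform_frame_indices(300, 64)
print(len(indices), indices[0], indices[-1])
```

Assuming `len(vr)` returns the number of frames in the decoded video, the result could replace the `np.arange` line: `frame_idx = uniform_frame_indices(len(vr))`.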