{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "02ruu54h4yLc" }, "source": [ "# V-JEPA 2" ] }, { "cell_type": "markdown", "metadata": { "id": "ol0IGYCd4hg4" }, "source": [ "V-JEPA 2 is a new open 1.2B video embedding model by Meta, which attempts to capture the physical world modelling through video ⏯️\n", "\n", "The model can be used for various tasks for video: fine-tuning for downstream tasks like video classification, or any task involving embeddings (similarity, retrieval and more!).\n", "\n", "You can check all V-JEPA 2 checkpoints and the datasets that come with this release [in this collection](collection_link). You can also read about the release [here](blog_link).\n", "\n", "In this notebook we will go through:\n", "1. Using V-JEPA 2 as feature extractor,\n", "2. Using V-JEPA 2 for video classification\n", "3. fine-tuning V-JEPA 2, on [UCF-101 action recognition dataset](https://huggingface.co/datasets/sayakpaul/ucf101-subset) using transformers.\n", "\n", "Let's go!" ] }, { "cell_type": "markdown", "metadata": { "id": "kIIBxYOA41Ga" }, "source": [ "We need to install accelerate, datasets and transformers' main branch." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "jDnyNeongiTP" }, "outputs": [], "source": [ "!pip install -U https://github.com/huggingface/transformers" ] }, { "cell_type": "markdown", "metadata": { "id": "cDzUdYIHXLbP" }, "source": [ "torchcodec 0.2.1 supports Colab, which is supported by different torch and torchvision versions, so let's update them too." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "qqoeMBmxrY-X", "outputId": "f9162900-d083-43f0-8c60-1db57a508f17" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m363.4/363.4 MB\u001b[0m \u001b[31m3.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m13.8/13.8 MB\u001b[0m \u001b[31m124.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m24.6/24.6 MB\u001b[0m \u001b[31m100.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m883.7/883.7 kB\u001b[0m \u001b[31m63.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m664.8/664.8 MB\u001b[0m \u001b[31m2.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m211.5/211.5 MB\u001b[0m \u001b[31m11.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m56.3/56.3 MB\u001b[0m \u001b[31m43.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m127.9/127.9 MB\u001b[0m \u001b[31m8.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m207.5/207.5 MB\u001b[0m \u001b[31m12.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m21.1/21.1 MB\u001b[0m \u001b[31m115.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[2K 
\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m766.4/766.4 kB\u001b[0m \u001b[31m21.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hTorch: 2.6.0+cu124\n" ] } ], "source": [ "!pip install -q torch==2.6.0 torchvision==0.21.0\n", "!pip install -q torchcodec==0.2.1\n", "\n", "import torch\n", "print(\"Torch:\", torch.__version__)\n", "from torchcodec.decoders import VideoDecoder # validate it works\n" ] }, { "cell_type": "markdown", "metadata": { "id": "lE7_UmAfXr_X" }, "source": [ "## Simple Inference\n", "\n", "**Inference with Embeddings**\n", "\n", "You can initialize the V-JEPA 2 with ViT Giant checkpoint as follows. Feel free to replace the ID with the one you want to use. Here's [the model collection](https://huggingface.co/collections/facebook/v-jepa-2-6841bad8413014e185b497a6)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "RHx7cQQvYMVW", "outputId": "e1e48780-38dc-4c49-d530-6adb42f0342b" }, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "/usr/local/lib/python3.11/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning: \n", "The secret `HF_TOKEN` does not exist in your Colab secrets.\n", "To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.\n", "You will be able to reuse this secret in all of your notebooks.\n", "Please note that authentication is recommended but still optional to access public models or datasets.\n", " warnings.warn(\n" ] } ], "source": [ "from transformers import AutoVideoProcessor, AutoModel\n", "\n", "model = AutoModel.from_pretrained(\"facebook/vjepa2-vitg-fpc64-384\").to(\"cuda\")\n", "processor = AutoVideoProcessor.from_pretrained(\"facebook/vjepa2-vitg-fpc64-384\")" ] }, { "cell_type": "markdown", "metadata": { "id": "QS0TNSY3YWrv" }, "source": [ "Simply infer to get the embeddings." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "xPXRC9nWXrRd", "outputId": "374f9ee2-6497-41cf-ce32-49955e726ec5" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "torch.Size([1, 9216, 1408])\n" ] } ], "source": [ "import torch\n", "from torchcodec.decoders import VideoDecoder\n", "import numpy as np\n", "\n", "video_url = \"https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/holding_phone.mp4\"\n", "vr = VideoDecoder(video_url)\n", "frame_idx = np.arange(0, 32) # choosing some frames. here, you can define more complex sampling strategy\n", "video_frames = vr.get_frames_at(indices=frame_idx).data # T x C x H x W\n", "video = processor(video_frames, return_tensors=\"pt\").to(model.device)\n", "with torch.no_grad():\n", " video_embeddings = model.get_vision_features(**video)\n", "\n", "print(video_embeddings.shape)\n", "del model\n" ] }, { "cell_type": "markdown", "metadata": { "id": "cwnHTJoVXwHs" }, "source": [ "**Inference for Video Classification**\n", "\n", "Meta also provides a model trained on SomethingSomething-v2, a dataset of human-object interactions with 174 classes." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Ul6MlUpJXy2k" }, "outputs": [], "source": [ "from transformers import VJEPA2ForVideoClassification, VJEPA2VideoProcessor\n", "\n", "model = VJEPA2ForVideoClassification.from_pretrained(\"facebook/vjepa2-vitl-fpc16-256-ssv2\").to(\"cuda\")\n", "processor = VJEPA2VideoProcessor.from_pretrained(\"facebook/vjepa2-vitl-fpc16-256-ssv2\")" ] }, { "cell_type": "markdown", "metadata": { "id": "guYY0V0Ob7li" }, "source": [ "We can pass the same frames to the new processor." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "CsKds6fmYmrv", "outputId": "d0214cb6-bed5-415e-f657-dea5e6db779d" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Touching (without moving) [part] of [something]\n" ] } ], "source": [ "inputs = processor(video_frames, return_tensors=\"pt\").to(model.device)\n", "\n", "with torch.no_grad():\n", " outputs = model(**inputs)\n", " logits = outputs.logits\n", "\n", "predicted_label = logits.argmax(-1).item()\n", "print(model.config.id2label[predicted_label])" ] }, { "cell_type": "markdown", "metadata": { "id": "sJNQs3VeckNG" }, "source": [ "## Data Preprocessing for Fine-tuning" ] }, { "cell_type": "markdown", "metadata": { "id": "_GF3Q4-66Csr" }, "source": [ "Let's load the dataset first. UCF-101 consists of 101 different actions covering from blowing candles to playing violin. We will use [a smaller subset of UCF-101](https://huggingface.co/datasets/sayakpaul/ucf101-subset)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "fuDD5FLs7pAw" }, "outputs": [], "source": [ "from huggingface_hub import hf_hub_download\n", "import tarfile\n", "import pathlib\n", "\n", "fpath = hf_hub_download(repo_id=\"sayakpaul/ucf101-subset\", filename=\"UCF101_subset.tar.gz\", repo_type=\"dataset\")\n", "\n", "with tarfile.open(fpath) as t:\n", " t.extractall(\".\")\n", "\n", "dataset_root_path = pathlib.Path(\"UCF101_subset\")\n", "all_video_file_paths = list(dataset_root_path.glob(\"**/*.avi\"))" ] }, { "cell_type": "markdown", "metadata": { "id": "Hiyyi4ai89D7" }, "source": [ "We gather different splits as lists to later create the `DataLoader` for training." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "6kkNNuR88qjM", "outputId": "e1e48214-b445-409e-ea2e-9e42f8810593" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Total videos: 405\n" ] } ], "source": [ "train_video_file_paths = []\n", "val_video_file_paths = []\n", "test_video_file_paths = []\n", "\n", "for video_file_path in all_video_file_paths:\n", " video_parts = video_file_path.parts\n", " if \"train\" in video_parts:\n", " train_video_file_paths.append(video_file_path)\n", " elif \"val\" in video_parts:\n", " val_video_file_paths.append(video_file_path)\n", " elif \"test\" in video_parts:\n", " test_video_file_paths.append(video_file_path)\n", " else:\n", " raise ValueError(f\"Unknown video part: {video_parts}\")\n", "\n", "video_count_train = len(train_video_file_paths)\n", "video_count_val = len(val_video_file_paths)\n", "video_count_test = len(test_video_file_paths)\n", "\n", "video_total = video_count_train + video_count_val + video_count_test\n", "print(f\"Total videos: {video_total}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "PAz1842v9ib9" }, "source": [ "We need to keep a class label to human-readable label mapping and number of classes to later initialize our model." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "CRkkxqMO9dHG" }, "outputs": [], "source": [ "class_labels = {path.parts[2] for path in all_video_file_paths}\n", "label2id = {label: i for i, label in enumerate(class_labels)}\n", "id2label = {i: label for label, i in label2id.items()}" ] }, { "cell_type": "markdown", "metadata": { "id": "N9kLrfnC9pWb" }, "source": [ "We will create a `CustomVideoDataset` class and initialize our train/test/validation sets for DataLoader." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "wszx85M98vwl" }, "outputs": [], "source": [ "from torch.utils.data import Dataset, DataLoader\n", "from torchcodec.decoders import VideoDecoder\n", "\n", "class CustomVideoDataset(Dataset):\n", " def __init__(self, video_file_paths, label2id):\n", " self.video_file_paths = video_file_paths\n", " self.label2id = label2id\n", "\n", " def __len__(self):\n", " return len(self.video_file_paths)\n", "\n", " def __getitem__(self, idx):\n", " video_path = self.video_file_paths[idx]\n", " label = video_path.parts[2]\n", " decoder = VideoDecoder(video_path)\n", " return decoder, self.label2id[label]\n", "\n", "train_ds = CustomVideoDataset(train_video_file_paths, label2id)\n", "val_ds = CustomVideoDataset(val_video_file_paths, label2id)\n", "test_ds = CustomVideoDataset(test_video_file_paths, label2id)" ] }, { "cell_type": "markdown", "metadata": { "id": "zrsTuN7TYxwW" }, "source": [ "V-JEPA 2 is an embedding model. To fine-tune it, we need to load the weights with a randomly initialized task-specific head put on top of them. For this, we can use `VJEPA2ForVideoClassification` class. During the initialization, we should pass in the mapping between the class labels and human readable labels, so the classification head has the same number of classes, and directly outputs human-readable labels with the confidence scores.\n", "\n", "On a separate note, if you want to only use embeddings, you can use `AutoModel` to do so. This can be used for e.g. 
video-to-video retrieval or calculating similarity between videos.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "7tVsJ3WWBONu" }, "source": [ "We can now define augmentations and create the data collator. This notebook is made for tutorial purposes, so we keep the augmentations minimal. We can finally initialize the DataLoader afterwards." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "I9biPx184MxY" }, "outputs": [], "source": [ "from torchcodec.samplers import clips_at_random_indices\n", "from torchvision.transforms import v2\n", "\n", "\n", "def collate_fn(\n", " samples, frames_per_clip, transforms\n", "):\n", " \"\"\"Sample clips and apply transforms to a batch.\"\"\"\n", " clips, labels = [], []\n", " for decoder, lbl in samples:\n", " clip = clips_at_random_indices(\n", " decoder,\n", " num_clips=1,\n", " num_frames_per_clip=frames_per_clip,\n", " num_indices_between_frames=3,\n", " ).data\n", " clips.append(clip)\n", " labels.append(lbl)\n", "\n", " videos = torch.cat(clips, dim=0)\n", " videos = transforms(videos)\n", " return videos, torch.tensor(labels)\n", "\n", "\n", "\n", "train_transforms = v2.Compose([\n", " v2.RandomResizedCrop((processor.crop_size[\"height\"], processor.crop_size[\"width\"])),\n", " v2.RandomHorizontalFlip(),\n", " ])\n", "eval_transforms = v2.Compose([\n", " v2.CenterCrop((processor.crop_size[\"height\"], processor.crop_size[\"width\"]))\n", " ])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Op3hPQCkW16F" }, "outputs": [], "source": [ "from functools import partial\n", "\n", "batch_size = 1\n", "num_workers = 8\n", "\n", "# DataLoaders\n", "train_loader = DataLoader(\n", " train_ds,\n", " batch_size=batch_size,\n", " shuffle=True,\n", " collate_fn=partial(collate_fn, frames_per_clip=model.config.frames_per_clip, transforms=train_transforms),\n", " num_workers=num_workers,\n", " pin_memory=True,\n", ")\n", "val_loader = DataLoader(\n", " val_ds,\n", " batch_size=batch_size,\n", " shuffle=False,\n", " collate_fn=partial(collate_fn, frames_per_clip=model.config.frames_per_clip, transforms=eval_transforms),\n", " num_workers=num_workers,\n", " pin_memory=True,\n", ")\n", "test_loader = DataLoader(\n", " test_ds,\n", " batch_size=batch_size,\n", " shuffle=False,\n", " collate_fn=partial(collate_fn, frames_per_clip=model.config.frames_per_clip, transforms=eval_transforms),\n", " num_workers=num_workers,\n", " pin_memory=True,\n", ")" ] }, { "cell_type": "markdown", "source": [ "## Model Training" ], "metadata": { "id": "EFmvjlMc28Yl" } }, { "cell_type": "markdown", "metadata": { "id": "zDl_a2VU5Tjc" }, "source": [ "Before training, we will login to HF Hub (to later push the model) and setup tensorboard." 
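] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before that, here is a minimal sketch of the model initialization described above: we load `VJEPA2ForVideoClassification` with our `label2id`/`id2label` mappings so the randomly initialized head has the right number of classes. Starting from the Something-Something-v2 checkpoint is an assumption for illustration (hence `ignore_mismatched_sizes=True`, since its head has 174 classes). If you run this, re-run the `DataLoader` cell afterwards so the collate function picks up this model's `frames_per_clip`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from transformers import VJEPA2ForVideoClassification, VJEPA2VideoProcessor\n", "\n", "# Checkpoint choice is an assumption for illustration; any V-JEPA 2 checkpoint works.\n", "ckpt = \"facebook/vjepa2-vitl-fpc16-256-ssv2\"\n", "\n", "processor = VJEPA2VideoProcessor.from_pretrained(ckpt)\n", "model = VJEPA2ForVideoClassification.from_pretrained(\n", "    ckpt,\n", "    label2id=label2id,\n", "    id2label=id2label,\n", "    ignore_mismatched_sizes=True,  # replace the 174-class SSv2 head with our smaller one\n", ").to(\"cuda\")"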
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 17, "referenced_widgets": [ "a303072eb4d64cfc9993053056e4ee3a", "324e6eed4d694c7699f0350fc09aa464", "a5450cd991cd4a7483cb3a2ccec107ae", "f6768e8c3b654024a62c9eb14c961962", "a79aaa66ef30468584123226e62c851f", "f41e7dfdb83647e3997f0fbdc5208de2", "61cc34a0c9604e798e196ffe3d44a132", "6d86b57066044c5eae89f773dfbf94d8", "17e05a2f86b24f1a97139cf7827a62cf", "bde755fab399457f92bec0551d717821", "d293ab19173a410789fa08ac175be10d", "faf4387a0f9b4ccf9ad43ca3593b4dd5", "81adb65a7c544469bb4cdd3b2d7eb132", "e1b8bd360bf74c96988d90eb87757017", "cc8c509b8ffe4557aaa2ae4af4c25008", "50da253981f14f42b47b409ddca9caea", "6cbe308d343443328502261eaea02496", "c26df21cb5354d93a008c541fb96ac70", "dd2a4abfd46c4574a668ff4ff5d9de25", "e36ae300929743f997e60c852c41b966" ] }, "id": "8-jbn-7W5fye", "outputId": "128d3a69-1624-471c-82f3-05b0e082fb24" }, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "VBox(children=(HTML(value='