Model Card for SujitShelar/vjepa2-vitl-fpc16-256-hmdb51
Model Details
Model Description
V-JEPA 2 is a self-supervised video backbone trained on more than one million hours of internet video; Meta released checkpoints with a Something-Something v2 action-classification head. I freeze that backbone and fine-tune only the classifier head on the HMDB-51 benchmark (6 766 clips, 51 classes) for 5 epochs. The resulting model reaches 42.9 % top-1 accuracy on the held-out test split (see Evaluation).
- Developed by: Sujit Shelar
- Funded by: self-funded (personal compute credits)
- Shared by: Sujit Shelar
- Model type: V-JEPA 2 ViT-Large video encoder (16 frames, 256² input) with a 51-way classification head; vision-only (video), no text inputs
- Language(s) (NLP): not applicable (vision-only model)
- License: MIT (same as the upstream V-JEPA 2 weights)
- Finetuned from model: facebook/vjepa2-vitl-fpc16-256-ssv2
Model Sources [optional]
- Repository: https://huggingface.co/SujitShelar/vjepa2-vitl-fpc16-256-hmdb51
- Paper [optional]: V-JEPA 2 (Assran et al., 2025), the upstream self-supervised video model
- Demo [optional]: [More Information Needed]
Uses
Direct Use
- Rapid benchmarking or research on human-action recognition in academic settings.
- Feature extractor for video retrieval or robotics perception pipelines.
Downstream Use [optional]
- Starting point for further fine-tuning on custom action datasets (e.g. UCF-101).
Out-of-Scope Use
- Any safety-critical decision-making (medical, legal, real-time surveillance).
- Generation or captioning tasks – the model outputs only class logits.
Bias, Risks, and Limitations
HMDB-51 clips come largely from Hollywood movies and internet videos, so actions, environments and demographics are skewed towards Western-centric visual culture. The small dataset size (≈6 k clips) may lead to over-fitting and poor generalisation to unseen domains. Users should not rely on predictions for sensitive applications without additional validation.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
How to Get Started with the Model
Use the code below to get started with the model.
```python
import torch
from transformers import AutoVideoProcessor, AutoModelForVideoClassification

model_id = "SujitShelar/vjepa2-vitl-fpc16-256-hmdb51"
processor = AutoVideoProcessor.from_pretrained(model_id)
model = AutoModelForVideoClassification.from_pretrained(model_id)

# Sample one ~5-second clip as 16 frames, e.g. with torchvision.io or torchcodec; shape (T, C, H, W).
video = torch.randn(16, 3, 256, 256)  # dummy tensor standing in for real frames

inputs = processor(video.unsqueeze(0), return_tensors="pt")  # add batch dim -> (B, T, C, H, W)
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```
Training Details
Training Data
HMDB-51 (CC BY 4.0; 6 766 clips across 51 classes). I split the clips 70 / 15 / 15 % into train/val/test, stratified by class (4 736 / 1 015 / 1 015 clips).
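The split can be reproduced with a class-stratified sampler; below is a minimal sketch, assuming the clip paths and labels have already been listed (the helper name and seed are illustrative, not taken from the training code):

```python
from sklearn.model_selection import train_test_split

def stratified_70_15_15(paths, labels, seed=42):
    """Split clip paths 70/15/15 into train/val/test, stratified by class label."""
    train_p, rest_p, train_y, rest_y = train_test_split(
        paths, labels, test_size=0.30, stratify=labels, random_state=seed
    )
    val_p, test_p, val_y, test_y = train_test_split(
        rest_p, rest_y, test_size=0.50, stratify=rest_y, random_state=seed
    )
    return (train_p, train_y), (val_p, val_y), (test_p, test_y)
```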
Training Procedure
Setting | Value |
---|---|
Frozen layers | all V-JEPA 2 backbone blocks |
Trainable params | 1.2 M (classification head) |
Epochs | 5 |
Effective batch size | 16 (physical 4 × grad-accum 4) |
Optimiser | Adam (lr 1e-5) |
Augmentations | RandomResizedCrop to 256², RandomHorizontalFlip |
Hardware | 1× NVIDIA A100 80 GB |
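The head-only recipe in the table can be sketched as follows. This is illustrative rather than the exact training script: the dummy clips, the head-parameter keywords, and the loop bounds are assumptions, and the real run iterates over the HMDB-51 train split for 5 epochs.

```python
import torch
from transformers import AutoVideoProcessor, AutoModelForVideoClassification

base_id = "facebook/vjepa2-vitl-fpc16-256-ssv2"
processor = AutoVideoProcessor.from_pretrained(base_id)
model = AutoModelForVideoClassification.from_pretrained(
    base_id,
    num_labels=51,
    ignore_mismatched_sizes=True,  # swap the SSv2 action head for a fresh 51-way head
)

# Freeze the backbone; leave only the classification-head parameters trainable.
head_keywords = ("classifier", "pooler")  # assumed head module names; verify with model.named_parameters()
for name, param in model.named_parameters():
    param.requires_grad = any(k in name for k in head_keywords)

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5
)

accum_steps = 4  # physical batch 4 x grad-accum 4 = effective batch 16
model.train()
for step in range(accum_steps):  # stand-in for iterating over the real HMDB-51 dataloader
    videos = torch.randn(4, 16, 3, 256, 256)          # dummy batch of 16-frame clips
    inputs = processor(videos, return_tensors="pt")   # resize/crop to 256² and normalise
    inputs["labels"] = torch.randint(0, 51, (4,))     # dummy action labels
    loss = model(**inputs).loss
    (loss / accum_steps).backward()                   # accumulate gradients across micro-batches
optimizer.step()
optimizer.zero_grad()
```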
Preprocessing
Clips are sampled at 16 frames per video (torchcodec.clips_at_random_indices), resized/cropped to 256², then normalised by the processor.
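A minimal sketch of that sampling step, assuming torchcodec's decoder and sampler API (the file path is a placeholder):

```python
from torchcodec.decoders import VideoDecoder
from torchcodec.samplers import clips_at_random_indices
from transformers import AutoVideoProcessor

processor = AutoVideoProcessor.from_pretrained("SujitShelar/vjepa2-vitl-fpc16-256-hmdb51")

decoder = VideoDecoder("path/to/hmdb51_clip.avi")  # placeholder path
clips = clips_at_random_indices(decoder, num_clips=1, num_frames_per_clip=16)
frames = clips.data[0]                             # (T, C, H, W) decoded frames
inputs = processor(frames.unsqueeze(0), return_tensors="pt")  # resize/crop to 256² and normalise
```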
Training Hyperparameters
- Training regime: head-only fine-tuning with the hyper-parameters listed under Training Procedure above (numerical precision not recorded)
Speeds, Sizes, Times [optional]
[More Information Needed]
Evaluation
Testing Data, Factors & Metrics
Testing Data
The held-out 15 % test split of HMDB-51 described under Training Data (1 015 clips, stratified across the 51 classes).
Factors
[More Information Needed]
Metrics
Metric | Definition | Why we use it |
---|---|---|
Top-1 accuracy | Percentage of videos for which the predicted class label exactly matches the single ground-truth action. | HMDB-51 is a 51-way closed-set task; the community almost exclusively quotes Top-1, making our scores directly comparable to prior work. |
(optional) Top-5 accuracy | Video is considered correct if the ground-truth label appears in the five highest-probability classes. | Helpful when the correct class is semantically close to others (e.g. run vs walk), but not reported here to keep the head-only baseline in line with earlier papers. |
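For reference, a minimal sketch of how both metrics are computed from model logits (the tensors here are dummies, not tied to the evaluation script):

```python
import torch

def topk_accuracy(logits: torch.Tensor, labels: torch.Tensor, k: int = 1) -> float:
    """Fraction of clips whose ground-truth label is among the k highest-scoring classes."""
    topk = logits.topk(k, dim=-1).indices              # (N, k) predicted class indices
    hits = (topk == labels.unsqueeze(-1)).any(dim=-1)  # (N,) correct within top-k
    return hits.float().mean().item()

logits = torch.randn(1015, 51)           # dummy scores for the 1 015 test clips
labels = torch.randint(0, 51, (1015,))   # dummy ground-truth action labels
print(topk_accuracy(logits, labels, k=1), topk_accuracy(logits, labels, k=5))
```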
Evaluation protocol
The stratified 70 / 15 / 15 % split described under Training Data is used: the validation split for monitoring during training and the held-out test split for final reporting.
We sample one 16-frame clip per video at 256 × 256 resolution and apply a single-crop evaluation, following the JEPA model card. This produces a 5-D tensor (B, T, C, H, W) that the VJEPA2VideoProcessor converts to model inputs.
Accuracy is averaged over the full validation or test set; no class weighting is applied.
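A condensed sketch of that protocol (single clip, single crop per video; the dataloader is assumed to yield processor outputs plus a "labels" tensor):

```python
import torch

@torch.no_grad()
def evaluate(model, eval_loader) -> float:
    """Single-clip, single-crop top-1 accuracy over the full split, no class weighting."""
    model.eval()
    correct, total = 0, 0
    for batch in eval_loader:               # each batch: processor outputs + "labels"
        labels = batch.pop("labels")
        logits = model(**batch).logits      # (B, 51)
        correct += (logits.argmax(-1) == labels).sum().item()
        total += labels.numel()
    return correct / total
```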
Results
Split | Epochs | Top-1 accuracy |
---|---|---|
Validation | 1 → 5 | 14.2 % → 41.9 % |
Test (single-crop, single-clip) | — | 42.9 % |
Numbers come from the run shown in the training logs (runs/vjepa2_hmdb51).
How it compares
Method (ViT-L backbone unless noted) | Trainable params | Clips / crops at test | HMDB-51 Top-1 |
---|---|---|---|
This work – head-only JEPA-L | 1 M (0.3 %) | 1 ✕ 1 | 42.9 % |
Linear probe VideoMAE-B | 0.1 % | 1 ✕ 1 | 38.9 % (arxiv.org) |
Linear probe TimeSformer-B (ImageNet-pretrained) | head only (backbone frozen) | 3 ✕ 10 | 42.9 % (val) (github.com) |
AdaptFormer (last-block adapters) | 1 % | 1 ✕ 1 | 46.1 % (proceedings.neurips.cc) |
CVPT visual-prompt tuning | <1 % | 3 ✕ 10 | 57 % |
Full fine-tune TimeSformer-B | 100 % | 3 ✕ 10 | 64 % (proceedings.neurips.cc) |
Full fine-tune VideoMAE-B | 100 % | 3 ✕ 10 | 73 % (arxiv.org) |
VideoMAE V2-G (giant) | 100 % | 3 ✕ 10 | 86 % (arxiv.org) |
InBrwSANet (CNN + SA) | 100 % | 3 ✕ 10 | 77 % (researchgate.net) |
Take-away
42 – 43 % is in the upper range of published “backbone-frozen” baselines; unlocking a few transformer blocks, adding LoRA / prompt adapters, or running a full fine-tune typically raises HMDB-51 accuracy into the 55 – 70 % bracket. See the Bias, Risks & Limitations and Recommendations sections for caveats and upgrade suggestions.
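For example, unlocking the last few backbone blocks while keeping the rest frozen looks roughly like the sketch below. This is a hedged illustration: the module-name keywords are assumptions to check against model.named_parameters(), and no results for this variant are reported here.

```python
import torch
from transformers import AutoModelForVideoClassification

model = AutoModelForVideoClassification.from_pretrained(
    "SujitShelar/vjepa2-vitl-fpc16-256-hmdb51"
)

# Freeze everything, then unfreeze the head plus (assumed) last two encoder blocks of the ViT-L backbone.
unfreeze_keywords = ("classifier", "pooler", "layer.22.", "layer.23.")  # assumed module names
for name, param in model.named_parameters():
    param.requires_grad = any(k in name for k in unfreeze_keywords)

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5
)
```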
Model Examination [optional]
[More Information Needed]
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: 1× NVIDIA A100 80 GB
- Hours used: [More Information Needed]
- Cloud Provider: [More Information Needed]
- Compute Region: [More Information Needed]
- Carbon Emitted: [More Information Needed]
Technical Specifications
Model Architecture and Objective
- ViT-Large backbone (≈307 M parameters) within the V-JEPA 2 framework.
- 16 × 16 image patches over 256² input; 16-frame temporal tube.
- Classification head: two MLP layers (hidden 4 096 → 51 classes).
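A hedged sketch of what such a head looks like on top of the pooled backbone features (illustrative module only; the real head may differ in layer names and pooling):

```python
import torch.nn as nn

class ActionHead(nn.Module):
    """Two-layer MLP head mapping pooled ViT-L features (dim 1024) to 51 HMDB-51 logits."""
    def __init__(self, embed_dim: int = 1024, hidden_dim: int = 4096, num_classes: int = 51):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, pooled_features):   # (B, embed_dim) pooled video representation
        return self.mlp(pooled_features)  # (B, 51) class logits
```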
Compute Infrastructure
Single-GPU training; see Hardware and Software below.
Hardware
1× NVIDIA A100 80 GB
Software
PyTorch, torchvision, torchcodec, and Hugging Face Transformers (AutoVideoProcessor, AutoModelForVideoClassification)
Citation
BibTeX:
```bibtex
@misc{shelar2025vjepa2hmdb51,
  title        = {V-JEPA2 ViT-L fine-tuned on HMDB-51},
  author       = {Sujit Shelar},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/SujitShelar/vjepa2-vitl-fpc16-256-hmdb51}},
  note         = {Fine-tuned from Assran et al. (2025) V-JEPA 2.}
}
```
APA:
Shelar, S. (2025). V-JEPA2 ViT-L fine-tuned on HMDB-51 [Fine-tuned model]. Hugging Face. https://huggingface.co/SujitShelar/vjepa2-vitl-fpc16-256-hmdb51
Glossary [optional]
[More Information Needed]
More Information [optional]
[More Information Needed]
Model Card Authors [optional]
Sujit Shelar
Model Card Contact
[More Information Needed]