Model Card for SujitShelar/vjepa2-vitl-fpc16-256-hmdb51
Model Details
Model Description
V-JEPA 2 is a self-supervised video backbone trained on more than one million hours of internet video; Meta released checkpoints with a Something-Something v2 action-classification head. I freeze that backbone and fine-tune only the classifier head on the HMDB-51 benchmark (6 766 clips, 51 classes) for 5 epochs. The resulting model reaches 42.9 % top-1 accuracy on the held-out test split (see Evaluation).
- Developed by: Sujit Shelar
- Funded by: self-funded (personal compute credits)
- Shared by: Sujit Shelar
- Model type: V-JEPA 2 ViT-Large video encoder (16 frames, 256² input) with a 51-way classification head; vision-only (video), no text inputs
- Language(s) (NLP): not applicable (vision-only model)
- License: MIT (same as the upstream V-JEPA 2 weights)
- Finetuned from model: facebook/vjepa2-vitl-fpc16-256-ssv2
Model Sources [optional]
- Repository: https://huggingface.co/SujitShelar/vjepa2-vitl-fpc16-256-hmdb51
- Paper [optional]: V-JEPA 2 (Assran et al., 2025), the upstream self-supervised video model
- Demo [optional]: [More Information Needed]
Uses
Direct Use
- Rapid benchmarking or research on human-action recognition in academic settings.
- Feature extractor for video retrieval or robotics perception pipelines.
Downstream Use [optional]
- Starting point for further fine-tuning on custom action datasets (e.g. UCF-101).
Out-of-Scope Use
- Any safety-critical decision-making (medical, legal, real-time surveillance).
- Generation or captioning tasks – the model outputs only class logits.
Bias, Risks, and Limitations
HMDB-51 clips come largely from Hollywood movies and internet videos, so actions, environments and demographics are skewed towards Western-centric visual culture. The small dataset size (≈6 k clips) may lead to over-fitting and poor generalisation to unseen domains. Users should not rely on predictions for sensitive applications without additional validation.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
How to Get Started with the Model
Use the code below to get started with the model.
```python
import torch
from transformers import AutoVideoProcessor, AutoModelForVideoClassification

model_id = "SujitShelar/vjepa2-vitl-fpc16-256-hmdb51"
processor = AutoVideoProcessor.from_pretrained(model_id)
model = AutoModelForVideoClassification.from_pretrained(model_id)

# Sample one ~5-second clip as 16 frames, e.g. with torchvision.io or torchcodec; shape (T, C, H, W).
video = torch.randn(16, 3, 256, 256)  # dummy tensor standing in for real frames

inputs = processor(video.unsqueeze(0), return_tensors="pt")  # add batch dim -> (B, T, C, H, W)
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```
Training Details
Training Data
HMDB-51 (CC BY 4.0; 6 766 clips across 51 classes). I split the clips 70 / 15 / 15 % into train/val/test, stratified by class (4 736 / 1 015 / 1 015 clips).
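The split can be reproduced with a class-stratified sampler; below is a minimal sketch, assuming the clip paths and labels have already been listed (the helper name and seed are illustrative, not taken from the training code):

```python
from sklearn.model_selection import train_test_split

def stratified_70_15_15(paths, labels, seed=42):
    """Split clip paths 70/15/15 into train/val/test, stratified by class label."""
    train_p, rest_p, train_y, rest_y = train_test_split(
        paths, labels, test_size=0.30, stratify=labels, random_state=seed
    )
    val_p, test_p, val_y, test_y = train_test_split(
        rest_p, rest_y, test_size=0.50, stratify=rest_y, random_state=seed
    )
    return (train_p, train_y), (val_p, val_y), (test_p, test_y)
```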
Training Procedure
Setting | Value |
---|---|
Frozen layers | all V-JEPA 2 backbone blocks |
Trainable params | 1.2 M (classification head) |
Epochs | 5 |
Effective batch size | 16 (physical 4 × grad-accum 4) |
Optimiser | Adam (lr 1e-5) |
Augmentations | RandomResizedCrop to 256², RandomHorizontalFlip |
Hardware | 1× NVIDIA A100 80 GB |
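The head-only recipe in the table can be sketched as follows. This is illustrative rather than the exact training script: the dummy clips, the head-parameter keywords, and the loop bounds are assumptions, and the real run iterates over the HMDB-51 train split for 5 epochs.

```python
import torch
from transformers import AutoVideoProcessor, AutoModelForVideoClassification

base_id = "facebook/vjepa2-vitl-fpc16-256-ssv2"
processor = AutoVideoProcessor.from_pretrained(base_id)
model = AutoModelForVideoClassification.from_pretrained(
    base_id,
    num_labels=51,
    ignore_mismatched_sizes=True,  # swap the SSv2 action head for a fresh 51-way head
)

# Freeze the backbone; leave only the classification-head parameters trainable.
head_keywords = ("classifier", "pooler")  # assumed head module names; verify with model.named_parameters()
for name, param in model.named_parameters():
    param.requires_grad = any(k in name for k in head_keywords)

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5
)

accum_steps = 4  # physical batch 4 x grad-accum 4 = effective batch 16
model.train()
for step in range(accum_steps):  # stand-in for iterating over the real HMDB-51 dataloader
    videos = torch.randn(4, 16, 3, 256, 256)          # dummy batch of 16-frame clips
    inputs = processor(videos, return_tensors="pt")   # resize/crop to 256² and normalise
    inputs["labels"] = torch.randint(0, 51, (4,))     # dummy action labels
    loss = model(**inputs).loss
    (loss / accum_steps).backward()                   # accumulate gradients across micro-batches
optimizer.step()
optimizer.zero_grad()
```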
Preprocessing
Clips are sampled at 16 frames per video (torchcodec.clips_at_random_indices), resized/cropped to 256², then normalised by the processor.
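A minimal sketch of that sampling step, assuming torchcodec's decoder and sampler API (the file path is a placeholder):

```python
from torchcodec.decoders import VideoDecoder
from torchcodec.samplers import clips_at_random_indices
from transformers import AutoVideoProcessor

processor = AutoVideoProcessor.from_pretrained("SujitShelar/vjepa2-vitl-fpc16-256-hmdb51")

decoder = VideoDecoder("path/to/hmdb51_clip.avi")  # placeholder path
clips = clips_at_random_indices(decoder, num_clips=1, num_frames_per_clip=16)
frames = clips.data[0]                             # (T, C, H, W) decoded frames
inputs = processor(frames.unsqueeze(0), return_tensors="pt")  # resize/crop to 256² and normalise
```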
Training Hyperparameters
- Training regime: head-only fine-tuning with the hyper-parameters listed under Training Procedure above (numerical precision not recorded)
Speeds, Sizes, Times [optional]
[More Information Needed]
Evaluation
Testing Data, Factors & Metrics
Testing Data
The held-out 15 % test split of HMDB-51 described under Training Data (1 015 clips, stratified across the 51 classes).
Factors
[More Information Needed]
Metrics
Metric | Definition | Why we use it |
---|---|---|
Top-1 accuracy | Percentage of videos for which the predicted class label exactly matches the single ground-truth action. | HMDB-51 is a 51-way closed-set task; the community almost exclusively quotes Top-1, making our scores directly comparable to prior work. |
(optional) Top-5 accuracy | Video is considered correct if the ground-truth label appears in the five highest-probability classes. | Helpful when the correct class is semantically close to others (e.g. run vs walk), but not reported here to keep the head-only baseline in line with earlier papers. |
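For reference, a minimal sketch of how both metrics are computed from model logits (the tensors here are dummies, not tied to the evaluation script):

```python
import torch

def topk_accuracy(logits: torch.Tensor, labels: torch.Tensor, k: int = 1) -> float:
    """Fraction of clips whose ground-truth label is among the k highest-scoring classes."""
    topk = logits.topk(k, dim=-1).indices              # (N, k) predicted class indices
    hits = (topk == labels.unsqueeze(-1)).any(dim=-1)  # (N,) correct within top-k
    return hits.float().mean().item()

logits = torch.randn(1015, 51)           # dummy scores for the 1 015 test clips
labels = torch.randint(0, 51, (1015,))   # dummy ground-truth action labels
print(topk_accuracy(logits, labels, k=1), topk_accuracy(logits, labels, k=5))
```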
Evaluation protocol
The stratified 70 / 15 / 15 % split described under Training Data is used: the validation split for monitoring during training and the held-out test split for final reporting.
We sample one 16-frame clip per video at 256 × 256 resolution and apply a single-crop evaluation, following the JEPA model card. This produces a 5-D tensor (B, T, C, H, W) that the VJEPA2VideoProcessor converts to model inputs.
Accuracy is averaged over the full validation or test set; no class weighting is applied.
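A condensed sketch of that protocol (single clip, single crop per video; the dataloader is assumed to yield processor outputs plus a "labels" tensor):

```python
import torch

@torch.no_grad()
def evaluate(model, eval_loader) -> float:
    """Single-clip, single-crop top-1 accuracy over the full split, no class weighting."""
    model.eval()
    correct, total = 0, 0
    for batch in eval_loader:               # each batch: processor outputs + "labels"
        labels = batch.pop("labels")
        logits = model(**batch).logits      # (B, 51)
        correct += (logits.argmax(-1) == labels).sum().item()
        total += labels.numel()
    return correct / total
```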
Results
Split | Epochs | Top-1 accuracy |
---|---|---|
Validation | 1 → 5 | 14.2 % → 41.9 % |
Test (single-crop, single-clip) | — | 42.9 % |
Numbers come from the run shown in the training logs (runs/vjepa2_hmdb51).
How it compares
Method (ViT-L backbone unless noted) | Trainable params | Clips / crops at test | HMDB-51 Top-1 |
---|---|---|---|
This work – head-only JEPA-L | 1 M (0.3 %) | 1 ✕ 1 | 42.9 % |
Linear probe VideoMAE-B | 0.1 % | 1 ✕ 1 | 38.9 % (arxiv.org) |
Linear probe TimeSformer-B (ImageNet-pretrained) | head only (backbone frozen) | 3 ✕ 10 | 42.9 % (val) (github.com) |
AdaptFormer (last-block adapters) | 1 % | 1 ✕ 1 | 46.1 % (proceedings.neurips.cc) |
CVPT visual-prompt tuning | <1 % | 3 ✕ 10 | 57 % |
Full fine-tune TimeSformer-B | 100 % | 3 ✕ 10 | 64 % (proceedings.neurips.cc) |
Full fine-tune VideoMAE-B | 100 % | 3 ✕ 10 | 73 % (arxiv.org) |
VideoMAE V2-G (giant) | 100 % | 3 ✕ 10 | 86 % (arxiv.org) |
InBrwSANet (CNN + SA) | 100 % | 3 ✕ 10 | 77 % (researchgate.net) |
Take-away
42 – 43 % is in the upper range of published “backbone-frozen” baselines; unlocking a few transformer blocks, adding LoRA / prompt adapters, or running a full fine-tune typically raises HMDB-51 accuracy into the 55 – 70 % bracket. See the Bias, Risks & Limitations and Recommendations sections for caveats and upgrade suggestions.
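For example, unlocking the last few backbone blocks while keeping the rest frozen looks roughly like the sketch below. This is a hedged illustration: the module-name keywords are assumptions to check against model.named_parameters(), and no results for this variant are reported here.

```python
import torch
from transformers import AutoModelForVideoClassification

model = AutoModelForVideoClassification.from_pretrained(
    "SujitShelar/vjepa2-vitl-fpc16-256-hmdb51"
)

# Freeze everything, then unfreeze the head plus (assumed) last two encoder blocks of the ViT-L backbone.
unfreeze_keywords = ("classifier", "pooler", "layer.22.", "layer.23.")  # assumed module names
for name, param in model.named_parameters():
    param.requires_grad = any(k in name for k in unfreeze_keywords)

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5
)
```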
Model Examination [optional]
[More Information Needed]
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: 1× NVIDIA A100 80 GB
- Hours used: [More Information Needed]
- Cloud Provider: [More Information Needed]
- Compute Region: [More Information Needed]
- Carbon Emitted: [More Information Needed]
Technical Specifications
Model Architecture and Objective
- ViT-Large backbone (≈307 M parameters) within the V-JEPA 2 framework.
- 16 × 16 image patches over 256² input; 16-frame temporal tube.
- Classification head: two MLP layers (hidden 4 096 → 51 classes).
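A hedged sketch of what such a head looks like on top of the pooled backbone features (illustrative module only; the real head may differ in layer names and pooling):

```python
import torch.nn as nn

class ActionHead(nn.Module):
    """Two-layer MLP head mapping pooled ViT-L features (dim 1024) to 51 HMDB-51 logits."""
    def __init__(self, embed_dim: int = 1024, hidden_dim: int = 4096, num_classes: int = 51):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, pooled_features):   # (B, embed_dim) pooled video representation
        return self.mlp(pooled_features)  # (B, 51) class logits
```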
Compute Infrastructure
Single-GPU training; see Hardware and Software below.
Hardware
1× NVIDIA A100 80 GB
Software
PyTorch, torchvision, torchcodec, and Hugging Face Transformers (AutoVideoProcessor, AutoModelForVideoClassification)
Citation
BibTeX:
```bibtex
@misc{shelar2025vjepa2hmdb51,
  title        = {V-JEPA2 ViT-L fine-tuned on HMDB-51},
  author       = {Sujit Shelar},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/SujitShelar/vjepa2-vitl-fpc16-256-hmdb51}},
  note         = {Fine-tuned from Assran et al. (2025) V-JEPA 2.}
}
```
APA:
Shelar, S. (2025). V-JEPA2 ViT-L fine-tuned on HMDB-51 [Fine-tuned model]. Hugging Face. https://huggingface.co/SujitShelar/vjepa2-vitl-fpc16-256-hmdb51
Glossary [optional]
[More Information Needed]
More Information [optional]
[More Information Needed]
Model Card Authors [optional]
Sujit Shelar
Model Card Contact
[More Information Needed]