Model Card for VJEPA2 Fine-tuned Video Classification Model

This model is a fine-tuned version of Meta AI's VJEPA2 (Video Joint Embedding Predictive Architecture) for video classification tasks. It was fine-tuned efficiently by freezing the pre-trained backbone and using gradient accumulation.

Model Details

Model Description

This is a VJEPA2 model adapted for video classification. It pairs the pre-trained VJEPA2 backbone with a custom classification head; during fine-tuning the backbone was kept frozen and gradients were accumulated over several steps to keep memory use low.

  • Developed by: Yiqiao Yin
  • Funded by: Yiqiao Yin
  • Model type: Video Classification
  • Language(s) (NLP): Not applicable (vision-only model)
  • License: Apache 2.0
  • Finetuned from model: qubvel-hf/vjepa2-vitl-fpc16-256-ssv2

Model Sources

  • Base model: https://huggingface.co/qubvel-hf/vjepa2-vitl-fpc16-256-ssv2

Uses

Direct Use

This model can be directly used for video classification tasks. It processes video inputs and outputs class predictions based on the learned representations from the VJEPA2 backbone.

Downstream Use

The model can be further fine-tuned for specific video understanding tasks such as:

  • Action recognition
  • Video content classification
  • Temporal activity detection
  • Video scene understanding

Out-of-Scope Use

This model is not intended for:

  • Real-time video processing applications requiring sub-second inference
  • High-resolution video analysis beyond the training resolution
  • Audio-based video classification (visual features only)
  • Video generation or synthesis tasks

Bias, Risks, and Limitations

The model inherits biases from the original VJEPA2 pre-training data and may exhibit performance variations across different video domains, lighting conditions, and demographic representations in video content.

Recommendations

Users should evaluate the model's performance on their specific use case and consider additional fine-tuning if the target domain differs significantly from the training data. Monitor for potential biases in video content classification across different demographic groups.

How to Get Started with the Model

Use the code below to get started with the model:

import numpy as np
import torch
from transformers import VJEPA2VideoProcessor, VJEPA2ForVideoClassification

# Label mappings for your classification task (two placeholder classes here)
label2id = {"class_a": 0, "class_b": 1}
id2label = {v: k for k, v in label2id.items()}

# Load the model and processor
model_name = "qubvel-hf/vjepa2-vitl-fpc16-256-ssv2"
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = VJEPA2VideoProcessor.from_pretrained(model_name)
model = VJEPA2ForVideoClassification.from_pretrained(
    model_name,
    torch_dtype=torch.float32,
    label2id=label2id,
    id2label=id2label,
    ignore_mismatched_sizes=True,  # replaces the original SSv2 classification head
).to(device)

# video_data: decoded frames as (num_frames, H, W, 3); random placeholder here
video_data = np.random.randint(0, 256, (16, 256, 256, 3), dtype=np.uint8)

# Process video and get predictions
inputs = processor(video_data, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(**inputs)
probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(id2label[probabilities.argmax(-1).item()])

Training Details

Training Data

The model was fine-tuned on a custom video classification dataset. The specific dataset details depend on the user's implementation and target classification task.

Training Procedure

Preprocessing

Videos are processed using the VJEPA2VideoProcessor (see the sketch after this list), which handles:

  • Video frame extraction and normalization
  • Temporal sampling
  • Spatial resizing and augmentation
  • Tensor conversion for model input
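
As a rough illustration of this preprocessing, the sketch below runs placeholder frames (random pixel data standing in for a decoded clip) through the processor and prints the resulting tensor shapes; the exact output keys and shapes come from the processor's configuration.

import numpy as np
from transformers import VJEPA2VideoProcessor

processor = VJEPA2VideoProcessor.from_pretrained("qubvel-hf/vjepa2-vitl-fpc16-256-ssv2")

# Placeholder for decoded video frames: (num_frames, height, width, channels)
frames = np.random.randint(0, 256, (16, 480, 640, 3), dtype=np.uint8)

inputs = processor(frames, return_tensors="pt")
for name, tensor in inputs.items():
    print(name, tuple(tensor.shape))  # frames are resampled/resized for the model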

Training Hyperparameters

  • Training regime: FP32 precision
  • Optimizer: Adam
  • Learning rate: 1e-5
  • Epochs: 5
  • Gradient accumulation steps: 4
  • Backbone freezing: VJEPA2 backbone parameters frozen, only classification head trained
  • Batch processing: Gradient accumulation for an effectively larger batch size (see the loop sketch under Training Configuration)

Training Configuration

# Freeze backbone parameters
for param in model.vjepa2.parameters():
    param.requires_grad = False

# Only train classification head
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-5)
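
Putting the pieces together, here is a minimal sketch of the accumulation loop implied by the hyperparameters above. It continues from the snippet just shown (model and optimizer already defined) and assumes a hypothetical train_loader that yields processor outputs plus a labels tensor; it is an illustration, not the exact training script.

accumulation_steps = 4
num_epochs = 5

model.train()
for epoch in range(num_epochs):
    optimizer.zero_grad()
    for step, batch in enumerate(train_loader):  # hypothetical DataLoader
        batch = {k: v.to(model.device) for k, v in batch.items()}
        outputs = model(**batch)  # passing "labels" makes the model return a loss
        loss = outputs.loss / accumulation_steps  # average over micro-batches
        loss.backward()
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()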

Speeds, Sizes, Times

  • Training time: Depends on dataset size and hardware
  • GPU memory: Optimized through gradient accumulation
  • Effective batch size: Original batch size × 4 (due to gradient accumulation)

Evaluation

Testing Data, Factors & Metrics

Testing Data

The model is evaluated on held-out test sets from the training dataset, with validation performed after each epoch.

Factors

Evaluation considers:

  • Video content diversity
  • Temporal complexity
  • Visual quality variations
  • Classification difficulty across different classes

Metrics

  • Primary metric: Classification Accuracy (computed as in the sketch after this list)
  • Validation: Per-epoch validation accuracy
  • Final evaluation: Test set accuracy
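
A minimal accuracy helper consistent with these metrics might look like the following; data_loader is a hypothetical DataLoader yielding processed batches that include a labels tensor.

import torch

def evaluate_accuracy(model, data_loader):
    """Top-1 classification accuracy over a validation or test split."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for batch in data_loader:
            labels = batch.pop("labels").to(model.device)
            batch = {k: v.to(model.device) for k, v in batch.items()}
            logits = model(**batch).logits
            correct += (logits.argmax(-1) == labels).sum().item()
            total += labels.numel()
    return correct / total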

Results

The model's performance is monitored through:

  • Training loss progression with gradient accumulation
  • Validation accuracy per epoch
  • Final test accuracy
  • TensorBoard logging for comprehensive monitoring (a minimal logging sketch follows)
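
TensorBoard logging can be wired up with PyTorch's standard SummaryWriter; the snippet below is a generic sketch with placeholder values and a hypothetical log directory, not the exact logging code used for this model.

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/vjepa2-finetune")  # hypothetical log directory

for epoch in range(5):
    train_loss = 1.0 / (epoch + 1)      # placeholder; use the real epoch loss
    val_accuracy = 0.5 + 0.08 * epoch   # placeholder; use evaluate_accuracy(...)
    writer.add_scalar("train/loss", train_loss, epoch)
    writer.add_scalar("val/accuracy", val_accuracy, epoch)
writer.close()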

Model Examination

The model uses a frozen VJEPA2 backbone for feature extraction, with only the classification head being trained; the snippet after this list verifies the parameter split. This approach:

  • Preserves pre-trained video understanding capabilities
  • Reduces computational requirements
  • Prevents overfitting on smaller datasets
  • Enables efficient domain adaptation
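
Continuing from the loading and freezing snippets above, the frozen/trainable split can be checked directly:

frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Frozen: {frozen:,} | Trainable: {trainable:,}")  # head is a small fraction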

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: NVIDIA GPU (CUDA-enabled)
  • Hours used: Dependent on dataset size and training configuration
  • Training efficiency: Optimized through gradient accumulation and backbone freezing
  • Carbon Emitted: Lower than full fine-tuning, since only the classification head is updated

Technical Specifications

Model Architecture and Objective

  • Base Architecture: VJEPA2 (Video Joint Embedding Predictive Architecture)
  • Model Size: ViT-Large with 16-frame processing capability
  • Input Resolution: 256x256 pixels
  • Temporal Sampling: 16 frames per video
  • Classification Head: Custom layer adapted to target classes
  • Objective: Cross-entropy loss for multi-class classification (illustrated in the sketch below)
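
To see the objective in action, this hedged sketch runs a single forward pass on 16 placeholder frames and supplies a label so the model returns its cross-entropy loss; the label value is arbitrary and the frames are random stand-ins for a real clip.

import numpy as np
import torch
from transformers import VJEPA2VideoProcessor, VJEPA2ForVideoClassification

name = "qubvel-hf/vjepa2-vitl-fpc16-256-ssv2"
processor = VJEPA2VideoProcessor.from_pretrained(name)
model = VJEPA2ForVideoClassification.from_pretrained(name)

# 16 placeholder frames; the processor resizes them to the 256x256 model input
frames = np.random.randint(0, 256, (16, 256, 256, 3), dtype=np.uint8)
inputs = processor(frames, return_tensors="pt")

labels = torch.tensor([0])  # arbitrary class index for illustration
outputs = model(**inputs, labels=labels)
print(outputs.loss, outputs.logits.shape)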

Compute Infrastructure

Hardware

  • GPU: NVIDIA CUDA-compatible GPU
  • Memory: Sufficient VRAM for model and gradient accumulation
  • Compute Capability: CUDA support required

Software

  • Framework: PyTorch
  • Library: Transformers (Hugging Face)
  • Dependencies:
    • torch
    • transformers (provides VJEPA2VideoProcessor and VJEPA2ForVideoClassification)

Citation

BibTeX:

@article{bardes2024revisiting,
  title={Revisiting Feature Prediction for Learning Visual Representations from Video},
  author={Bardes, Adrien and Garrido, Quentin and Ponce, Jean and Chen, Xinlei and Rabbat, Michael and LeCun, Yann and Assran, Mahmoud and Ballas, Nicolas},
  journal={arXiv preprint arXiv:2404.08471},
  year={2024}
}

APA:

Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., & Ballas, N. (2024). Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471.

Glossary

  • VJEPA2: Video Joint Embedding Predictive Architecture, second version
  • Gradient Accumulation: Technique to simulate larger batch sizes by accumulating gradients over multiple steps
  • Backbone Freezing: Training strategy where pre-trained layers are frozen and only task-specific layers are trained
  • Video Classification: Task of assigning categorical labels to video sequences

More Information

For more details on the VJEPA2 architecture and training methodology, refer to the original paper and the base model documentation.

Model Card Authors

Yiqiao Yin

Model Card Contact

For questions or issues regarding this model, please contact the model author or create an issue in the model repository.
