Model Card for VJEPA2 Fine-tuned Video Classification Model

This model is a fine-tuned version of Meta AI's VJEPA2 (Video Joint Embedding Predictive Architecture) for video classification tasks. It was fine-tuned efficiently by freezing the pre-trained backbone and using gradient accumulation.

Model Details

Model Description

This is a VJEPA2 model adapted for video classification. It pairs the pre-trained VJEPA2 backbone with a custom classification head; during fine-tuning the backbone was kept frozen and gradients were accumulated over several steps to keep memory use low.

  • Developed by: Yiqiao Yin
  • Funded by: Yiqiao Yin
  • Model type: Video Classification
  • Language(s) (NLP): Not applicable (vision-only model)
  • License: Apache 2.0
  • Finetuned from model: qubvel-hf/vjepa2-vitl-fpc16-256-ssv2

Model Sources

  • Base model: https://huggingface.co/qubvel-hf/vjepa2-vitl-fpc16-256-ssv2

Uses

Direct Use

This model can be directly used for video classification tasks. It processes video inputs and outputs class predictions based on the learned representations from the VJEPA2 backbone.

Downstream Use

The model can be further fine-tuned for specific video understanding tasks such as:

  • Action recognition
  • Video content classification
  • Temporal activity detection
  • Video scene understanding

Out-of-Scope Use

This model is not intended for:

  • Real-time video processing applications requiring sub-second inference
  • High-resolution video analysis beyond the training resolution
  • Audio-based video classification (visual features only)
  • Video generation or synthesis tasks

Bias, Risks, and Limitations

The model inherits biases from the original VJEPA2 pre-training data and may exhibit performance variations across different video domains, lighting conditions, and demographic representations in video content.

Recommendations

Users should evaluate the model's performance on their specific use case and consider additional fine-tuning if the target domain differs significantly from the training data. Monitor for potential biases in video content classification across different demographic groups.

How to Get Started with the Model

Use the code below to get started with the model:

import numpy as np
import torch
from transformers import VJEPA2VideoProcessor, VJEPA2ForVideoClassification

# Label mappings for your classification task (two placeholder classes here)
label2id = {"class_a": 0, "class_b": 1}
id2label = {v: k for k, v in label2id.items()}

# Load the model and processor
model_name = "qubvel-hf/vjepa2-vitl-fpc16-256-ssv2"
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = VJEPA2VideoProcessor.from_pretrained(model_name)
model = VJEPA2ForVideoClassification.from_pretrained(
    model_name,
    torch_dtype=torch.float32,
    label2id=label2id,
    id2label=id2label,
    ignore_mismatched_sizes=True,  # replaces the original SSv2 classification head
).to(device)

# video_data: decoded frames as (num_frames, H, W, 3); random placeholder here
video_data = np.random.randint(0, 256, (16, 256, 256, 3), dtype=np.uint8)

# Process video and get predictions
inputs = processor(video_data, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(**inputs)
probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(id2label[probabilities.argmax(-1).item()])

Training Details

Training Data

The model was fine-tuned on a custom video classification dataset. The specific dataset details depend on the user's implementation and target classification task.

Training Procedure

Preprocessing

Videos are processed using the VJEPA2VideoProcessor (see the sketch after this list), which handles:

  • Video frame extraction and normalization
  • Temporal sampling
  • Spatial resizing and augmentation
  • Tensor conversion for model input
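
As a rough illustration of this preprocessing, the sketch below runs placeholder frames (random pixel data standing in for a decoded clip) through the processor and prints the resulting tensor shapes; the exact output keys and shapes come from the processor's configuration.

import numpy as np
from transformers import VJEPA2VideoProcessor

processor = VJEPA2VideoProcessor.from_pretrained("qubvel-hf/vjepa2-vitl-fpc16-256-ssv2")

# Placeholder for decoded video frames: (num_frames, height, width, channels)
frames = np.random.randint(0, 256, (16, 480, 640, 3), dtype=np.uint8)

inputs = processor(frames, return_tensors="pt")
for name, tensor in inputs.items():
    print(name, tuple(tensor.shape))  # frames are resampled/resized for the model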

Training Hyperparameters

  • Training regime: FP32 precision
  • Optimizer: Adam
  • Learning rate: 1e-5
  • Epochs: 5
  • Gradient accumulation steps: 4
  • Backbone freezing: VJEPA2 backbone parameters frozen, only classification head trained
  • Batch processing: Gradient accumulation for an effectively larger batch size (see the loop sketch under Training Configuration)

Training Configuration

# Freeze backbone parameters
for param in model.vjepa2.parameters():
    param.requires_grad = False

# Only train classification head
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-5)
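
Putting the pieces together, here is a minimal sketch of the accumulation loop implied by the hyperparameters above. It continues from the snippet just shown (model and optimizer already defined) and assumes a hypothetical train_loader that yields processor outputs plus a labels tensor; it is an illustration, not the exact training script.

accumulation_steps = 4
num_epochs = 5

model.train()
for epoch in range(num_epochs):
    optimizer.zero_grad()
    for step, batch in enumerate(train_loader):  # hypothetical DataLoader
        batch = {k: v.to(model.device) for k, v in batch.items()}
        outputs = model(**batch)  # passing "labels" makes the model return a loss
        loss = outputs.loss / accumulation_steps  # average over micro-batches
        loss.backward()
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()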

Speeds, Sizes, Times

  • Training time: Depends on dataset size and hardware
  • GPU memory: Optimized through gradient accumulation
  • Effective batch size: Original batch size × 4 (due to gradient accumulation)

Evaluation

Testing Data, Factors & Metrics

Testing Data

The model is evaluated on held-out test sets from the training dataset, with validation performed after each epoch.

Factors

Evaluation considers:

  • Video content diversity
  • Temporal complexity
  • Visual quality variations
  • Classification difficulty across different classes

Metrics

  • Primary metric: Classification Accuracy (computed as in the sketch after this list)
  • Validation: Per-epoch validation accuracy
  • Final evaluation: Test set accuracy
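
A minimal accuracy helper consistent with these metrics might look like the following; data_loader is a hypothetical DataLoader yielding processed batches that include a labels tensor.

import torch

def evaluate_accuracy(model, data_loader):
    """Top-1 classification accuracy over a validation or test split."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for batch in data_loader:
            labels = batch.pop("labels").to(model.device)
            batch = {k: v.to(model.device) for k, v in batch.items()}
            logits = model(**batch).logits
            correct += (logits.argmax(-1) == labels).sum().item()
            total += labels.numel()
    return correct / total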

Results

The model's performance is monitored through:

  • Training loss progression with gradient accumulation
  • Validation accuracy per epoch
  • Final test accuracy
  • TensorBoard logging for comprehensive monitoring (a minimal logging sketch follows)
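
TensorBoard logging can be wired up with PyTorch's standard SummaryWriter; the snippet below is a generic sketch with placeholder values and a hypothetical log directory, not the exact logging code used for this model.

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/vjepa2-finetune")  # hypothetical log directory

for epoch in range(5):
    train_loss = 1.0 / (epoch + 1)      # placeholder; use the real epoch loss
    val_accuracy = 0.5 + 0.08 * epoch   # placeholder; use evaluate_accuracy(...)
    writer.add_scalar("train/loss", train_loss, epoch)
    writer.add_scalar("val/accuracy", val_accuracy, epoch)
writer.close()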

Model Examination

The model uses a frozen VJEPA2 backbone for feature extraction, with only the classification head being trained; the snippet after this list verifies the parameter split. This approach:

  • Preserves pre-trained video understanding capabilities
  • Reduces computational requirements
  • Prevents overfitting on smaller datasets
  • Enables efficient domain adaptation
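
Continuing from the loading and freezing snippets above, the frozen/trainable split can be checked directly:

frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Frozen: {frozen:,} | Trainable: {trainable:,}")  # head is a small fraction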

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: NVIDIA GPU (CUDA-enabled)
  • Hours used: Dependent on dataset size and training configuration
  • Training efficiency: Optimized through gradient accumulation and backbone freezing
  • Carbon Emitted: Lower than full fine-tuning, since only the classification head is updated

Technical Specifications

Model Architecture and Objective

  • Base Architecture: VJEPA2 (Video Joint Embedding Predictive Architecture)
  • Model Size: ViT-Large with 16-frame processing capability
  • Input Resolution: 256x256 pixels
  • Temporal Sampling: 16 frames per video
  • Classification Head: Custom layer adapted to target classes
  • Objective: Cross-entropy loss for multi-class classification (illustrated in the sketch below)
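
To see the objective in action, this hedged sketch runs a single forward pass on 16 placeholder frames and supplies a label so the model returns its cross-entropy loss; the label value is arbitrary and the frames are random stand-ins for a real clip.

import numpy as np
import torch
from transformers import VJEPA2VideoProcessor, VJEPA2ForVideoClassification

name = "qubvel-hf/vjepa2-vitl-fpc16-256-ssv2"
processor = VJEPA2VideoProcessor.from_pretrained(name)
model = VJEPA2ForVideoClassification.from_pretrained(name)

# 16 placeholder frames; the processor resizes them to the 256x256 model input
frames = np.random.randint(0, 256, (16, 256, 256, 3), dtype=np.uint8)
inputs = processor(frames, return_tensors="pt")

labels = torch.tensor([0])  # arbitrary class index for illustration
outputs = model(**inputs, labels=labels)
print(outputs.loss, outputs.logits.shape)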

Compute Infrastructure

Hardware

  • GPU: NVIDIA CUDA-compatible GPU
  • Memory: Sufficient VRAM for model and gradient accumulation
  • Compute Capability: CUDA support required

Software

  • Framework: PyTorch
  • Library: Transformers (Hugging Face)
  • Dependencies:
    • torch
    • transformers (provides VJEPA2VideoProcessor and VJEPA2ForVideoClassification)

Citation

BibTeX:

@article{bardes2024revisiting,
  title={Revisiting Feature Prediction for Learning Visual Representations from Video},
  author={Bardes, Adrien and Garrido, Quentin and Ponce, Jean and Chen, Xinlei and Rabbat, Michael and LeCun, Yann and Assran, Mahmoud and Ballas, Nicolas},
  journal={arXiv preprint arXiv:2404.08471},
  year={2024}
}

APA:

Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., & Ballas, N. (2024). Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471.

Glossary

  • VJEPA2: Video Joint Embedding Predictive Architecture, second version
  • Gradient Accumulation: Technique to simulate larger batch sizes by accumulating gradients over multiple steps
  • Backbone Freezing: Training strategy where pre-trained layers are frozen and only task-specific layers are trained
  • Video Classification: Task of assigning categorical labels to video sequences

More Information

For more details on the VJEPA2 architecture and training methodology, refer to the original paper and the base model documentation.

Model Card Authors

Yiqiao Yin

Model Card Contact

For questions or issues regarding this model, please contact the model author or create an issue in the model repository.
