Model Card for VJEPA2 Fine-tuned Video Classification Model
This model is a fine-tuned version of Facebook's VJEPA2 (Video Joint Embedding Predictive Architecture, version 2) for video classification tasks. It was trained efficiently by freezing the pre-trained backbone and accumulating gradients across batches.
Model Details
Model Description
This is a fine-tuned VJEPA2 model specifically adapted for video classification tasks. The model leverages the pre-trained VJEPA2 backbone with a custom classification head, trained using efficient fine-tuning techniques including backbone freezing and gradient accumulation.
- Developed by: Yiqiao Yin
- Funded by: Yiqiao Yin
- Model type: Video Classification
- Language(s) (NLP): English
- License: Apache 2.0
- Finetuned from model: qubvel-hf/vjepa2-vitl-fpc16-256-ssv2
Model Sources
- Repository: [More Information Needed]
- Paper: Revisiting Feature Prediction for Learning Visual Representations from Video (V-JEPA)
- Base Model: qubvel-hf/vjepa2-vitl-fpc16-256-ssv2
Uses
Direct Use
This model can be directly used for video classification tasks. It processes video inputs and outputs class predictions based on the learned representations from the VJEPA2 backbone.
Downstream Use
The model can be further fine-tuned for specific video understanding tasks such as:
- Action recognition
- Video content classification
- Temporal activity detection
- Video scene understanding
Out-of-Scope Use
This model is not intended for:
- Real-time video processing applications requiring sub-second inference
- High-resolution video analysis beyond the training resolution
- Audio-based video classification (visual features only)
- Video generation or synthesis tasks
Bias, Risks, and Limitations
The model inherits biases from the original VJEPA2 pre-training data and may exhibit performance variations across different video domains, lighting conditions, and demographic representations in video content.
Recommendations
Users should evaluate the model's performance on their specific use case and consider additional fine-tuning if the target domain differs significantly from the training data. Monitor for potential biases in video content classification across different demographic groups.
How to Get Started with the Model
Use the code below to get started with the model:
import torch
from transformers import VJEPA2VideoProcessor, VJEPA2ForVideoClassification

# Define the label mappings for your classes (example values shown)
id2label = {0: "class_a", 1: "class_b"}
label2id = {label: idx for idx, label in id2label.items()}

# Load the processor and the model with a freshly initialized classification head
model_name = "qubvel-hf/vjepa2-vitl-fpc16-256-ssv2"
processor = VJEPA2VideoProcessor.from_pretrained(model_name)
model = VJEPA2ForVideoClassification.from_pretrained(
    model_name,
    torch_dtype=torch.float32,
    label2id=label2id,
    id2label=id2label,
    ignore_mismatched_sizes=True,  # the head size differs from the SSv2 checkpoint
).to("cuda")
model.eval()

# video_data: the video frames, e.g. a numpy array of shape (num_frames, height, width, channels)
inputs = processor(video_data, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
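The predicted class name can then be recovered from the model's label mapping, for example:

predicted_id = predictions.argmax(-1).item()
print(model.config.id2label[predicted_id])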
Training Details
Training Data
The model was fine-tuned on a custom video classification dataset. The specific dataset details depend on the user's implementation and target classification task.
Training Procedure
Preprocessing
Videos are processed using the VJEPA2VideoProcessor, which handles:
- Video frame extraction and normalization
- Temporal sampling
- Spatial resizing and augmentation
- Tensor conversion for model input
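As a minimal illustration of the expected input format (a sketch using the processor loaded above; a random 16-frame clip stands in for real decoded video frames):

import numpy as np

# A stand-in clip: 16 RGB frames at 256x256, shaped (num_frames, height, width, channels)
video_data = np.random.randint(0, 256, size=(16, 256, 256, 3), dtype=np.uint8)
inputs = processor(video_data, return_tensors="pt")
print({k: tuple(v.shape) for k, v in inputs.items()})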
Training Hyperparameters
- Training regime: FP32 precision
- Optimizer: Adam
- Learning rate: 1e-5
- Epochs: 5
- Gradient accumulation steps: 4
- Backbone freezing: VJEPA2 backbone parameters frozen, only classification head trained
- Batch processing: Gradient accumulation to achieve a larger effective batch size
Training Configuration
# Freeze all backbone parameters so only the classification head receives updates
for param in model.vjepa2.parameters():
    param.requires_grad = False

# Collect the remaining trainable parameters (the classification head)
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-5)
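A minimal training loop matching the hyperparameters above, with gradient accumulation over 4 steps, might look like this (a sketch; train_loader and its batch format are assumptions):

accumulation_steps = 4
model.train()
for epoch in range(5):
    optimizer.zero_grad()
    for step, batch in enumerate(train_loader):
        batch = {k: v.to(model.device) for k, v in batch.items()}
        # When labels are included in the batch, the model returns a cross-entropy loss
        outputs = model(**batch)
        loss = outputs.loss / accumulation_steps  # scale so accumulated gradients average correctly
        loss.backward()
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()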
Speeds, Sizes, Times
- Training time: Depends on dataset size and hardware
- GPU memory: Optimized through gradient accumulation
- Effective batch size: Original batch size × 4 (due to gradient accumulation)
Evaluation
Testing Data, Factors & Metrics
Testing Data
The model is evaluated on held-out test sets from the training dataset, with validation performed after each epoch.
Factors
Evaluation considers:
- Video content diversity
- Temporal complexity
- Visual quality variations
- Classification difficulty across different classes
Metrics
- Primary metric: Classification Accuracy
- Validation: Per-epoch validation accuracy
- Final evaluation: Test set accuracy
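Test accuracy can be computed with a loop along these lines (illustrative; test_loader is an assumption):

correct = total = 0
model.eval()
with torch.no_grad():
    for batch in test_loader:
        labels = batch.pop("labels").to(model.device)
        batch = {k: v.to(model.device) for k, v in batch.items()}
        preds = model(**batch).logits.argmax(-1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
print(f"Test accuracy: {correct / total:.4f}")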
Results
The model's performance is monitored through:
- Training loss progression with gradient accumulation
- Validation accuracy per epoch
- Final test accuracy
- TensorBoard logging for comprehensive monitoring
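Per-epoch scalars can be written to TensorBoard, for example (a sketch; train_one_epoch and evaluate are hypothetical helpers wrapping the loops above):

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/vjepa2-finetune")
for epoch in range(5):
    train_loss = train_one_epoch(model, train_loader, optimizer)  # hypothetical helper
    val_accuracy = evaluate(model, val_loader)  # hypothetical helper
    writer.add_scalar("train/loss", train_loss, epoch)
    writer.add_scalar("val/accuracy", val_accuracy, epoch)
writer.close()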
Model Examination
The model uses a frozen VJEPA2 backbone for feature extraction, with only the classification head being trained. This approach:
- Preserves pre-trained video understanding capabilities
- Reduces computational requirements
- Prevents overfitting on smaller datasets
- Enables efficient domain adaptation
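One quick way to confirm that only the head is trainable is to count parameters:

total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable: {trainable_params:,} / {total_params:,} ({100 * trainable_params / total_params:.2f}%)")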
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: NVIDIA GPU (CUDA-enabled)
- Hours used: Dependent on dataset size and training configuration
- Training efficiency: Optimized through gradient accumulation and backbone freezing
- Carbon Emitted: Reduced relative to full fine-tuning, since only the classification head is updated
Technical Specifications
Model Architecture and Objective
- Base Architecture: VJEPA2 (Video Joint Embedding Predictive Architecture)
- Model Size: ViT-Large with 16-frame processing capability
- Input Resolution: 256x256 pixels
- Temporal Sampling: 16 frames per video
- Classification Head: Custom layer adapted to target classes
- Objective: Cross-entropy loss for multi-class classification
Compute Infrastructure
Hardware
- GPU: NVIDIA CUDA-compatible GPU
- Memory: Sufficient VRAM for model and gradient accumulation
- Compute Capability: CUDA support required
Software
- Framework: PyTorch
- Library: Transformers (Hugging Face)
- Dependencies:
- torch
- transformers (provides VJEPA2VideoProcessor and VJEPA2ForVideoClassification)
Citation
BibTeX:
@article{bardes2024revisiting,
  title={Revisiting Feature Prediction for Learning Visual Representations from Video},
  author={Bardes, Adrien and Garrido, Quentin and Ponce, Jean and Chen, Xinlei and Rabbat, Michael and LeCun, Yann and Assran, Mahmoud and Ballas, Nicolas},
  journal={arXiv preprint arXiv:2404.08471},
  year={2024}
}
APA:
Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., & Ballas, N. (2024). Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471.
Glossary
- VJEPA2: Video Joint Embedding Predictive Architecture, second version
- Gradient Accumulation: Technique to simulate larger batch sizes by accumulating gradients over multiple steps
- Backbone Freezing: Training strategy where pre-trained layers are frozen and only task-specific layers are trained
- Video Classification: Task of assigning categorical labels to video sequences
More Information
For more details on the VJEPA2 architecture and training methodology, refer to the original paper and the base model documentation.
Model Card Authors
Yiqiao Yin
Model Card Contact
For questions or issues regarding this model, please contact the model author or create an issue in the model repository.