---
|
library_name: transformers |
|
tags: |
|
- video-classification |
|
- vjepa2 |
|
- computer-vision |
|
- video-understanding |
|
- fine-tuned |
|
- pytorch |
|
--- |
|
|
|
# Model Card for VJEPA2 Fine-tuned Video Classification Model |
|
|
|
This model is a fine-tuned version of Meta AI's VJEPA2 (Video Joint-Embedding Predictive Architecture) for video classification tasks. It was fine-tuned with a frozen backbone and gradient accumulation for efficient training.
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
This is a fine-tuned VJEPA2 model adapted for video classification. It pairs the pre-trained VJEPA2 backbone with a custom classification head; during fine-tuning the backbone is kept frozen so only the head is updated, and gradient accumulation provides a larger effective batch size.
|
|
|
- **Developed by:** Yiqiao Yin |
|
- **Funded by:** Yiqiao Yin |
|
- **Model type:** Video Classification |
|
- **Language(s) (NLP):** Not applicable (vision-only model)
|
- **License:** Apache 2.0 |
|
- **Finetuned from model:** qubvel-hf/vjepa2-vitl-fpc16-256-ssv2 |
|
|
|
### Model Sources |
|
|
|
- **Repository:** [More Information Needed] |
|
- **Paper:** [Revisiting Feature Prediction for Learning Visual Representations from Video (V-JEPA)](https://arxiv.org/abs/2404.08471)
|
- **Base Model:** [qubvel-hf/vjepa2-vitl-fpc16-256-ssv2](https://huggingface.co/qubvel-hf/vjepa2-vitl-fpc16-256-ssv2) |
|
|
|
## Uses |
|
|
|
### Direct Use |
|
|
|
This model can be directly used for video classification tasks. It processes video inputs and outputs class predictions based on the learned representations from the VJEPA2 backbone. |
|
|
|
### Downstream Use |
|
|
|
The model can be further fine-tuned for specific video understanding tasks such as: |
|
- Action recognition |
|
- Video content classification |
|
- Temporal activity detection |
|
- Video scene understanding |
|
|
|
### Out-of-Scope Use |
|
|
|
This model is not intended for: |
|
- Real-time video processing applications requiring sub-second inference |
|
- High-resolution video analysis beyond the training resolution |
|
- Audio-based video classification (visual features only) |
|
- Video generation or synthesis tasks |
|
|
|
## Bias, Risks, and Limitations |
|
|
|
The model inherits biases from the original VJEPA2 pre-training data and may exhibit performance variations across different video domains, lighting conditions, and demographic representations in video content. |
|
|
|
### Recommendations |
|
|
|
Users should evaluate the model's performance on their specific use case and consider additional fine-tuning if the target domain differs significantly from the training data. Monitor for potential biases in video content classification across different demographic groups. |
|
|
|
## How to Get Started with the Model |
|
|
|
Use the code below to get started with the model: |
|
|
|
```python
import torch
from transformers import VJEPA2VideoProcessor, VJEPA2ForVideoClassification

# Example label mappings -- replace these with the classes of your dataset
label2id = {"class_a": 0, "class_b": 1}
id2label = {v: k for k, v in label2id.items()}

# Base checkpoint used for fine-tuning; replace with this repo's ID to load
# the fine-tuned weights
model_name = "qubvel-hf/vjepa2-vitl-fpc16-256-ssv2"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model and processor
processor = VJEPA2VideoProcessor.from_pretrained(model_name)
model = VJEPA2ForVideoClassification.from_pretrained(
    model_name,
    torch_dtype=torch.float32,
    label2id=label2id,
    id2label=id2label,
    ignore_mismatched_sizes=True,  # the new head's size differs from the checkpoint's
).to(device)

# `video_data` holds the decoded frames -- see the loading sketch below
inputs = processor(video_data, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(**inputs)
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(model.config.id2label[int(probs.argmax(-1))])
```
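
The snippet above leaves `video_data` undefined. A minimal way to produce it is sketched below, assuming `torchvision` is installed and that `video.mp4` is a local file (both are illustrative assumptions, not requirements of this model):

```python
import torch
from torchvision.io import read_video

# Decode the full clip to a (T, H, W, C) uint8 tensor; the audio track is ignored
frames, _, _ = read_video("video.mp4", pts_unit="sec", output_format="THWC")

# Uniformly sample 16 frames, matching the checkpoint's frames-per-clip (fpc16)
indices = torch.linspace(0, frames.shape[0] - 1, steps=16).long()
video_data = frames[indices]
```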
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
The model was fine-tuned on a custom video classification dataset. The specific dataset details depend on the user's implementation and target classification task. |
|
|
|
### Training Procedure |
|
|
|
#### Preprocessing |
|
|
|
Videos are processed using the `VJEPA2VideoProcessor` (a short output-inspection sketch follows this list), which handles:
|
- Video frame extraction and normalization |
|
- Temporal sampling |
|
- Spatial resizing and augmentation |
|
- Tensor conversion for model input |
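
To see what the processor produces for a clip, you can inspect its output tensors directly; this sketch reuses `processor` and `video_data` from the snippets above and makes no assumption about the exact output key names:

```python
# Run the processor on one clip and inspect what it returns
inputs = processor(video_data, return_tensors="pt")

for name, tensor in inputs.items():
    print(name, tuple(tensor.shape), tensor.dtype)
```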
|
|
|
#### Training Hyperparameters |
|
|
|
- **Training regime:** FP32 precision |
|
- **Optimizer:** Adam |
|
- **Learning rate:** 1e-5 |
|
- **Epochs:** 5 |
|
- **Gradient accumulation steps:** 4 (effective batch size = per-device batch size × 4)

- **Backbone freezing:** VJEPA2 backbone parameters frozen; only the classification head is trained
|
|
|
#### Training Configuration |
|
|
|
```python
# Freeze all backbone parameters (`model.vjepa2` is the VJEPA2 encoder
# inside VJEPA2ForVideoClassification)
for param in model.vjepa2.parameters():
    param.requires_grad = False

# Optimize only the remaining (classification head) parameters
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-5)
```
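
For context, here is a sketch of a training loop combining the frozen-backbone setup above with 4-step gradient accumulation; `train_loader` is an illustrative assumption (a `DataLoader` yielding processor outputs plus a `labels` tensor), not part of this repository:

```python
accumulation_steps = 4
num_epochs = 5
model.train()

for epoch in range(num_epochs):
    optimizer.zero_grad()
    for step, batch in enumerate(train_loader):
        batch = {k: v.to(model.device) for k, v in batch.items()}
        outputs = model(**batch)  # with `labels` present, a cross-entropy loss is returned

        # Scale the loss so the accumulated gradient matches a larger-batch update
        loss = outputs.loss / accumulation_steps
        loss.backward()

        # Step the optimizer once every `accumulation_steps` mini-batches
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```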
|
|
|
#### Speeds, Sizes, Times |
|
|
|
- **Training time:** Depends on dataset size and hardware

- **GPU memory:** Kept modest by combining small per-step batches with gradient accumulation

- **Effective batch size:** Per-device batch size × 4 (gradient accumulation steps)
|
|
|
## Evaluation |
|
|
|
### Testing Data, Factors & Metrics |
|
|
|
#### Testing Data |
|
|
|
The model is evaluated on held-out test sets from the training dataset, with validation performed after each epoch. |
|
|
|
#### Factors |
|
|
|
Evaluation considers: |
|
- Video content diversity |
|
- Temporal complexity |
|
- Visual quality variations |
|
- Classification difficulty across different classes |
|
|
|
#### Metrics |
|
|
|
- **Primary metric:** Classification Accuracy |
|
- **Validation:** Per-epoch validation accuracy |
|
- **Final evaluation:** Test set accuracy (a minimal computation is sketched below)
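
A minimal accuracy computation over a held-out split, under the same illustrative assumptions as the training sketch above (`test_loader` is hypothetical):

```python
model.eval()
correct = total = 0

with torch.no_grad():
    for batch in test_loader:  # same batch format as training
        labels = batch.pop("labels").to(model.device)
        batch = {k: v.to(model.device) for k, v in batch.items()}
        preds = model(**batch).logits.argmax(dim=-1)
        correct += (preds == labels).sum().item()
        total += labels.numel()

print(f"Test accuracy: {correct / total:.4f}")
```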
|
|
|
### Results |
|
|
|
The model's performance is monitored through: |
|
- Training loss progression with gradient accumulation |
|
- Validation accuracy per epoch |
|
- Final test accuracy |
|
- TensorBoard logging for comprehensive monitoring |
|
|
|
## Model Examination |
|
|
|
The model uses a frozen VJEPA2 backbone for feature extraction, with only the classification head being trained; the sketch after this list verifies that split. This approach:
|
- Preserves pre-trained video understanding capabilities |
|
- Reduces computational requirements |
|
- Prevents overfitting on smaller datasets |
|
- Enables efficient domain adaptation |
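
One way to verify the frozen/trainable split, reusing `model` from the snippets above:

```python
# Count parameters by trainability to confirm that only the head updates
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
print(f"Trainable: {trainable:,} | Frozen: {frozen:,}")
```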
|
|
|
## Environmental Impact |
|
|
|
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). |
|
|
|
- **Hardware Type:** NVIDIA GPU (CUDA-enabled) |
|
- **Hours used:** Dependent on dataset size and training configuration |
|
- **Training efficiency:** Optimized through gradient accumulation and backbone freezing |
|
- **Carbon Emitted:** Not measured; expected to be lower than full fine-tuning since only the classification head is updated (see the sketch below for measuring your own runs)
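
For a concrete estimate on your own runs, one option is the third-party `codecarbon` package; it is an illustrative suggestion here, not a dependency of this model:

```python
from codecarbon import EmissionsTracker

tracker = EmissionsTracker()  # samples energy use of the current process
tracker.start()
# ... run the fine-tuning loop from the Training Configuration section ...
emissions_kg = tracker.stop()  # returns the estimated kg of CO2-equivalent
print(f"Estimated emissions: {emissions_kg:.4f} kg CO2eq")
```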
|
|
|
## Technical Specifications |
|
|
|
### Model Architecture and Objective |
|
|
|
- **Base Architecture:** VJEPA2 (Video Joint Embedding Predictive Architecture) |
|
- **Model Size:** ViT-Large (ViT-L) backbone
|
- **Input Resolution:** 256x256 pixels |
|
- **Temporal Sampling:** 16 frames per video |
|
- **Classification Head:** Custom layer adapted to target classes |
|
- **Objective:** Cross-entropy loss for multi-class classification |
|
|
|
### Compute Infrastructure |
|
|
|
#### Hardware |
|
|
|
- **GPU:** NVIDIA CUDA-compatible GPU |
|
- **Memory:** Sufficient VRAM for model and gradient accumulation |
|
- **Compute Capability:** CUDA support required |
|
|
|
#### Software |
|
|
|
- **Framework:** PyTorch |
|
- **Library:** Transformers (Hugging Face) |
|
- **Dependencies:**

  - torch

  - transformers

- **Key classes (from transformers):**

  - VJEPA2VideoProcessor

  - VJEPA2ForVideoClassification
|
|
|
## Citation |
|
|
|
**BibTeX:** |
|
|
|
```bibtex
@article{bardes2024revisiting,
  title={Revisiting Feature Prediction for Learning Visual Representations from Video},
  author={Bardes, Adrien and Garrido, Quentin and Ponce, Jean and Chen, Xinlei and Rabbat, Michael and LeCun, Yann and Assran, Mahmoud and Ballas, Nicolas},
  journal={arXiv preprint arXiv:2404.08471},
  year={2024}
}
```
|
|
|
**APA:** |
|
|
|
Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., & Ballas, N. (2024). Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471.
|
|
|
## Glossary |
|
|
|
- **VJEPA2:** Video Joint Embedding Predictive Architecture, second version |
|
- **Gradient Accumulation:** Technique to simulate larger batch sizes by accumulating gradients over multiple steps |
|
- **Backbone Freezing:** Training strategy where pre-trained layers are frozen and only task-specific layers are trained |
|
- **Video Classification:** Task of assigning categorical labels to video sequences |
|
|
|
## More Information |
|
|
|
For more details on the VJEPA2 architecture and training methodology, refer to the original paper and the base model documentation. |
|
|
|
## Model Card Authors |
|
|
|
Yiqiao Yin |
|
|
|
## Model Card Contact |
|
|
|
For questions or issues regarding this model, please contact the model author or create an issue in the model repository. |