---
library_name: transformers
tags:
- video-classification
- vjepa2
- computer-vision
- video-understanding
- fine-tuned
- pytorch
---
# Model Card for VJEPA2 Fine-tuned Video Classification Model
This model is a fine-tuned version of Meta AI's V-JEPA 2 (Video Joint-Embedding Predictive Architecture) for video classification tasks. It was fine-tuned with the backbone frozen and gradient accumulation enabled for memory-efficient training.
## Model Details
### Model Description
This is a fine-tuned VJEPA2 model specifically adapted for video classification tasks. The model leverages the pre-trained VJEPA2 backbone with a custom classification head, trained using efficient fine-tuning techniques including backbone freezing and gradient accumulation.
- **Developed by:** Yiqiao Yin
- **Funded by:** Yiqiao Yin
- **Model type:** Video Classification
- **Language(s) (NLP):** Not applicable (vision-only model)
- **License:** Apache 2.0
- **Finetuned from model:** qubvel-hf/vjepa2-vitl-fpc16-256-ssv2
### Model Sources
- **Repository:** [More Information Needed]
- **Paper:** [V-JEPA: Revisiting Feature Prediction for Learning Visual Representations from Video](https://arxiv.org/abs/2404.08471)
- **Base Model:** [qubvel-hf/vjepa2-vitl-fpc16-256-ssv2](https://huggingface.co/qubvel-hf/vjepa2-vitl-fpc16-256-ssv2)
## Uses
### Direct Use
This model can be directly used for video classification tasks. It processes video inputs and outputs class predictions based on the learned representations from the VJEPA2 backbone.
### Downstream Use
The model can be further fine-tuned for specific video understanding tasks such as:
- Action recognition
- Video content classification
- Temporal activity detection
- Video scene understanding
### Out-of-Scope Use
This model is not intended for:
- Real-time video processing applications requiring sub-second inference
- High-resolution video analysis beyond the training resolution
- Audio-based video classification (visual features only)
- Video generation or synthesis tasks
## Bias, Risks, and Limitations
The model inherits biases from the original VJEPA2 pre-training data and may exhibit performance variations across different video domains, lighting conditions, and demographic representations in video content.
### Recommendations
Users should evaluate the model's performance on their specific use case and consider additional fine-tuning if the target domain differs significantly from the training data. Monitor for potential biases in video content classification across different demographic groups.
## How to Get Started with the Model
Use the code below to get started with the model:
```python
import torch
from transformers import VJEPA2VideoProcessor, VJEPA2ForVideoClassification

# Example label mappings -- replace these with the classes for your task
label2id = {"class_a": 0, "class_b": 1}
id2label = {v: k for k, v in label2id.items()}

# Load the processor and model, replacing the pre-trained classification
# head with one sized for the target classes
model_name = "qubvel-hf/vjepa2-vitl-fpc16-256-ssv2"
processor = VJEPA2VideoProcessor.from_pretrained(model_name)
model = VJEPA2ForVideoClassification.from_pretrained(
    model_name,
    torch_dtype=torch.float32,
    label2id=label2id,
    id2label=id2label,
    ignore_mismatched_sizes=True,  # allow the new head to differ in size
).to("cuda")

# `video_data` should be a sequence of frames, e.g. a NumPy array of
# shape (num_frames, height, width, 3)
inputs = processor(video_data, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
```
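To turn these probabilities into a label, take the arg-max over classes and look it up in the model's `id2label` mapping:

```python
# Map the highest-probability class index back to its human-readable label
pred_id = predictions.argmax(dim=-1).item()
print(model.config.id2label[pred_id])
```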
## Training Details
### Training Data
The model was fine-tuned on a custom video classification dataset. The specific dataset details depend on the user's implementation and target classification task.
### Training Procedure
#### Preprocessing
Videos are processed using the VJEPA2VideoProcessor, which handles:
- Video frame extraction and normalization
- Temporal sampling (sketched below)
- Spatial resizing and augmentation
- Tensor conversion for model input
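For illustration, here is a minimal sketch of the uniform temporal sampling step (assuming the clip is already decoded into a NumPy array of frames; resizing and normalization are left to the processor):

```python
import numpy as np

def sample_frames(frames: np.ndarray, num_frames: int = 16) -> np.ndarray:
    """Uniformly sample `num_frames` frames from a clip of shape (T, H, W, 3)."""
    indices = np.linspace(0, len(frames) - 1, num=num_frames).round().astype(int)
    return frames[indices]
```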
#### Training Hyperparameters
- **Training regime:** FP32 precision
- **Optimizer:** Adam
- **Learning rate:** 1e-5
- **Epochs:** 5
- **Gradient accumulation steps:** 4
- **Backbone freezing:** VJEPA2 backbone parameters frozen, only classification head trained
- **Batch processing:** Gradient accumulation for effective larger batch size
#### Training Configuration
```python
# Freeze backbone parameters so only the classification head updates
for param in model.vjepa2.parameters():
    param.requires_grad = False

# Only train the classification head
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-5)
```
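The accumulation loop itself is not shown in the card; below is a minimal sketch, assuming a standard PyTorch `DataLoader` (`train_loader`, hypothetical) whose batches include `labels` so the model returns a cross-entropy loss:

```python
accumulation_steps = 4  # effective batch size = DataLoader batch size x 4

model.train()
optimizer.zero_grad()
for step, batch in enumerate(train_loader):
    batch = {k: v.to(model.device) for k, v in batch.items()}
    outputs = model(**batch)                   # loss is computed when labels are present
    loss = outputs.loss / accumulation_steps   # scale so accumulated gradients average out
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Dividing the loss by `accumulation_steps` keeps the accumulated gradient equal, in expectation, to the gradient of one large batch.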
#### Speeds, Sizes, Times
- **Training time:** Depends on dataset size and hardware
- **GPU memory:** Optimized through gradient accumulation
- **Effective batch size:** Original batch size × 4 (due to gradient accumulation)
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
The model is evaluated on held-out test sets from the training dataset, with validation performed after each epoch.
#### Factors
Evaluation considers:
- Video content diversity
- Temporal complexity
- Visual quality variations
- Classification difficulty across different classes
#### Metrics
- **Primary metric:** Classification accuracy (see the sketch below)
- **Validation:** Per-epoch validation accuracy
- **Final evaluation:** Test set accuracy
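A minimal sketch of how per-epoch accuracy could be computed, assuming an `eval_loader` (hypothetical) that yields processor outputs plus a `labels` tensor:

```python
import torch

@torch.no_grad()
def evaluate(model, eval_loader) -> float:
    """Top-1 classification accuracy over a dataloader."""
    model.eval()
    correct, total = 0, 0
    for batch in eval_loader:
        labels = batch.pop("labels")
        batch = {k: v.to(model.device) for k, v in batch.items()}
        preds = model(**batch).logits.argmax(dim=-1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total
```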
### Results
The model's performance is monitored through:
- Training loss progression with gradient accumulation
- Validation accuracy per epoch
- Final test accuracy
- TensorBoard logging for comprehensive monitoring
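A minimal sketch of the TensorBoard logging described above (the log directory and scalar names are illustrative):

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/vjepa2-finetune")  # hypothetical log directory

def log_step(loss: float, global_step: int) -> None:
    """Record training loss; call inside the training loop."""
    writer.add_scalar("train/loss", loss, global_step)

def log_epoch(val_accuracy: float, epoch: int) -> None:
    """Record validation accuracy; call after each epoch."""
    writer.add_scalar("val/accuracy", val_accuracy, epoch)
```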
## Model Examination
The model uses a frozen VJEPA2 backbone for feature extraction, with only the classification head being trained. This approach:
- Preserves pre-trained video understanding capabilities
- Reduces computational requirements
- Prevents overfitting on smaller datasets
- Enables efficient domain adaptation
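A quick sanity check that the backbone is actually frozen, counting trainable versus total parameters for the loaded model:

```python
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```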
## Environmental Impact
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
- **Hardware Type:** NVIDIA GPU (CUDA-enabled)
- **Hours used:** Dependent on dataset size and training configuration
- **Training efficiency:** Optimized through gradient accumulation and backbone freezing
- **Carbon Emitted:** Lower than full fine-tuning would incur, since only the classification head is trained; exact figures depend on hardware and run time
## Technical Specifications
### Model Architecture and Objective
- **Base Architecture:** VJEPA2 (Video Joint Embedding Predictive Architecture)
- **Model Size:** ViT-Large backbone, processing 16 frames per clip
- **Input Resolution:** 256x256 pixels
- **Temporal Sampling:** 16 frames per video
- **Classification Head:** Custom layer adapted to target classes
- **Objective:** Cross-entropy loss for multi-class classification
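These input specifications can be checked on the processor output from the quickstart above (this sketch assumes the `pixel_values_videos` key used by the transformers V-JEPA 2 integration; the exact dimension ordering may vary by library version):

```python
inputs = processor(video_data, return_tensors="pt")
print(inputs["pixel_values_videos"].shape)
# expected on this checkpoint: a batch of 16 frames at 256x256, e.g.
# torch.Size([1, 16, 3, 256, 256])
```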
### Compute Infrastructure
#### Hardware
- **GPU:** NVIDIA CUDA-compatible GPU
- **Memory:** Sufficient VRAM for model and gradient accumulation
- **Compute Capability:** CUDA support required
#### Software
- **Framework:** PyTorch
- **Library:** Transformers (Hugging Face)
- **Dependencies:**
  - torch
  - transformers (provides `VJEPA2VideoProcessor` and `VJEPA2ForVideoClassification`)
## Citation
**BibTeX:**
```bibtex
@article{bardes2024vjepa,
  title={Revisiting Feature Prediction for Learning Visual Representations from Video},
  author={Bardes, Adrien and Garrido, Quentin and Ponce, Jean and Chen, Xinlei and Rabbat, Michael and LeCun, Yann and Assran, Mahmoud and Ballas, Nicolas},
  journal={arXiv preprint arXiv:2404.08471},
  year={2024}
}
```
**APA:**
Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., & Ballas, N. (2024). Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471.
## Glossary
- **VJEPA2:** Video Joint Embedding Predictive Architecture, second version
- **Gradient Accumulation:** Technique to simulate larger batch sizes by accumulating gradients over multiple steps
- **Backbone Freezing:** Training strategy where pre-trained layers are frozen and only task-specific layers are trained
- **Video Classification:** Task of assigning categorical labels to video sequences
## More Information
For more details on the VJEPA2 architecture and training methodology, refer to the original paper and the base model documentation.
## Model Card Authors
Yiqiao Yin
## Model Card Contact
For questions or issues regarding this model, please contact the model author or create an issue in the model repository.