---
library_name: transformers
tags:
- video-classification
- vjepa2
- computer-vision
- video-understanding
- fine-tuned
- pytorch
---
# Model Card for VJEPA2 Fine-tuned Video Classification Model
This model is a fine-tuned version of Meta AI's VJEPA2 (Video Joint-Embedding Predictive Architecture) for video classification tasks. It was fine-tuned efficiently by freezing the pre-trained backbone and accumulating gradients across steps.
## Model Details
### Model Description
This is a VJEPA2 model adapted for video classification. It pairs the pre-trained VJEPA2 backbone with a custom classification head; during fine-tuning the backbone is frozen and gradient accumulation is used to simulate larger batches.
- **Developed by:** Yiqiao Yin
- **Funded by:** Yiqiao Yin
- **Model type:** Video Classification
- **Language(s) (NLP):** Not applicable (vision-only model)
- **License:** Apache 2.0
- **Finetuned from model:** qubvel-hf/vjepa2-vitl-fpc16-256-ssv2
### Model Sources
- **Repository:** [More Information Needed]
- **Paper:** [V-JEPA: Video Joint Embedding Predictive Architecture](https://arxiv.org/abs/2301.08243)
- **Base Model:** [qubvel-hf/vjepa2-vitl-fpc16-256-ssv2](https://huggingface.co/qubvel-hf/vjepa2-vitl-fpc16-256-ssv2)
## Uses
### Direct Use
This model can be directly used for video classification tasks. It processes video inputs and outputs class predictions based on the learned representations from the VJEPA2 backbone.
### Downstream Use
The model can be further fine-tuned for specific video understanding tasks such as:
- Action recognition
- Video content classification
- Temporal activity detection
- Video scene understanding
### Out-of-Scope Use
This model is not intended for:
- Real-time video processing applications requiring sub-second inference
- High-resolution video analysis beyond the training resolution
- Audio-based video classification (visual features only)
- Video generation or synthesis tasks
## Bias, Risks, and Limitations
The model inherits biases from the original VJEPA2 pre-training data and may exhibit performance variations across different video domains, lighting conditions, and demographic representations in video content.
### Recommendations
Users should evaluate the model's performance on their specific use case and consider additional fine-tuning if the target domain differs significantly from the training data. Monitor for potential biases in video content classification across different demographic groups.
## How to Get Started with the Model
Use the code below to get started with the model:
```python
import torch
from transformers import VJEPA2VideoProcessor, VJEPA2ForVideoClassification

# Load the processor and base model
model_name = "qubvel-hf/vjepa2-vitl-fpc16-256-ssv2"
processor = VJEPA2VideoProcessor.from_pretrained(model_name)

# Replace with the label mapping for your own classes
label2id = {"class_a": 0, "class_b": 1}
id2label = {i: label for label, i in label2id.items()}

model = VJEPA2ForVideoClassification.from_pretrained(
    model_name,
    torch_dtype=torch.float32,
    label2id=label2id,
    id2label=id2label,
    ignore_mismatched_sizes=True,  # a fresh classification head replaces the original
).to("cuda")
model.eval()

# video_data: decoded video frames, e.g. an array of shape (num_frames, height, width, 3)
inputs = processor(video_data, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
```
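The predicted class name can then be read off through the model's `id2label` mapping. A minimal follow-up, assuming a single video in the batch:

```python
pred_id = predictions.argmax(dim=-1).item()
print(f"Predicted class: {model.config.id2label[pred_id]}")
```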
## Training Details
### Training Data
The model was fine-tuned on a custom video classification dataset. The specific dataset details depend on the user's implementation and target classification task.
### Training Procedure
#### Preprocessing
Videos are decoded into frames and processed with the VJEPA2VideoProcessor, which handles (see the sketch below):
- Temporal sampling of frames
- Spatial resizing and cropping to the model's input resolution
- Pixel rescaling and normalization
- Tensor conversion for model input
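A minimal preprocessing sketch; the random array here is dummy data standing in for decoded video frames, and the output key name is an assumption worth verifying with `inputs.keys()`:

```python
import numpy as np
from transformers import VJEPA2VideoProcessor

processor = VJEPA2VideoProcessor.from_pretrained("qubvel-hf/vjepa2-vitl-fpc16-256-ssv2")

# 16 dummy RGB frames standing in for a decoded video clip
frames = np.random.randint(0, 256, size=(16, 480, 640, 3), dtype=np.uint8)

inputs = processor(list(frames), return_tensors="pt")
print(inputs["pixel_values_videos"].shape)  # e.g. (1, 16, 3, 256, 256)
```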
#### Training Hyperparameters
- **Training regime:** FP32 precision
- **Optimizer:** Adam
- **Learning rate:** 1e-5
- **Epochs:** 5
- **Gradient accumulation steps:** 4
- **Backbone freezing:** VJEPA2 backbone parameters frozen, only classification head trained
- **Batch processing:** Gradient accumulation for effective larger batch size
#### Training Configuration
```python
# Freeze all VJEPA2 backbone parameters
for param in model.vjepa2.parameters():
    param.requires_grad = False

# Optimize only the classification head
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-5)
```
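A minimal training loop with the 4-step gradient accumulation described above might look like this sketch; `train_loader` and its batch fields are assumptions about the user's dataset:

```python
import torch.nn.functional as F

accumulation_steps = 4
model.train()
optimizer.zero_grad()

for step, batch in enumerate(train_loader):  # train_loader: your DataLoader
    pixel_values = batch["pixel_values_videos"].to(model.device)
    labels = batch["labels"].to(model.device)

    logits = model(pixel_values_videos=pixel_values).logits
    loss = F.cross_entropy(logits, labels)

    # Scale the loss so the accumulated gradient matches a 4x larger batch
    (loss / accumulation_steps).backward()

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```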
#### Speeds, Sizes, Times
- **Training time:** depends on dataset size and hardware
- **GPU memory:** kept modest by freezing the backbone and using small per-step batches
- **Effective batch size:** per-step batch size × 4 (via gradient accumulation)
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
The model is evaluated on held-out test sets from the training dataset, with validation performed after each epoch.
#### Factors
Evaluation considers:
- Video content diversity
- Temporal complexity
- Visual quality variations
- Classification difficulty across different classes
#### Metrics
- **Primary metric:** Classification Accuracy
- **Validation:** Per-epoch validation accuracy
- **Final evaluation:** Test set accuracy
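The validation and test accuracies above can be computed with a loop like the following sketch, assuming a `val_loader` DataLoader with the same batch format as training:

```python
@torch.no_grad()
def evaluate(model, loader):
    model.eval()
    correct, total = 0, 0
    for batch in loader:
        pixel_values = batch["pixel_values_videos"].to(model.device)
        labels = batch["labels"].to(model.device)
        logits = model(pixel_values_videos=pixel_values).logits
        correct += (logits.argmax(dim=-1) == labels).sum().item()
        total += labels.size(0)
    return correct / total
```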
### Results
The model's performance is monitored through:
- Training loss progression with gradient accumulation
- Validation accuracy per epoch
- Final test accuracy
- TensorBoard logging for comprehensive monitoring
## Model Examination
The model uses a frozen VJEPA2 backbone for feature extraction, with only the classification head being trained. This approach:
- Preserves pre-trained video understanding capabilities
- Reduces computational requirements
- Prevents overfitting on smaller datasets
- Enables efficient domain adaptation
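A quick sanity check after freezing is to count trainable versus total parameters, which should show only the classification head remaining trainable:

```python
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable_params:,} / {total_params:,} "
      f"({100 * trainable_params / total_params:.2f}%)")
```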
## Environmental Impact
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
- **Hardware Type:** NVIDIA GPU (CUDA-enabled)
- **Hours used:** Dependent on dataset size and training configuration
- **Training efficiency:** Optimized through gradient accumulation and backbone freezing
- **Carbon Emitted:** not measured; lower than full fine-tuning, since only the classification head is trained
## Technical Specifications
### Model Architecture and Objective
- **Base Architecture:** VJEPA2 (Video Joint Embedding Predictive Architecture)
- **Model Size:** ViT-Large with 16-frame processing capability
- **Input Resolution:** 256x256 pixels
- **Temporal Sampling:** 16 frames per video
- **Classification Head:** Custom layer adapted to target classes
- **Objective:** Cross-entropy loss for multi-class classification
### Compute Infrastructure
#### Hardware
- **GPU:** NVIDIA CUDA-compatible GPU
- **Memory:** Sufficient VRAM for model and gradient accumulation
- **Compute Capability:** CUDA support required
#### Software
- **Framework:** PyTorch
- **Library:** Transformers (Hugging Face)
- **Dependencies:**
  - torch
  - transformers (provides `VJEPA2VideoProcessor` and `VJEPA2ForVideoClassification`)
## Citation
**BibTeX:**
```bibtex
@article{bardes2024vjepa,
title={V-JEPA: Video Joint Embedding Predictive Architecture},
author={Bardes, Adrien and Ponce, Jean and LeCun, Yann},
journal={arXiv preprint arXiv:2301.08243},
year={2024}
}
```
**APA:**
Bardes, A., Ponce, J., & LeCun, Y. (2024). V-JEPA: Video Joint Embedding Predictive Architecture. arXiv preprint arXiv:2301.08243.
## Glossary
- **VJEPA2:** Video Joint Embedding Predictive Architecture, second version
- **Gradient Accumulation:** Technique to simulate larger batch sizes by accumulating gradients over multiple steps
- **Backbone Freezing:** Training strategy where pre-trained layers are frozen and only task-specific layers are trained
- **Video Classification:** Task of assigning categorical labels to video sequences
## More Information
For more details on the VJEPA2 architecture and training methodology, refer to the original paper and the base model documentation.
## Model Card Authors
Yiqiao Yin
## Model Card Contact
For questions or issues regarding this model, please contact the model author or create an issue in the model repository. |