---
library_name: transformers
tags:
- video-classification
- vjepa2
- computer-vision
- video-understanding
- fine-tuned
- pytorch
---

# Model Card for VJEPA2 Fine-tuned Video Classification Model

This model is a fine-tuned version of Meta's VJEPA2 (Video Joint-Embedding Predictive Architecture 2) for video classification tasks. It was trained with the backbone frozen and gradient accumulation enabled for efficient fine-tuning.

## Model Details

### Model Description

This is a fine-tuned VJEPA2 model specifically adapted for video classification tasks. The model leverages the pre-trained VJEPA2 backbone with a custom classification head, trained using efficient fine-tuning techniques including backbone freezing and gradient accumulation.

- **Developed by:** Yiqiao Yin
- **Funded by:** Yiqiao Yin
- **Model type:** Video Classification
- **Language(s) (NLP):** Not applicable (vision-only model)
- **License:** Apache 2.0
- **Finetuned from model:** qubvel-hf/vjepa2-vitl-fpc16-256-ssv2

### Model Sources

- **Repository:** [More Information Needed]
- **Paper:** [V-JEPA: Revisiting Feature Prediction for Learning Visual Representations from Video](https://arxiv.org/abs/2404.08471)
- **Base Model:** [qubvel-hf/vjepa2-vitl-fpc16-256-ssv2](https://huggingface.co/qubvel-hf/vjepa2-vitl-fpc16-256-ssv2)

## Uses

### Direct Use

This model can be directly used for video classification tasks. It processes video inputs and outputs class predictions based on the learned representations from the VJEPA2 backbone.

### Downstream Use

The model can be further fine-tuned for specific video understanding tasks (see the sketch after this list), such as:
- Action recognition
- Video content classification
- Temporal activity detection
- Video scene understanding
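For domains that differ substantially from the fine-tuning data, one option is to unfreeze the backbone and continue training with a smaller learning rate for the pre-trained weights. Below is a hedged sketch, using the model loaded as in the Getting Started section; the `model.classifier` attribute name is an assumption, not confirmed by this card:

```python
# Unfreeze the backbone for continued fine-tuning
for param in model.vjepa2.parameters():
    param.requires_grad = True

# Use a smaller learning rate for pre-trained weights than for the new head
optimizer = torch.optim.Adam([
    {"params": model.vjepa2.parameters(), "lr": 1e-6},      # backbone
    {"params": model.classifier.parameters(), "lr": 1e-5},  # head (assumed attribute)
])
```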

### Out-of-Scope Use

This model is not intended for:
- Real-time video processing applications requiring sub-second inference
- High-resolution video analysis beyond the training resolution
- Audio-based video classification (visual features only)
- Video generation or synthesis tasks

## Bias, Risks, and Limitations

The model inherits biases from the original VJEPA2 pre-training data and may exhibit performance variations across different video domains, lighting conditions, and demographic representations in video content.

### Recommendations

Users should evaluate the model's performance on their specific use case and consider additional fine-tuning if the target domain differs significantly from the training data. Monitor for potential biases in video content classification across different demographic groups.

## How to Get Started with the Model

Use the code below to get started with the model:

```python
import torch
from transformers import VJEPA2VideoProcessor, VJEPA2ForVideoClassification

# Define your label mappings (example values shown)
label2id = {"class_a": 0, "class_b": 1}
id2label = {v: k for k, v in label2id.items()}

# Load the processor and model, re-initializing the classification head
# for the target label set
model_name = "qubvel-hf/vjepa2-vitl-fpc16-256-ssv2"
processor = VJEPA2VideoProcessor.from_pretrained(model_name)
model = VJEPA2ForVideoClassification.from_pretrained(
    model_name,
    torch_dtype=torch.float32,
    label2id=label2id,
    id2label=id2label,
    ignore_mismatched_sizes=True,  # head shape differs from the base checkpoint
).to("cuda")

# video_data: frames of a single video, e.g. shape (num_frames, height, width, 3)
inputs = processor(video_data, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(id2label[predictions.argmax(-1).item()])
```
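Note that `ignore_mismatched_sizes=True` is what allows the checkpoint to load even though the classification head is re-initialized for the new label set; without it, `from_pretrained` raises an error on the mismatched head weights.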

## Training Details

### Training Data

The model was fine-tuned on a custom video classification dataset. The specific dataset details depend on the user's implementation and target classification task.

### Training Procedure

#### Preprocessing

Videos are processed using the `VJEPA2VideoProcessor`, which handles the following steps (a loading sketch follows the list):
- Video frame extraction and normalization
- Temporal sampling
- Spatial resizing and augmentation
- Tensor conversion for model input
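As an illustration, a clip might be decoded and sampled like this. This is a sketch, not the card's original pipeline: it assumes torchvision's `read_video` and a hypothetical `example.mp4`, and uses the `processor` loaded in the Getting Started section:

```python
import torch
from torchvision.io import read_video

# Decode frames as (T, H, W, C) uint8; any decoder with this layout works
frames, _, _ = read_video("example.mp4", pts_unit="sec", output_format="THWC")

# Sample 16 evenly spaced frames to match the model's temporal input
indices = torch.linspace(0, frames.shape[0] - 1, steps=16).long()
clip = frames[indices]

# The processor performs resizing, normalization, and tensor conversion
inputs = processor(clip.numpy(), return_tensors="pt")
```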

#### Training Hyperparameters

- **Training regime:** FP32 precision
- **Optimizer:** Adam
- **Learning rate:** 1e-5
- **Epochs:** 5
- **Gradient accumulation steps:** 4
- **Backbone freezing:** VJEPA2 backbone parameters frozen, only classification head trained
- **Batch processing:** Gradient accumulation to simulate a larger effective batch size (see the training-loop sketch below)

#### Training Configuration

```python
# Freeze backbone parameters
for param in model.vjepa2.parameters():
    param.requires_grad = False

# Only train classification head
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-5)
```
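The accumulation loop itself is not included in the card's source. A minimal sketch of how the 4-step accumulation above might look, assuming a `train_loader` that yields processor outputs with `pixel_values_videos` and `labels` keys:

```python
accumulation_steps = 4
num_epochs = 5

model.train()
for epoch in range(num_epochs):
    optimizer.zero_grad()
    for step, batch in enumerate(train_loader):
        outputs = model(
            pixel_values_videos=batch["pixel_values_videos"].to(model.device),
            labels=batch["labels"].to(model.device),
        )
        # Scale so gradients average over the accumulated mini-batches
        loss = outputs.loss / accumulation_steps
        loss.backward()
        # Step the optimizer only every `accumulation_steps` mini-batches
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```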

#### Speeds, Sizes, Times

- **Training time:** Depends on dataset size and hardware
- **GPU memory:** Optimized through gradient accumulation
- **Effective batch size:** Original batch size × 4 (due to gradient accumulation)

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

The model is evaluated on held-out test sets from the training dataset, with validation performed after each epoch.

#### Factors

Evaluation considers:
- Video content diversity
- Temporal complexity
- Visual quality variations
- Classification difficulty across different classes

#### Metrics

- **Primary metric:** Classification Accuracy
- **Validation:** Per-epoch validation accuracy (a computation sketch follows this list)
- **Final evaluation:** Test set accuracy
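For concreteness, per-epoch validation accuracy might be computed as follows; this is a sketch assuming a `val_loader` with the same batch format as the training loop above:

```python
model.eval()
correct, total = 0, 0
with torch.no_grad():
    for batch in val_loader:
        outputs = model(
            pixel_values_videos=batch["pixel_values_videos"].to(model.device)
        )
        preds = outputs.logits.argmax(dim=-1).cpu()
        correct += (preds == batch["labels"]).sum().item()
        total += batch["labels"].numel()
print(f"Validation accuracy: {correct / total:.4f}")
```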

### Results

The model's performance is monitored through:
- Training loss progression with gradient accumulation
- Validation accuracy per epoch
- Final test accuracy
- TensorBoard logging for comprehensive monitoring

## Model Examination

The model uses a frozen VJEPA2 backbone for feature extraction, with only the classification head being trained (a parameter-count check follows this list). This approach:
- Preserves pre-trained video understanding capabilities
- Reduces computational requirements
- Prevents overfitting on smaller datasets
- Enables efficient domain adaptation
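A quick way to confirm the freezing took effect is to count trainable parameters (plain PyTorch, no further assumptions):

```python
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```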

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** NVIDIA GPU (CUDA-enabled)
- **Hours used:** Dependent on dataset size and training configuration
- **Training efficiency:** Optimized through gradient accumulation and backbone freezing
- **Carbon Emitted:** Not measured; expected to be lower than full fine-tuning, since only the classification head is trained

## Technical Specifications

### Model Architecture and Objective

- **Base Architecture:** VJEPA2 (Video Joint Embedding Predictive Architecture)
- **Model Size:** ViT-Large with 16-frame processing capability
- **Input Resolution:** 256x256 pixels
- **Temporal Sampling:** 16 frames per video
- **Classification Head:** Custom layer adapted to target classes
- **Objective:** Cross-entropy loss for multi-class classification

### Compute Infrastructure

#### Hardware

- **GPU:** NVIDIA CUDA-compatible GPU
- **Memory:** Sufficient VRAM for model and gradient accumulation
- **Compute Capability:** CUDA support required

#### Software

- **Framework:** PyTorch
- **Library:** Transformers (Hugging Face)
- **Dependencies:**
  - torch
  - transformers (provides `VJEPA2VideoProcessor` and `VJEPA2ForVideoClassification`)

## Citation

**BibTeX:**

```bibtex
@article{bardes2024vjepa,
  title={Revisiting Feature Prediction for Learning Visual Representations from Video},
  author={Bardes, Adrien and Garrido, Quentin and Ponce, Jean and Chen, Xinlei and Rabbat, Michael and LeCun, Yann and Assran, Mahmoud and Ballas, Nicolas},
  journal={arXiv preprint arXiv:2404.08471},
  year={2024}
}
```

**APA:**

Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., & Ballas, N. (2024). Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471.

## Glossary

- **VJEPA2:** Video Joint Embedding Predictive Architecture, second version
- **Gradient Accumulation:** Technique to simulate larger batch sizes by accumulating gradients over multiple steps
- **Backbone Freezing:** Training strategy where pre-trained layers are frozen and only task-specific layers are trained
- **Video Classification:** Task of assigning categorical labels to video sequences

## More Information

For more details on the VJEPA2 architecture and training methodology, refer to the original paper and the base model documentation.

## Model Card Authors

Yiqiao Yin

## Model Card Contact

For questions or issues regarding this model, please contact the model author or create an issue in the model repository.