Update README.md
README.md (CHANGED)
---
library_name: transformers
tags:
- video-classification
- vjepa2
- computer-vision
- video-understanding
- fine-tuned
- pytorch
---

# Model Card for VJEPA2 Fine-tuned Video Classification Model

This model is a fine-tuned version of Meta AI's VJEPA2 (Video Joint Embedding Predictive Architecture) for video classification tasks. It was fine-tuned efficiently using gradient accumulation and a frozen backbone.
## Model Details

### Model Description

This is a VJEPA2 model adapted for video classification. It pairs the pre-trained VJEPA2 backbone with a custom classification head and was trained with efficient fine-tuning techniques, namely backbone freezing and gradient accumulation.

- **Developed by:** Yiqiao Yin
- **Funded by:** Yiqiao Yin
- **Model type:** Video Classification
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Finetuned from model:** qubvel-hf/vjepa2-vitl-fpc16-256-ssv2

### Model Sources

- **Repository:** [More Information Needed]
- **Paper:** [V-JEPA: Video Joint Embedding Predictive Architecture](https://arxiv.org/abs/2301.08243)
- **Base Model:** [qubvel-hf/vjepa2-vitl-fpc16-256-ssv2](https://huggingface.co/qubvel-hf/vjepa2-vitl-fpc16-256-ssv2)

## Uses

### Direct Use

This model can be directly used for video classification tasks. It processes video inputs and outputs class predictions based on the learned representations from the VJEPA2 backbone.

### Downstream Use

The model can be further fine-tuned for specific video understanding tasks such as:

- Action recognition
- Video content classification
- Temporal activity detection
- Video scene understanding

### Out-of-Scope Use

This model is not intended for:

- Real-time video processing applications requiring sub-second inference
- High-resolution video analysis beyond the training resolution
- Audio-based video classification (visual features only)
- Video generation or synthesis tasks

## Bias, Risks, and Limitations

The model inherits biases from the original VJEPA2 pre-training data and may exhibit performance variations across different video domains, lighting conditions, and demographic representations in video content.

### Recommendations

Users should evaluate the model's performance on their specific use case and consider additional fine-tuning if the target domain differs significantly from the training data. Monitor for potential biases in video content classification across different demographic groups.

## How to Get Started with the Model

Use the code below to get started with the model:

```python
import torch
from transformers import VJEPA2VideoProcessor, VJEPA2ForVideoClassification

# Example label mappings -- replace these with the classes from your own dataset.
label2id = {"class_a": 0, "class_b": 1}
id2label = {v: k for k, v in label2id.items()}

# Load the processor and the model (base checkpoint shown; swap in your fine-tuned weights).
model_name = "qubvel-hf/vjepa2-vitl-fpc16-256-ssv2"
processor = VJEPA2VideoProcessor.from_pretrained(model_name)
model = VJEPA2ForVideoClassification.from_pretrained(
    model_name,
    torch_dtype=torch.float32,
    label2id=label2id,
    id2label=id2label,
    ignore_mismatched_sizes=True,  # the classification head is re-initialized for the new classes
).to("cuda" if torch.cuda.is_available() else "cpu")

# `video_data` is a sequence of RGB frames, e.g. a numpy array of shape (num_frames, height, width, 3).
inputs = processor(video_data, return_tensors="pt").to(model.device)
outputs = model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_label = model.config.id2label[predictions.argmax(-1).item()]
```

## Training Details

### Training Data

The model was fine-tuned on a custom video classification dataset. The specific dataset details depend on the user's implementation and target classification task.

### Training Procedure

#### Preprocessing

Videos are processed using the `VJEPA2VideoProcessor`, which handles the following (a usage sketch follows the list):

- Video frame extraction and normalization
- Temporal sampling
- Spatial resizing and augmentation
- Tensor conversion for model input
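
The snippet below is a minimal sketch of this preprocessing step. The frame source is an assumption (random frames stand in for a real sampled clip); only the `VJEPA2VideoProcessor` call mirrors the usage shown elsewhere in this card.

```python
import numpy as np
from transformers import VJEPA2VideoProcessor

processor = VJEPA2VideoProcessor.from_pretrained("qubvel-hf/vjepa2-vitl-fpc16-256-ssv2")

# Stand-in for 16 RGB frames sampled from a clip: a (num_frames, height, width, 3) uint8 array.
video_frames = np.random.randint(0, 256, size=(16, 360, 640, 3), dtype=np.uint8)

# The processor resizes, normalizes, and stacks the frames into model-ready tensors.
inputs = processor(video_frames, return_tensors="pt")
print({name: tuple(tensor.shape) for name, tensor in inputs.items()})
```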

#### Training Hyperparameters

- **Training regime:** FP32 precision
- **Optimizer:** Adam
- **Learning rate:** 1e-5
- **Epochs:** 5
- **Gradient accumulation steps:** 4
- **Backbone freezing:** VJEPA2 backbone parameters frozen; only the classification head is trained
- **Batch processing:** Gradient accumulation to reach a larger effective batch size

#### Training Configuration

The backbone is frozen so that only the classification head receives gradient updates; a sketch of the accompanying gradient-accumulation loop follows the snippet.

```python
# Freeze the VJEPA2 backbone parameters.
for param in model.vjepa2.parameters():
    param.requires_grad = False

# Optimize only the remaining (classification head) parameters.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-5)
```
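
The card does not publish the full training loop, so the following is a hedged sketch of how 4-step gradient accumulation combines with the frozen-backbone setup, reusing `model` and `optimizer` from the snippet above. `train_loader` and its batch format (processor outputs plus a `"labels"` tensor) are assumptions, not the author's actual code.

```python
import torch.nn.functional as F

accum_steps = 4  # gradient accumulation steps listed in the hyperparameters

model.train()
for epoch in range(5):  # 5 epochs, as listed above
    optimizer.zero_grad()
    for step, batch in enumerate(train_loader):  # assumed DataLoader yielding dicts of tensors
        labels = batch.pop("labels").to(model.device)
        inputs = {k: v.to(model.device) for k, v in batch.items()}
        loss = F.cross_entropy(model(**inputs).logits, labels)
        (loss / accum_steps).backward()  # scale so the accumulated gradient matches a larger batch
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```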

#### Speeds, Sizes, Times

- **Training time:** Depends on dataset size and hardware
- **GPU memory:** Optimized through gradient accumulation
- **Effective batch size:** Original batch size × 4 (due to gradient accumulation)

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

The model is evaluated on held-out test sets from the training dataset, with validation performed after each epoch.

#### Factors

Evaluation considers:

- Video content diversity
- Temporal complexity
- Visual quality variations
- Classification difficulty across different classes

#### Metrics

- **Primary metric:** Classification accuracy
- **Validation:** Per-epoch validation accuracy (a sketch of the computation follows this list)
- **Final evaluation:** Test set accuracy
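
As a concrete illustration of the accuracy metric, here is a hedged evaluation sketch. The `val_loader` object and its batch format (processor outputs plus a `"labels"` tensor) are assumptions; the card does not include the author's evaluation script.

```python
import torch


@torch.no_grad()
def evaluate_accuracy(model, val_loader):
    """Top-1 classification accuracy over a validation/test loader."""
    model.eval()
    correct, total = 0, 0
    for batch in val_loader:
        labels = batch.pop("labels").to(model.device)
        inputs = {k: v.to(model.device) for k, v in batch.items()}
        preds = model(**inputs).logits.argmax(dim=-1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / max(total, 1)
```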

### Results

The model's performance is monitored through:

- Training loss progression with gradient accumulation
- Validation accuracy per epoch
- Final test accuracy
- TensorBoard logging for comprehensive monitoring (see the logging sketch below)
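
A minimal sketch of the TensorBoard logging mentioned above, using PyTorch's built-in `SummaryWriter`. The log directory, tag names, and example values are placeholders, not recorded results.

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/vjepa2_finetune")  # hypothetical log directory

# Called from inside the training/validation loops; the values here are placeholders.
writer.add_scalar("train/loss", 0.42, global_step=100)
writer.add_scalar("val/accuracy", 0.87, global_step=1)
writer.close()
```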

## Model Examination

The model uses a frozen VJEPA2 backbone for feature extraction, with only the classification head being trained; a parameter-count check follows the list below. This approach:

- Preserves pre-trained video understanding capabilities
- Reduces computational requirements
- Prevents overfitting on smaller datasets
- Enables efficient domain adaptation
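
To confirm the frozen/trainable split, the following sketch counts parameters; it assumes the `model` object from the earlier snippets is in scope.

```python
# Count frozen vs. trainable parameters to verify that only the classification head updates.
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"frozen: {frozen:,} | trainable: {trainable:,}")
```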

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** NVIDIA GPU (CUDA-enabled)
- **Hours used:** Dependent on dataset size and training configuration
- **Training efficiency:** Improved through gradient accumulation and backbone freezing
- **Carbon Emitted:** Not measured; expected to be lower than full fine-tuning because only the classification head is trained

## Technical Specifications

### Model Architecture and Objective

- **Base Architecture:** VJEPA2 (Video Joint Embedding Predictive Architecture)
- **Model Size:** ViT-Large with 16-frame processing capability
- **Input Resolution:** 256×256 pixels
- **Temporal Sampling:** 16 frames per video
- **Classification Head:** Custom layer adapted to the target classes
- **Objective:** Cross-entropy loss for multi-class classification (illustrated below)
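
The classification objective above is standard cross-entropy over the class logits. A self-contained illustration, where random logits stand in for `model(**inputs).logits` and the class count is hypothetical:

```python
import torch
import torch.nn.functional as F

num_classes = 4                        # hypothetical number of target classes
logits = torch.randn(2, num_classes)   # stand-in for model(**inputs).logits on a batch of 2 clips
labels = torch.tensor([1, 3])          # ground-truth class indices

loss = F.cross_entropy(logits, labels)  # the multi-class training objective
print(loss.item())
```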

### Compute Infrastructure

#### Hardware

- **GPU:** NVIDIA CUDA-compatible GPU
- **Memory:** Sufficient VRAM for the model plus gradient accumulation
- **Compute Capability:** CUDA support required

#### Software

- **Framework:** PyTorch
- **Library:** Transformers (Hugging Face)
- **Dependencies:**
  - torch
  - transformers (provides `VJEPA2VideoProcessor` and `VJEPA2ForVideoClassification`)

## Citation

**BibTeX:**

```bibtex
@article{bardes2024vjepa,
  title={V-JEPA: Video Joint Embedding Predictive Architecture},
  author={Bardes, Adrien and Ponce, Jean and LeCun, Yann},
  journal={arXiv preprint arXiv:2301.08243},
  year={2024}
}
```

**APA:**

Bardes, A., Ponce, J., & LeCun, Y. (2024). V-JEPA: Video Joint Embedding Predictive Architecture. arXiv preprint arXiv:2301.08243.

## Glossary

- **VJEPA2:** Video Joint Embedding Predictive Architecture, second version
- **Gradient Accumulation:** Technique to simulate larger batch sizes by accumulating gradients over multiple steps
- **Backbone Freezing:** Training strategy where pre-trained layers are frozen and only task-specific layers are trained
- **Video Classification:** Task of assigning categorical labels to video sequences

## More Information

For more details on the VJEPA2 architecture and training methodology, refer to the original paper and the base model documentation.

## Model Card Authors

Yiqiao Yin

## Model Card Contact

For questions or issues regarding this model, please contact the model author or create an issue in the model repository.