---
|
library_name: transformers |
|
tags: |
|
- video-classification |
|
- vjepa2 |
|
- computer-vision |
|
- video-understanding |
|
- fine-tuned |
|
- pytorch |
|
--- |
|
|
|
# Model Card for VJEPA2 Fine-tuned Video Classification Model |
|
|
|
This model is a fine-tuned version of Meta AI's VJEPA2 (Video Joint-Embedding Predictive Architecture) for video classification tasks. It was fine-tuned with a frozen backbone and gradient accumulation for efficient training.
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
This is a fine-tuned VJEPA2 model adapted for video classification. It pairs the pre-trained VJEPA2 backbone with a custom classification head; during fine-tuning the backbone is kept frozen so only the head is updated, and gradient accumulation provides a larger effective batch size.
|
|
|
- **Developed by:** Yiqiao Yin |
|
- **Funded by:** Yiqiao Yin |
|
- **Model type:** Video Classification |
|
- **Language(s) (NLP):** Not applicable (vision-only model)
|
- **License:** Apache 2.0 |
|
- **Finetuned from model:** qubvel-hf/vjepa2-vitl-fpc16-256-ssv2 |
|
|
|
### Model Sources |
|
|
|
- **Repository:** [More Information Needed] |
|
- **Paper:** [Revisiting Feature Prediction for Learning Visual Representations from Video (V-JEPA)](https://arxiv.org/abs/2404.08471)
|
- **Base Model:** [qubvel-hf/vjepa2-vitl-fpc16-256-ssv2](https://huggingface.co/qubvel-hf/vjepa2-vitl-fpc16-256-ssv2) |
|
|
|
## Uses |
|
|
|
### Direct Use |
|
|
|
This model can be directly used for video classification tasks. It processes video inputs and outputs class predictions based on the learned representations from the VJEPA2 backbone. |
|
|
|
### Downstream Use |
|
|
|
The model can be further fine-tuned for specific video understanding tasks such as: |
|
- Action recognition |
|
- Video content classification |
|
- Temporal activity detection |
|
- Video scene understanding |
|
|
|
### Out-of-Scope Use |
|
|
|
This model is not intended for: |
|
- Real-time video processing applications requiring sub-second inference |
|
- High-resolution video analysis beyond the training resolution |
|
- Audio-based video classification (visual features only) |
|
- Video generation or synthesis tasks |
|
|
|
## Bias, Risks, and Limitations |
|
|
|
The model inherits biases from the original VJEPA2 pre-training data and may exhibit performance variations across different video domains, lighting conditions, and demographic representations in video content. |
|
|
|
### Recommendations |
|
|
|
Users should evaluate the model's performance on their specific use case and consider additional fine-tuning if the target domain differs significantly from the training data. Monitor for potential biases in video content classification across different demographic groups. |
|
|
|
## How to Get Started with the Model |
|
|
|
Use the code below to get started with the model: |
|
|
|
```python
import torch
from transformers import VJEPA2VideoProcessor, VJEPA2ForVideoClassification

# Example label mappings -- replace these with the classes of your dataset
label2id = {"class_a": 0, "class_b": 1}
id2label = {v: k for k, v in label2id.items()}

# Base checkpoint used for fine-tuning; replace with this repo's ID to load
# the fine-tuned weights
model_name = "qubvel-hf/vjepa2-vitl-fpc16-256-ssv2"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model and processor
processor = VJEPA2VideoProcessor.from_pretrained(model_name)
model = VJEPA2ForVideoClassification.from_pretrained(
    model_name,
    torch_dtype=torch.float32,
    label2id=label2id,
    id2label=id2label,
    ignore_mismatched_sizes=True,  # the new head's size differs from the checkpoint's
).to(device)

# `video_data` holds the decoded frames -- see the loading sketch below
inputs = processor(video_data, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(**inputs)
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(model.config.id2label[int(probs.argmax(-1))])
```
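
The snippet above leaves `video_data` undefined. A minimal way to produce it is sketched below, assuming `torchvision` is installed and that `video.mp4` is a local file (both are illustrative assumptions, not requirements of this model):

```python
import torch
from torchvision.io import read_video

# Decode the full clip to a (T, H, W, C) uint8 tensor; the audio track is ignored
frames, _, _ = read_video("video.mp4", pts_unit="sec", output_format="THWC")

# Uniformly sample 16 frames, matching the checkpoint's frames-per-clip (fpc16)
indices = torch.linspace(0, frames.shape[0] - 1, steps=16).long()
video_data = frames[indices]
```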
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
The model was fine-tuned on a custom video classification dataset. The specific dataset details depend on the user's implementation and target classification task. |
|
|
|
### Training Procedure |
|
|
|
#### Preprocessing |
|
|
|
Videos are processed using the `VJEPA2VideoProcessor` (a short output-inspection sketch follows this list), which handles:
|
- Video frame extraction and normalization |
|
- Temporal sampling |
|
- Spatial resizing and augmentation |
|
- Tensor conversion for model input |
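
To see what the processor produces for a clip, you can inspect its output tensors directly; this sketch reuses `processor` and `video_data` from the snippets above and makes no assumption about the exact output key names:

```python
# Run the processor on one clip and inspect what it returns
inputs = processor(video_data, return_tensors="pt")

for name, tensor in inputs.items():
    print(name, tuple(tensor.shape), tensor.dtype)
```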
|
|
|
#### Training Hyperparameters |
|
|
|
- **Training regime:** FP32 precision |
|
- **Optimizer:** Adam |
|
- **Learning rate:** 1e-5 |
|
- **Epochs:** 5 |
|
- **Gradient accumulation steps:** 4 (effective batch size = per-device batch size × 4)

- **Backbone freezing:** VJEPA2 backbone parameters frozen; only the classification head is trained
|
|
|
#### Training Configuration |
|
|
|
```python
# Freeze all backbone parameters (`model.vjepa2` is the VJEPA2 encoder
# inside VJEPA2ForVideoClassification)
for param in model.vjepa2.parameters():
    param.requires_grad = False

# Optimize only the remaining (classification head) parameters
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-5)
```
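
For context, here is a sketch of a training loop combining the frozen-backbone setup above with 4-step gradient accumulation; `train_loader` is an illustrative assumption (a `DataLoader` yielding processor outputs plus a `labels` tensor), not part of this repository:

```python
accumulation_steps = 4
num_epochs = 5
model.train()

for epoch in range(num_epochs):
    optimizer.zero_grad()
    for step, batch in enumerate(train_loader):
        batch = {k: v.to(model.device) for k, v in batch.items()}
        outputs = model(**batch)  # with `labels` present, a cross-entropy loss is returned

        # Scale the loss so the accumulated gradient matches a larger-batch update
        loss = outputs.loss / accumulation_steps
        loss.backward()

        # Step the optimizer once every `accumulation_steps` mini-batches
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```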
|
|
|
#### Speeds, Sizes, Times |
|
|
|
- **Training time:** Depends on dataset size and hardware

- **GPU memory:** Kept modest by combining small per-step batches with gradient accumulation

- **Effective batch size:** Per-device batch size × 4 (gradient accumulation steps)
|
|
|
## Evaluation |
|
|
|
### Testing Data, Factors & Metrics |
|
|
|
#### Testing Data |
|
|
|
The model is evaluated on held-out test sets from the training dataset, with validation performed after each epoch. |
|
|
|
#### Factors |
|
|
|
Evaluation considers: |
|
- Video content diversity |
|
- Temporal complexity |
|
- Visual quality variations |
|
- Classification difficulty across different classes |
|
|
|
#### Metrics |
|
|
|
- **Primary metric:** Classification Accuracy |
|
- **Validation:** Per-epoch validation accuracy |
|
- **Final evaluation:** Test set accuracy (a minimal computation is sketched below)
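
A minimal accuracy computation over a held-out split, under the same illustrative assumptions as the training sketch above (`test_loader` is hypothetical):

```python
model.eval()
correct = total = 0

with torch.no_grad():
    for batch in test_loader:  # same batch format as training
        labels = batch.pop("labels").to(model.device)
        batch = {k: v.to(model.device) for k, v in batch.items()}
        preds = model(**batch).logits.argmax(dim=-1)
        correct += (preds == labels).sum().item()
        total += labels.numel()

print(f"Test accuracy: {correct / total:.4f}")
```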
|
|
|
### Results |
|
|
|
The model's performance is monitored through: |
|
- Training loss progression with gradient accumulation |
|
- Validation accuracy per epoch |
|
- Final test accuracy |
|
- TensorBoard logging for comprehensive monitoring |
|
|
|
## Model Examination |
|
|
|
The model uses a frozen VJEPA2 backbone for feature extraction, with only the classification head being trained; the sketch after this list verifies that split. This approach:
|
- Preserves pre-trained video understanding capabilities |
|
- Reduces computational requirements |
|
- Prevents overfitting on smaller datasets |
|
- Enables efficient domain adaptation |
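
One way to verify the frozen/trainable split, reusing `model` from the snippets above:

```python
# Count parameters by trainability to confirm that only the head updates
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
print(f"Trainable: {trainable:,} | Frozen: {frozen:,}")
```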
|
|
|
## Environmental Impact |
|
|
|
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). |
|
|
|
- **Hardware Type:** NVIDIA GPU (CUDA-enabled) |
|
- **Hours used:** Dependent on dataset size and training configuration |
|
- **Training efficiency:** Optimized through gradient accumulation and backbone freezing |
|
- **Carbon Emitted:** Not measured; expected to be lower than full fine-tuning since only the classification head is updated (see the sketch below for measuring your own runs)
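
For a concrete estimate on your own runs, one option is the third-party `codecarbon` package; it is an illustrative suggestion here, not a dependency of this model:

```python
from codecarbon import EmissionsTracker

tracker = EmissionsTracker()  # samples energy use of the current process
tracker.start()
# ... run the fine-tuning loop from the Training Configuration section ...
emissions_kg = tracker.stop()  # returns the estimated kg of CO2-equivalent
print(f"Estimated emissions: {emissions_kg:.4f} kg CO2eq")
```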
|
|
|
## Technical Specifications |
|
|
|
### Model Architecture and Objective |
|
|
|
- **Base Architecture:** VJEPA2 (Video Joint Embedding Predictive Architecture) |
|
- **Model Size:** ViT-Large (ViT-L) backbone
|
- **Input Resolution:** 256x256 pixels |
|
- **Temporal Sampling:** 16 frames per video |
|
- **Classification Head:** Custom layer adapted to target classes |
|
- **Objective:** Cross-entropy loss for multi-class classification |
|
|
|
### Compute Infrastructure |
|
|
|
#### Hardware |
|
|
|
- **GPU:** NVIDIA CUDA-compatible GPU |
|
- **Memory:** Sufficient VRAM for model and gradient accumulation |
|
- **Compute Capability:** CUDA support required |
|
|
|
#### Software |
|
|
|
- **Framework:** PyTorch |
|
- **Library:** Transformers (Hugging Face) |
|
- **Dependencies:**

  - torch

  - transformers

- **Key classes (from transformers):**

  - VJEPA2VideoProcessor

  - VJEPA2ForVideoClassification
|
|
|
## Citation |
|
|
|
**BibTeX:** |
|
|
|
```bibtex
@article{bardes2024revisiting,
  title={Revisiting Feature Prediction for Learning Visual Representations from Video},
  author={Bardes, Adrien and Garrido, Quentin and Ponce, Jean and Chen, Xinlei and Rabbat, Michael and LeCun, Yann and Assran, Mahmoud and Ballas, Nicolas},
  journal={arXiv preprint arXiv:2404.08471},
  year={2024}
}
```
|
|
|
**APA:** |
|
|
|
Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., & Ballas, N. (2024). Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471.
|
|
|
## Glossary |
|
|
|
- **VJEPA2:** Video Joint Embedding Predictive Architecture, second version |
|
- **Gradient Accumulation:** Technique to simulate larger batch sizes by accumulating gradients over multiple steps |
|
- **Backbone Freezing:** Training strategy where pre-trained layers are frozen and only task-specific layers are trained |
|
- **Video Classification:** Task of assigning categorical labels to video sequences |
|
|
|
## More Information |
|
|
|
For more details on the VJEPA2 architecture and training methodology, refer to the original paper and the base model documentation. |
|
|
|
## Model Card Authors |
|
|
|
Yiqiao Yin |
|
|
|
## Model Card Contact |
|
|
|
For questions or issues regarding this model, please contact the model author or create an issue in the model repository. |