---
library_name: transformers
license: apache-2.0
base_model: qubvel-hf/vjepa2-vitl-fpc16-256-ssv2
tags:
- video-classification
- vjepa2
- computer-vision
- video-understanding
- fine-tuned
- pytorch
---

# Model Card for VJEPA2 Fine-tuned Video Classification Model

This model is a fine-tuned version of Facebook's VJEPA2 (Video Joint Embedding Predictive Architecture) for video classification tasks. The model was fine-tuned with a frozen backbone and gradient accumulation for efficient training.

## Model Details

### Model Description

This is a fine-tuned VJEPA2 model adapted for video classification. It combines the pre-trained VJEPA2 backbone with a custom classification head and is trained using efficient fine-tuning techniques, namely backbone freezing and gradient accumulation.

- **Developed by:** Yiqiao Yin
- **Funded by:** Yiqiao Yin
- **Model type:** Video Classification
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Finetuned from model:** qubvel-hf/vjepa2-vitl-fpc16-256-ssv2

### Model Sources

- **Repository:** [More Information Needed]
- **Paper:** [V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning](https://arxiv.org/abs/2506.09985)
- **Base Model:** [qubvel-hf/vjepa2-vitl-fpc16-256-ssv2](https://huggingface.co/qubvel-hf/vjepa2-vitl-fpc16-256-ssv2)

## Uses

### Direct Use

This model can be used directly for video classification. It processes video inputs and outputs class predictions based on the learned representations from the VJEPA2 backbone.

### Downstream Use

The model can be further fine-tuned for specific video understanding tasks such as:

- Action recognition
- Video content classification
- Temporal activity detection
- Video scene understanding

### Out-of-Scope Use

This model is not intended for:

- Real-time video processing applications requiring sub-second inference
- High-resolution video analysis beyond the training resolution
- Audio-based video classification (visual features only)
- Video generation or synthesis tasks

## Bias, Risks, and Limitations

The model inherits biases from the original VJEPA2 pre-training data and may exhibit performance variations across different video domains, lighting conditions, and demographic representations in video content.

### Recommendations

Users should evaluate the model's performance on their specific use case and consider additional fine-tuning if the target domain differs significantly from the training data. Monitor for potential biases in video content classification across different demographic groups.

## How to Get Started with the Model

Use the code below to get started with the model:

```python
import torch
from transformers import VJEPA2VideoProcessor, VJEPA2ForVideoClassification

# Example label mappings -- replace with the classes of your target task
label2id = {"class_a": 0, "class_b": 1}
id2label = {v: k for k, v in label2id.items()}

# Load the processor and model; the classification head is re-initialized
# to match your label set, hence ignore_mismatched_sizes=True
model_name = "qubvel-hf/vjepa2-vitl-fpc16-256-ssv2"
processor = VJEPA2VideoProcessor.from_pretrained(model_name)
model = VJEPA2ForVideoClassification.from_pretrained(
    model_name,
    torch_dtype=torch.float32,
    label2id=label2id,
    id2label=id2label,
    ignore_mismatched_sizes=True,
).to("cuda")

# video_data: the frames of one video, e.g. an array of shape (num_frames, height, width, 3)
inputs = processor(video_data, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
```
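The snippet above assumes `video_data` already contains decoded frames. As a minimal sketch of one way to produce it, the following uses `torchvision.io.read_video` (torchvision is not among this card's listed dependencies, so treat it as an assumption) to decode a clip and uniformly sample 16 frames, matching the model's temporal sampling; `example.mp4` is a placeholder path.

```python
import torch
from torchvision.io import read_video

# Decode all frames of a clip into a (T, H, W, C) uint8 tensor
frames, _, _ = read_video("example.mp4", pts_unit="sec", output_format="THWC")

# Uniformly sample 16 frames to match the model's 16-frame temporal sampling
num_frames = 16
indices = torch.linspace(0, frames.shape[0] - 1, steps=num_frames).long()
video_data = frames[indices].numpy()  # (16, H, W, 3) array for the processor
```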
## Training Details

### Training Data

The model was fine-tuned on a custom video classification dataset. The specific dataset details depend on the user's implementation and target classification task.

### Training Procedure

#### Preprocessing

Videos are processed using the VJEPA2VideoProcessor, which handles:

- Video frame extraction and normalization
- Temporal sampling
- Spatial resizing and augmentation
- Tensor conversion for model input

#### Training Hyperparameters

- **Training regime:** FP32 precision
- **Optimizer:** Adam
- **Learning rate:** 1e-5
- **Epochs:** 5
- **Gradient accumulation steps:** 4
- **Backbone freezing:** VJEPA2 backbone parameters frozen; only the classification head is trained
- **Batch processing:** Gradient accumulation for an effectively larger batch size

#### Training Configuration

```python
import torch

# Freeze the backbone so only the classification head receives gradients
for param in model.vjepa2.parameters():
    param.requires_grad = False

# Optimize only the remaining trainable parameters (the classification head)
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-5)
```
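To make the gradient-accumulation procedure concrete, below is a minimal sketch of the fine-tuning loop, continuing from the `model` and `optimizer` defined above. `train_loader` is a hypothetical `DataLoader` yielding processed video tensors and integer labels, and the `pixel_values_videos` keyword follows the naming used by the VJEPA2 processor's output; adapt both to your own pipeline.

```python
import torch

accumulation_steps = 4  # matches the hyperparameters above
loss_fn = torch.nn.CrossEntropyLoss()

model.train()
for epoch in range(5):
    optimizer.zero_grad()
    for step, (pixel_values_videos, labels) in enumerate(train_loader):
        outputs = model(pixel_values_videos=pixel_values_videos.to(model.device))
        loss = loss_fn(outputs.logits, labels.to(model.device))
        # Scale the loss so accumulated gradients average over the window
        (loss / accumulation_steps).backward()
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()       # one update per 4 micro-batches
            optimizer.zero_grad()  # effective batch size = batch size x 4
```

Dividing the loss by the number of accumulation steps keeps gradient magnitudes comparable to those of a single large batch, which is why the effective batch size is the micro-batch size times 4.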
#### Speeds, Sizes, Times

- **Training time:** Depends on dataset size and hardware
- **GPU memory:** Optimized through gradient accumulation
- **Effective batch size:** Original batch size × 4 (due to gradient accumulation)

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

The model is evaluated on held-out test sets from the training dataset, with validation performed after each epoch.

#### Factors

Evaluation considers:

- Video content diversity
- Temporal complexity
- Visual quality variations
- Classification difficulty across different classes

#### Metrics

- **Primary metric:** Classification accuracy
- **Validation:** Per-epoch validation accuracy
- **Final evaluation:** Test set accuracy

### Results

The model's performance is monitored through:

- Training loss progression with gradient accumulation
- Validation accuracy per epoch
- Final test accuracy
- TensorBoard logging for comprehensive monitoring

## Model Examination

The model uses a frozen VJEPA2 backbone for feature extraction, with only the classification head being trained. This approach:

- Preserves pre-trained video understanding capabilities
- Reduces computational requirements
- Prevents overfitting on smaller datasets
- Enables efficient domain adaptation

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** NVIDIA GPU (CUDA-enabled)
- **Hours used:** Dependent on dataset size and training configuration
- **Training efficiency:** Optimized through gradient accumulation and backbone freezing
- **Carbon Emitted:** Reduced relative to full fine-tuning, since only the classification head is trained

## Technical Specifications

### Model Architecture and Objective

- **Base Architecture:** VJEPA2 (Video Joint Embedding Predictive Architecture)
- **Model Size:** ViT-Large with 16-frame processing capability
- **Input Resolution:** 256x256 pixels
- **Temporal Sampling:** 16 frames per video
- **Classification Head:** Custom layer adapted to the target classes
- **Objective:** Cross-entropy loss for multi-class classification

### Compute Infrastructure

#### Hardware

- **GPU:** NVIDIA CUDA-compatible GPU
- **Memory:** Sufficient VRAM for the model and gradient accumulation
- **Compute Capability:** CUDA support required

#### Software

- **Framework:** PyTorch
- **Library:** Transformers (Hugging Face), which provides the `VJEPA2VideoProcessor` and `VJEPA2ForVideoClassification` classes
- **Dependencies:** torch, transformers

## Citation

**BibTeX:**

```bibtex
@article{assran2025vjepa2,
  title={V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning},
  author={Assran, Mahmoud and Bardes, Adrien and Fan, David and Garrido, Quentin and Howes, Russell and others},
  journal={arXiv preprint arXiv:2506.09985},
  year={2025}
}
```

**APA:**

Assran, M., Bardes, A., Fan, D., Garrido, Q., Howes, R., et al. (2025). V-JEPA 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985.

## Glossary

- **VJEPA2:** Video Joint Embedding Predictive Architecture, second version
- **Gradient Accumulation:** Technique to simulate larger batch sizes by accumulating gradients over multiple steps
- **Backbone Freezing:** Training strategy where pre-trained layers are frozen and only task-specific layers are trained
- **Video Classification:** Task of assigning categorical labels to video sequences

## More Information

For more details on the VJEPA2 architecture and training methodology, refer to the original paper and the base model documentation.

## Model Card Authors

Yiqiao Yin

## Model Card Contact

For questions or issues regarding this model, please contact the model author or create an issue in the model repository.