# ClimbAI V2: Video-Language Model for Bouldering Proficiency Analysis

Model Version: V2

## Model Description
ClimbAI integrates:
- Language Model: HuggingFaceTB/SmolLM2-135M-Instruct with LoRA adapters
- Vision Encoder: facebook/timesformer-base-finetuned-k600 with LoRA adapters
- Custom Video Adapter: AttentiveProjector with multi-head attention for view integration
## Key Features
- Multi-view support: The architecture accepts multiple camera views; this checkpoint is configured for a single view
- Temporal modeling: Analyzes 32 frames per video
- Proficiency assessment: Classifies performance levels (Novice, Early Expert, Intermediate Expert, Late Expert)
- Sport agnostic: Trained on multiple sports (basketball, cooking, dance, bouldering, soccer, music)
## Version 2 Features
- Vision LoRA: Applied LoRA adapters to TimesFormer vision encoder for better video understanding
- Flexible Frame Count: Supports different numbers of frames through time-embedding interpolation
- Enhanced Sampling: Efficient segment-based frame sampling for better temporal coverage
- Dual LoRA: Both LLM and Vision encoder use LoRA for efficient fine-tuning
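The "Enhanced Sampling" feature above can be illustrated with a minimal sketch: the clip is divided into as many equal segments as frames are needed, and one frame is taken from the middle of each segment, so the sampled frames cover the whole video evenly. The function name and padding strategy here are illustrative assumptions, not the model's actual API.

```python
# Hypothetical sketch of segment-based frame sampling: split the video into
# `num_frames` equal segments and take the middle frame of each segment.
def sample_frame_indices(total_frames: int, num_frames: int = 32) -> list[int]:
    """Pick `num_frames` indices spread evenly across `total_frames`."""
    if total_frames <= num_frames:
        # Short clip: keep every frame, then repeat the last one to pad.
        pad = [total_frames - 1] * (num_frames - total_frames)
        return list(range(total_frames)) + pad
    seg_len = total_frames / num_frames
    # Middle of each segment gives even temporal coverage.
    return [int(seg_len * i + seg_len / 2) for i in range(num_frames)]
```

For a 128-frame clip sampled down to 32 frames, each segment is 4 frames long and the indices land at 2, 6, 10, and so on, instead of clustering at the start of the video as naive truncation would.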
## Model Architecture

Video Input (B, V, T, C, H, W) → TimesFormer (+LoRA) → AttentiveProjector → LLM (+LoRA) → Text Analysis
Where:
- B: Batch size
- V: Number of views (1)
- T: Number of frames (32)
- C, H, W: Channel, Height, Width
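The tensor-shape flow above can be traced with a toy sketch. The hidden sizes and the plain linear projection standing in for the AttentiveProjector are illustrative assumptions (768 is TimesFormer-base's feature size; 576 is assumed here for SmolLM2-135M's embedding size), not the model's exact internals.

```python
# Toy walk-through of the shapes in the pipeline above, using zero arrays.
import numpy as np

B, V, T, C, H, W = 2, 1, 32, 3, 224, 224
video = np.zeros((B, V, T, C, H, W), dtype=np.float32)

# 1) Fold views into the batch so the vision encoder sees (B*V, T, C, H, W).
frames = video.reshape(B * V, T, C, H, W)

# 2) Vision encoder stand-in: one 768-d feature vector per frame.
vision_feats = np.zeros((B * V, T, 768), dtype=np.float32)

# 3) Projector stand-in: a linear map from the vision feature size to the
#    (assumed) LLM embedding size, yielding video tokens for the LLM.
proj = np.zeros((768, 576), dtype=np.float32)
video_tokens = vision_feats.reshape(B, V * T, 768) @ proj  # (B, V*T, 576)

print(frames.shape, vision_feats.shape, video_tokens.shape)
```

With V = 1 and T = 32, the LLM receives 32 video tokens per sample, prepended (in the real model, via the AttentiveProjector) to the text prompt embeddings.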
## License
This model is released under the Apache 2.0 License.
## Acknowledgments
- Base LLM: HuggingFaceTB/SmolLM2-135M-Instruct
- Vision Encoder: facebook/timesformer-base-finetuned-k600
- Built with 🤗 Transformers and PyTorch