# ClimbAI V2: Video-Language Model for Bouldering Proficiency Analysis

Model Version: V2

## Model Description
ClimbAI integrates:
- Language Model: HuggingFaceTB/SmolLM2-135M-Instruct with LoRA adapters
- Vision Encoder: facebook/timesformer-base-finetuned-k600 with LoRA adapters
- Custom Video Adapter: AttentiveProjector with multi-head attention for view integration
## Key Features
- Multi-view support: The architecture accepts multiple camera views; this checkpoint is configured for a single view
- Temporal modeling: Analyzes 32 frames per video
- Proficiency assessment: Classifies performance levels (Novice, Early Expert, Intermediate Expert, Late Expert)
- Sport agnostic: Trained on multiple sports (basketball, cooking, dance, bouldering, soccer, music)
## Version 2 Features
- Vision LoRA: Applied LoRA adapters to TimesFormer vision encoder for better video understanding
- Flexible Frame Count: Supports different numbers of frames through time-embedding interpolation
- Enhanced Sampling: Efficient segment-based frame sampling for better temporal coverage
- Dual LoRA: Both LLM and Vision encoder use LoRA for efficient fine-tuning
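The "Enhanced Sampling" feature above can be illustrated with a minimal sketch: the clip is divided into as many equal segments as frames are needed, and one frame is taken from the middle of each segment, so the sampled frames cover the whole video evenly. The function name and padding strategy here are illustrative assumptions, not the model's actual API.

```python
# Hypothetical sketch of segment-based frame sampling: split the video into
# `num_frames` equal segments and take the middle frame of each segment.
def sample_frame_indices(total_frames: int, num_frames: int = 32) -> list[int]:
    """Pick `num_frames` indices spread evenly across `total_frames`."""
    if total_frames <= num_frames:
        # Short clip: keep every frame, then repeat the last one to pad.
        pad = [total_frames - 1] * (num_frames - total_frames)
        return list(range(total_frames)) + pad
    seg_len = total_frames / num_frames
    # Middle of each segment gives even temporal coverage.
    return [int(seg_len * i + seg_len / 2) for i in range(num_frames)]
```

For a 128-frame clip sampled down to 32 frames, each segment is 4 frames long and the indices land at 2, 6, 10, and so on, instead of clustering at the start of the video as naive truncation would.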
## Model Architecture

Video Input (B, V, T, C, H, W) → TimesFormer (+LoRA) → AttentiveProjector → LLM (+LoRA) → Text Analysis
Where:
- B: Batch size
- V: Number of views (1)
- T: Number of frames (32)
- C, H, W: Channel, Height, Width
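The tensor-shape flow above can be traced with a toy sketch. The hidden sizes and the plain linear projection standing in for the AttentiveProjector are illustrative assumptions (768 is TimesFormer-base's feature size; 576 is assumed here for SmolLM2-135M's embedding size), not the model's exact internals.

```python
# Toy walk-through of the shapes in the pipeline above, using zero arrays.
import numpy as np

B, V, T, C, H, W = 2, 1, 32, 3, 224, 224
video = np.zeros((B, V, T, C, H, W), dtype=np.float32)

# 1) Fold views into the batch so the vision encoder sees (B*V, T, C, H, W).
frames = video.reshape(B * V, T, C, H, W)

# 2) Vision encoder stand-in: one 768-d feature vector per frame.
vision_feats = np.zeros((B * V, T, 768), dtype=np.float32)

# 3) Projector stand-in: a linear map from the vision feature size to the
#    (assumed) LLM embedding size, yielding video tokens for the LLM.
proj = np.zeros((768, 576), dtype=np.float32)
video_tokens = vision_feats.reshape(B, V * T, 768) @ proj  # (B, V*T, 576)

print(frames.shape, vision_feats.shape, video_tokens.shape)
```

With V = 1 and T = 32, the LLM receives 32 video tokens per sample, prepended (in the real model, via the AttentiveProjector) to the text prompt embeddings.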
## License
This model is released under the Apache 2.0 License.
## Acknowledgments
- Base LLM: HuggingFaceTB/SmolLM2-135M-Instruct
- Vision Encoder: facebook/timesformer-base-finetuned-k600
- Built with 🤗 Transformers and PyTorch