ClimbAI V2: Video-Language Model for Bouldering Proficiency Analysis

Model Version: V2

Model Description

ClimbAI integrates:

  • Language Model: HuggingFaceTB/SmolLM2-135M-Instruct with LoRA adapters
  • Vision Encoder: facebook/timesformer-base-finetuned-k600 with LoRA adapters
  • Custom Video Adapter: AttentiveProjector with multi-head attention for view integration

Key Features

  • Multi-view support: Architecture can process multiple camera views simultaneously (this checkpoint is configured for 1 view)
  • Temporal modeling: Analyzes 32 frames per video
  • Proficiency assessment: Classifies performance levels (Novice, Early Expert, Intermediate Expert, Late Expert)
  • Sport agnostic: Trained on multiple sports (basketball, cooking, dance, bouldering, soccer, music)
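The temporal modeling above relies on drawing a fixed number of frames (32 here) from clips of varying length. A minimal sketch of segment-based sampling, assuming a strategy of taking the centre frame of each of N equal segments (the function name and exact strategy are illustrative, not the model's actual code):

```python
import numpy as np

def sample_frame_indices(num_video_frames: int, num_samples: int = 32) -> np.ndarray:
    """Pick one frame from the centre of each of `num_samples` equal
    segments, giving uniform temporal coverage of the whole clip."""
    # Segment boundaries: num_samples + 1 evenly spaced points over the clip.
    boundaries = np.linspace(0, num_video_frames, num_samples + 1)
    # Midpoint of each segment, clipped to valid frame indices.
    centers = ((boundaries[:-1] + boundaries[1:]) / 2).astype(int)
    return np.clip(centers, 0, num_video_frames - 1)

# Example: sample 32 frame indices from a 300-frame video.
indices = sample_frame_indices(300, 32)
```

Sampling one frame per segment (rather than the first 32 frames) spreads coverage over the entire attempt, which matters for judging a full climb.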

Version 2 Features

  • Vision LoRA: Applied LoRA adapters to TimesFormer vision encoder for better video understanding
  • Flexible Frame Count: Supports a variable number of frames via time-embedding interpolation
  • Enhanced Sampling: Efficient segment-based frame sampling for better temporal coverage
  • Dual LoRA: Both LLM and Vision encoder use LoRA for efficient fine-tuning
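The time-embedding interpolation behind the flexible frame count can be sketched as a simple linear resampling of the pretrained temporal position embeddings. This is an illustrative numpy version under that assumption, not the model's exact implementation:

```python
import numpy as np

def interpolate_time_embeddings(time_embed: np.ndarray, new_t: int) -> np.ndarray:
    """Linearly resample temporal position embeddings of shape (T, D)
    to (new_t, D), so an encoder pretrained with T frames can accept
    a different frame count."""
    t, d = time_embed.shape
    old_pos = np.linspace(0.0, 1.0, t)
    new_pos = np.linspace(0.0, 1.0, new_t)
    # Interpolate each embedding dimension independently along time.
    return np.stack(
        [np.interp(new_pos, old_pos, time_embed[:, i]) for i in range(d)],
        axis=1,
    )

# Example: stretch 8-frame embeddings to cover 32 frames.
emb8 = np.random.randn(8, 64)
emb32 = interpolate_time_embeddings(emb8, 32)  # shape (32, 64)
```

Because the endpoints are preserved, the first and last frame embeddings are unchanged; only the intermediate positions are interpolated.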

Model Architecture

Video Input (B, V, T, C, H, W) β†’ TimesFormer(+LoRA) β†’ AttentiveProjector β†’ LLM(+LoRA) β†’ Text Analysis

Where:

  • B: Batch size
  • V: Number of views (1)
  • T: Number of frames (32)
  • C, H, W: Channel, Height, Width
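The AttentiveProjector's role in this pipeline, fusing per-view encoder features into tokens for the LLM, can be sketched as attention pooling across the view axis. This is a hypothetical single-head numpy version (the real module uses multi-head attention and learned projections):

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attentive_view_pool(feats: np.ndarray, query: np.ndarray) -> np.ndarray:
    """feats: (B, V, N, D) per-view token features from the vision encoder.
    query: (D,) learned query scoring each view's tokens.
    Returns (B, N, D): view-integrated tokens to feed the LLM."""
    b, v, n, d = feats.shape
    scores = feats @ query / np.sqrt(d)       # (B, V, N)
    weights = softmax(scores, axis=1)         # attend over the view axis V
    return (weights[..., None] * feats).sum(axis=1)

# Example: B=2 clips, V=1 view, N=16 tokens, D=32 dims.
feats = np.random.randn(2, 1, 16, 32)
query = np.random.randn(32)
fused = attentive_view_pool(feats, query)  # (2, 16, 32)
```

With V=1 (this checkpoint's configuration) the pooling reduces to a pass-through of the single view; the mechanism only does real work when multiple views are present.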

License

This model is released under the Apache 2.0 License.

Acknowledgments

  • Base LLM: HuggingFaceTB/SmolLM2-135M-Instruct
  • Vision Encoder: facebook/timesformer-base-finetuned-k600
  • Built with πŸ€— Transformers and PyTorch