AV-Reasoner
Model Summary
AV-Reasoner is an audio-visual multimodal large language model (MLLM) built on Ola-7b and trained with GRPO and curriculum learning. It serves as a strong baseline for the CG-AV-Counting benchmark. It shows robust audio-visual understanding, temporal grounding, and spatial grounding, among other capabilities, and achieves state-of-the-art results across multiple benchmarks.
Train
To train a model similar to AV-Reasoner with GRPO, see the codebase at https://github.com/AV-Reasoner/AV-Reasoner.
Use
You can refer to the Ola tutorial to get started with AV-Reasoner.
- Download the speech encoders from https://huggingface.co/THUdyh/Ola_speech_encoders.
- Replace the path in config.json with the local path to the downloaded speech encoders.
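The path replacement in the second step can also be scripted. The following is a minimal sketch, not part of the official tutorial: `set_encoder_paths` is a hypothetical helper, and the config key name in the commented usage (`speech_encoder`) is an assumption, so check the keys in your own config.json before using it.

```python
import json
from pathlib import Path


def set_encoder_paths(config_path, updates):
    """Point existing entries of a JSON config at local files.

    `updates` maps a config key to a local path. Keys that are not already
    present in the config are rejected, so a misspelled key is not silently
    added next to the real one.
    """
    config_path = Path(config_path)
    config = json.loads(config_path.read_text(encoding="utf-8"))

    missing = [key for key in updates if key not in config]
    if missing:
        raise KeyError(f"keys not found in {config_path.name}: {missing}")

    for key, local_path in updates.items():
        config[key] = str(local_path)

    # Write the whole config back, leaving every other entry untouched.
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Example call (the key name and both paths are placeholders):
# set_encoder_paths(
#     "AV-Reasoner/config.json",
#     {"speech_encoder": "/path/to/Ola_speech_encoders/encoder-file"},
# )
```

Rejecting unknown keys is a deliberate choice here: because the exact key name is not stated above, failing loudly is safer than writing a new entry that the model loader would ignore.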