AV-Reasoner

Model Summary

AV-Reasoner is an audio-visual multimodal large language model (MLLM) built on Ola-7b and trained with GRPO and curriculum learning. It serves as a strong baseline for the CG-AV-Counting benchmark. It shows robust audio-visual understanding, temporal grounding, and spatial grounding, among other capabilities, and achieves state-of-the-art results across multiple benchmarks.

Train

To train a model similar to AV-Reasoner with GRPO, refer to the codebase at https://github.com/AV-Reasoner/AV-Reasoner.

Use

You can refer to the Ola tutorial to get started with AV-Reasoner.

Download the speech encoder at https://huggingface.co/THUdyh/Ola_speech_encoders.

Replace the speech encoder path in config.json with the local path of the downloaded speech encoders.
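The config edit above can be scripted. The following is a minimal sketch, not the official setup procedure: the helper name `set_config_paths`, the key name `speech_encoder`, and the example paths are assumptions for illustration, so check your own config.json for the actual key that holds the encoder path. The encoder files themselves could be fetched beforehand, for example with `huggingface_hub.snapshot_download(repo_id="THUdyh/Ola_speech_encoders")`.

```python
import json
from pathlib import Path


def set_config_paths(config_path, overrides):
    """Overwrite selected top-level keys in a JSON config file.

    `overrides` maps existing key names to new path values. A KeyError is
    raised for unknown keys so that a typo does not silently add a new entry.
    Returns the updated config as a dict.
    """
    config_path = Path(config_path)
    config = json.loads(config_path.read_text(encoding="utf-8"))
    for key, value in overrides.items():
        if key not in config:
            raise KeyError(f"{key!r} not found in {config_path}")
        config[key] = str(value)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Example call (hypothetical key name and placeholder paths):
# set_config_paths(
#     "AV-Reasoner-7B/config.json",
#     {"speech_encoder": "/path/to/local/speech_encoder"},
# )
```

Refusing unknown keys is a deliberate choice here: if the key name assumed above does not match the one in your config.json, the script fails loudly instead of leaving the original path in place.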
