
AV-Reasoner

hf_checkpoint hf_data arXiv Webpage

Model Summary

AV-Reasoner is an audio-visual MLLM built on Ola-7b and trained with GRPO and curriculum learning. It serves as a strong baseline for the CG-AV-Counting benchmark, and its capabilities in audio-visual understanding, temporal grounding, and spatial grounding, among others, yield state-of-the-art results across multiple benchmarks.

Train

If you want to train a model similar to AV-Reasoner with GRPO, refer to the codebase at https://github.com/AV-Reasoner/AV-Reasoner.

Use

You can refer to the Ola tutorial to get started with AV-Reasoner.

  1. Download the speech encoders from https://huggingface.co/THUdyh/Ola_speech_encoders.
  2. Replace the speech-encoder path in config.json with the local path to the downloaded encoders (see the sketch after this list).
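A minimal sketch of these two steps is shown below. It assumes you use huggingface_hub to fetch the encoders, and it assumes config.json stores the encoder location under a key named speech_encoder_path; the actual key name and checkpoint path may differ in your checkout, so verify them against your local config.json.

```python
# Sketch: download the Ola speech encoders and point config.json at them.
# The config key name ("speech_encoder_path") and the checkpoint directory
# below are assumptions; adjust them to match your local files.
import json
from pathlib import Path

from huggingface_hub import snapshot_download

# 1. Download the speech encoders to a local directory.
encoder_dir = snapshot_download(
    repo_id="THUdyh/Ola_speech_encoders",
    local_dir="./ola_speech_encoders",
)

# 2. Rewrite the speech-encoder path in the model's config.json.
config_path = Path("./AV-Reasoner-7B/config.json")  # path to your local checkpoint
config = json.loads(config_path.read_text())
config["speech_encoder_path"] = str(Path(encoder_dir).resolve())  # assumed key name
config_path.write_text(json.dumps(config, indent=2))
print(f"Updated {config_path} to use encoders at {encoder_dir}")
```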

Model tree for lulidong/AV-Reasoner-7B

Base model: Qwen/Qwen2.5-7B → fine-tuned into THUdyh/Ola-7b → fine-tuned into this model.