AudioMCQ-Mixed-to-Strong
Overview
This repository contains the Mixed-to-Strong model checkpoint from our paper "Measuring Audio's Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models". This model demonstrates state-of-the-art performance on audio question-answering benchmarks through our novel audio-contribution-aware post-training approach.
Training Paradigm
The Mixed-to-Strong training paradigm follows a two-stage approach:
Stage 1: Supervised fine-tuning (SFT) on mixed audio-contribution data (weak + strong)
Stage 2: Reinforcement learning with GRPO on strong audio-contribution data
This paradigm exposes the model to both weak and strong audio-contribution samples during supervised fine-tuning, then applies reinforcement learning to the more challenging strong audio-contribution samples.
Model Details
- Base Model: Qwen2.5-Omni
- Training Data: AudioMCQ Dataset (571k samples)
- Training Stages:
  - Stage 1 (SFT): Mixed audio-contribution subset
  - Stage 2 (GRPO): Strong audio-contribution subset
- System Prompt: "You are an audio understanding model that answers multiple choice questions based on audio content."
Usage
Model loading and inference are identical to Qwen2.5-Omni; please refer to the official Qwen2.5-Omni documentation. A minimal loading sketch is shown below.
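The sketch assumes the Qwen2.5-Omni classes shipped in recent transformers releases and that this checkpoint is published under the repository id shown (adjust the id if it differs):

```python
# Minimal loading sketch; class names follow the Qwen2.5-Omni transformers integration.
import torch
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

MODEL_ID = "inclusionAI/AudioMCQ-Mixed-To-Strong"  # assumed repository id

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_ID)
```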
Input Format
The evaluation input prompt structure is:
[Question] Please choose the answer from the following options: ['Option1', 'Option2', 'Option3', 'Option4']. Output the final answer in <answer> </answer>.
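As an illustration, here is a small helper (hypothetical, not part of the release) that builds this prompt from a question and its options, assuming `[Question]` stands for the raw question text:

```python
def format_mcq_prompt(question: str, options: list[str]) -> str:
    """Build the evaluation prompt described above."""
    options_str = "[" + ", ".join(f"'{opt}'" for opt in options) + "]"
    return (
        f"{question} Please choose the answer from the following options: "
        f"{options_str}. Output the final answer in <answer> </answer>."
    )

# e.g. format_mcq_prompt("Which instrument carries the melody?",
#                        ["Piano", "Violin", "Guitar", "Drums"])
```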
Example Usage
1. Load the model following the Qwen2.5-Omni documentation.
2. Apply the system prompt: "You are an audio understanding model that answers multiple choice questions based on audio content."
3. Format your question with the input structure above.
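Below is a hedged end-to-end sketch of these steps, reusing `model`, `processor`, and `format_mcq_prompt` from the snippets above. It follows the official Qwen2.5-Omni usage example (`process_mm_info` comes from the qwen-omni-utils package); the audio path, question, and options are placeholders:

```python
import re

from qwen_omni_utils import process_mm_info  # pip install qwen-omni-utils

SYSTEM_PROMPT = ("You are an audio understanding model that answers multiple "
                 "choice questions based on audio content.")

question = "Which instrument carries the melody?"  # placeholder
options = ["Piano", "Violin", "Guitar", "Drums"]   # placeholder
prompt = format_mcq_prompt(question, options)

conversation = [
    {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
    {"role": "user", "content": [
        {"type": "audio", "audio": "example.wav"},  # placeholder audio path
        {"type": "text", "text": prompt},
    ]},
]

# Prepare inputs following the Qwen2.5-Omni usage example.
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
inputs = processor(text=text, audio=audios, images=images, videos=videos,
                   return_tensors="pt", padding=True)
inputs = inputs.to(model.device)

# return_audio=False skips speech synthesis and returns only text token ids.
text_ids = model.generate(**inputs, max_new_tokens=256, return_audio=False)

# Drop the prompt tokens, decode, and pull the option out of the <answer> tags.
gen_ids = text_ids[:, inputs["input_ids"].shape[1]:]
output = processor.batch_decode(gen_ids, skip_special_tokens=True)[0]
match = re.search(r"<answer>\s*(.*?)\s*</answer>", output, re.DOTALL)
print(match.group(1) if match else output)
```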
Performance
The Mixed-to-Strong model achieves superior performance across multiple benchmarks:
- MMAU-test-mini: State-of-the-art accuracy on general audio understanding
- MMAR: Strong performance on music understanding tasks
- MMSU: Excellent results on speech understanding
- Strong Audio-Contribution Splits: Significantly improved performance on challenging samples requiring deep audio understanding
For detailed performance metrics and comparisons, please refer to our paper.
Related Resources
- AudioMCQ Dataset: https://huggingface.co/datasets/inclusionAI/AudioMCQ
- Weak-to-Strong Checkpoint: https://huggingface.co/inclusionAI/AudioMCQ-Weak-To-Strong
- Paper: arXiv:2509.21060
- DCASE 2025 Challenge: http://dcase.community/challenge2025/
Citation
If you find this model useful in your research, please cite:
@article{he2025audiomcq,
  title={Measuring Audio's Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models},
  author={He, Haolin and others},
  journal={arXiv preprint arXiv:2509.21060},
  year={2025}
}
Contact
- Haolin He: [email protected]
Acknowledgements
We thank the organizers of DCASE 2025 and the research community for their valuable feedback and support.