AudioMCQ-Mixed-to-Strong


Overview

This repository contains the Mixed-to-Strong model checkpoint from our paper "Measuring Audio's Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models". This model demonstrates state-of-the-art performance on audio question-answering benchmarks through our novel audio-contribution-aware post-training approach.

Training Paradigm

The Mixed-to-Strong training paradigm follows a two-stage approach:

Stage 1: SFT on mixed audio-contribution data (weak + strong)
Stage 2: GRPO (RL) on strong audio-contribution data

This paradigm leverages both weak and strong audio-contribution samples during supervised fine-tuning, followed by reinforcement learning on challenging strong audio-contribution samples to achieve optimal performance.
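To make "audio contribution" concrete, below is a minimal sketch of one plausible labeling rule, consistent with the paper's title but not necessarily its exact criterion. The predicates answer_with_audio and answer_without_audio are hypothetical placeholders for running the model with and without the audio input:

# Hypothetical sketch of audio-contribution labeling; the paper's exact
# criterion may differ. answer_with_audio / answer_without_audio are
# placeholder predicates that run the model with and without the audio.
def label_contribution(model, sample):
    with_audio = answer_with_audio(model, sample) == sample["answer"]
    without_audio = answer_without_audio(model, sample) == sample["answer"]
    # Strong contribution: the audio is required to answer correctly.
    return "strong" if with_audio and not without_audio else "weak"

strong = [s for s in dataset if label_contribution(model, s) == "strong"]
sft_data = dataset   # Stage 1 (SFT): mixed weak + strong samples
grpo_data = strong   # Stage 2 (GRPO): strong audio-contribution samples only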

Model Details

  • Base Model: Qwen2.5-Omni
  • Training Data: AudioMCQ Dataset (571k samples)
  • Training Stages:
    • Stage 1 (SFT): Mixed audio-contribution subset
    • Stage 2 (GRPO): Strong audio-contribution subset
  • System Prompt: "You are an audio understanding model that answers multiple choice questions based on audio content."
  • Model Size: ~11B parameters (BF16, Safetensors)

Usage

Our model loading and usage methods are identical to those of Qwen2.5-Omni. Please refer to the official documentation.

Input Format

The evaluation input prompt structure is:

[Question] Please choose the answer from the following options: ['Option1', 'Option2', 'Option3', 'Option4']. Output the final answer in <answer> </answer>.
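For example, a small helper (hypothetical, not part of the released code) that renders this structure for a given question:

def build_prompt(question, options):
    # Render the evaluation prompt in the structure shown above.
    opts = ", ".join(f"'{opt}'" for opt in options)
    return (f"{question} Please choose the answer from the following options: "
            f"[{opts}]. Output the final answer in <answer> </answer>.")

print(build_prompt("Which instrument is playing?",
                   ["Piano", "Violin", "Guitar", "Drums"]))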

Example Usage

# 1. Load the model and processor following the Qwen2.5-Omni documentation
# 2. Apply the system prompt: "You are an audio understanding model that answers multiple choice questions based on audio content."
# 3. Format your question with the input structure above (full sketch below)
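Putting these steps together, a minimal sketch using the Qwen2.5-Omni interface in transformers (class names may vary with your transformers version; the checkpoint path, audio file, and question below are placeholders):

import torch
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info  # utility shipped with Qwen2.5-Omni

MODEL_PATH = "path/to/AudioMCQ-Mixed-to-Strong"  # placeholder checkpoint path
SYSTEM_PROMPT = ("You are an audio understanding model that answers "
                 "multiple choice questions based on audio content.")

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    MODEL_PATH, torch_dtype=torch.bfloat16, device_map="auto")
processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_PATH)

question = "Which instrument is playing?"  # placeholder question
prompt = (f"{question} Please choose the answer from the following options: "
          "['Piano', 'Violin', 'Guitar', 'Drums']. "
          "Output the final answer in <answer> </answer>.")

conversation = [
    {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
    {"role": "user", "content": [
        {"type": "audio", "audio": "sample.wav"},  # placeholder audio file
        {"type": "text", "text": prompt},
    ]},
]

text = processor.apply_chat_template(conversation, add_generation_prompt=True,
                                     tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
inputs = processor(text=text, audio=audios, images=images, videos=videos,
                   return_tensors="pt", padding=True)
inputs = inputs.to(model.device).to(model.dtype)

# return_audio=False: we only need the text answer, not synthesized speech
out_ids = model.generate(**inputs, use_audio_in_video=False, return_audio=False)
answer = processor.batch_decode(out_ids, skip_special_tokens=True)[0]
print(answer)  # the chosen option appears inside <answer> ... </answer>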

Performance

The Mixed-to-Strong model achieves superior performance across multiple benchmarks:

  • MMAU-test-mini: State-of-the-art accuracy on general audio understanding
  • MMAR: Strong performance on multi-domain audio reasoning (speech, sound, music, and their mixtures)
  • MMSU: Excellent results on speech understanding
  • Strong Audio-Contribution Splits: Significantly improved performance on challenging samples requiring deep audio understanding

For detailed performance metrics and comparisons, please refer to our paper.

Related Resources

  • Paper: arXiv:2509.21060
  • Training Data: AudioMCQ dataset (571k samples)
  • Base Model: Qwen2.5-Omni
  • Challenge: DCASE 2025

Citation

If you find this model useful in your research, please cite:

@article{he2025audiomcq,
  title={Measuring Audio's Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models},
  author={He, Haolin and others},
  journal={arXiv preprint arXiv:2509.21060},
  year={2025}
}

Contact

Acknowledgements

We thank the organizers of DCASE 2025 and the research community for their valuable feedback and support.
