arxiv:2506.08967

Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model

Published on Jun 10

Authors:

Abstract

A fully end-to-end Large Audio-Language Model (Step-Audio-AQAA) with a dual-codebook audio tokenizer, a 130-billion-parameter backbone LLM, and a neural vocoder outperforms existing models in audio response generation and speech control tasks.

AI-generated summary

Large Audio-Language Models (LALMs) have significantly advanced intelligent human-computer interaction, yet their reliance on text-based outputs limits their ability to generate natural speech responses directly, hindering seamless audio interactions. To address this, we introduce Step-Audio-AQAA, a fully end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks. The model integrates a dual-codebook audio tokenizer for linguistic and semantic feature extraction, a 130-billion-parameter backbone LLM, and a neural vocoder for high-fidelity speech synthesis. Our post-training approach employs interleaved token output of text and audio to enhance semantic coherence and combines Direct Preference Optimization (DPO) with model merging to improve performance. Evaluations on the StepEval-Audio-360 benchmark demonstrate that Step-Audio-AQAA excels especially in speech control, outperforming state-of-the-art LALMs in key areas. This work contributes a promising solution for end-to-end LALMs and highlights the critical role of the token-based vocoder in enhancing overall performance for AQAA tasks.
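
The abstract names three post-training ingredients without detailing them: interleaved text and audio token output, DPO, and model merging. The following is a minimal sketch of each in plain Python, not the authors' implementation. The chunk sizes in `interleave`, the `beta` value, and the linear-interpolation form of `merge_weights` are illustrative assumptions; only the DPO objective follows the standard published formulation.

```python
import math
from typing import Dict, List, Sequence


def interleave(text: Sequence[str], audio: Sequence[str],
               n_text: int = 2, n_audio: int = 3) -> List[str]:
    """Alternate chunks of text tokens and audio tokens in one output stream.

    n_text and n_audio are hypothetical chunk sizes chosen for illustration.
    """
    out: List[str] = []
    t = a = 0
    while t < len(text) or a < len(audio):
        out.extend(text[t:t + n_text])      # next chunk of text tokens
        out.extend(audio[a:a + n_audio])    # next chunk of audio tokens
        t += n_text
        a += n_audio
    return out


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair: -log(sigmoid(margin)).

    Inputs are summed log-probabilities of the chosen and rejected responses
    under the trained policy and the frozen reference model.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def merge_weights(a: Dict[str, List[float]], b: Dict[str, List[float]],
                  alpha: float = 0.5) -> Dict[str, List[float]]:
    """Merge two checkpoints by linear interpolation of matching parameters.

    This is one common merging scheme, assumed here for illustration.
    """
    return {name: [alpha * x + (1.0 - alpha) * y
                   for x, y in zip(a[name], b[name])]
            for name in a}
```

When the policy and the reference model assign identical log-probabilities, the margin is zero and the loss equals log 2; the loss falls below that value as the policy comes to prefer the chosen response more strongly than the reference does.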
