EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs

🐈‍⬛ GitHub ｜ 📃 Paper ｜ 📼 Online Demo (8B)

Model Description

EchoX is a speech-to-speech large language model designed to mitigate the acoustic-semantic gap. This repository hosts the 3B version. Through Echo Training, EchoX integrates semantic and acoustic learning, which mitigates the degradation of reasoning ability observed in existing speech-based LLMs. It is trained on only 10k hours of data and achieves state-of-the-art results on knowledge-based question answering and speech interaction tasks.

Key Features

Mitigates Acoustic-Semantic Gap in Speech-to-Speech LLMs

Introduces Echo Training with a Novel Three-Stage Pipeline (S2T, T2C, Echo)

Trained on Only 10k Hours of Curated Data, Ensuring Efficiency

Achieves State-of-the-Art Performance in Knowledge-Based QA Benchmarks

Preserves Reasoning and Knowledge Abilities for Interactive Speech Tasks

Usage

Load the EchoX model and run inference with your audio files as shown in the GitHub repository.
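Before running inference, it can help to confirm that an audio file is in the format the model expects. The sketch below uses only Python's standard `wave` module and does not call any EchoX API. The helper names `describe_wav` and `check_wav` are hypothetical, and the default of 16 kHz mono input is an assumption made for illustration; check the GitHub repository for the actual requirements.

```python
import wave


def describe_wav(path):
    """Return basic properties of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "sample_width_bytes": wav.getsampwidth(),
            "duration_seconds": frames / rate if rate else 0.0,
        }


def check_wav(path, expected_rate=16000, expected_channels=1):
    """List mismatches against an assumed input format.

    The defaults (16 kHz, mono) are illustrative assumptions,
    not values confirmed by the EchoX documentation.
    An empty list means the file matches the expected format.
    """
    info = describe_wav(path)
    problems = []
    if info["sample_rate"] != expected_rate:
        problems.append(
            f"sample rate is {info['sample_rate']} Hz, expected {expected_rate} Hz"
        )
    if info["channels"] != expected_channels:
        problems.append(
            f"file has {info['channels']} channel(s), expected {expected_channels}"
        )
    return problems
```

Calling `check_wav("question.wav")` on a hypothetical input file returns a list of human-readable mismatches, so a file that needs resampling or downmixing can be fixed before it is passed to the inference script.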

📖 Citation

@inproceedings{zhang2026echox,
  title={EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs},
  author={Zhang, Yuhao and Du, Yuhao and Dai, Zhanchen and others},
  booktitle={Proceedings of ICLR 2026},
  year={2026},
  url={https://arxiv.org/abs/XXXX.XXXX}
}