EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs
GitHub | Paper | Online Demo (8B)
Model Description
EchoX is a speech-to-speech large language model that addresses the acoustic-semantic gap. This is the 3B version. By introducing Echo Training, EchoX integrates semantic and acoustic learning, which mitigates the degradation of reasoning ability observed in existing speech-based LLMs. It is trained on only 10k hours of data, yet delivers state-of-the-art results on knowledge-based question answering and speech interaction tasks.
Key Features
- Mitigates Acoustic-Semantic Gap in Speech-to-Speech LLMs
- Introduces Echo Training with a Novel Three-Stage Pipeline (S2T, T2C, Echo)
- Trained on Only 10k Hours of Curated Data, Ensuring Efficiency
- Achieves State-of-the-Art Performance in Knowledge-Based QA Benchmarks
- Preserves Reasoning and Knowledge Abilities for Interactive Speech Tasks
Usage
Load the EchoX model and run inference with your audio files as shown in the GitHub repository.
Citation
@inproceedings{zhang2026echox,
title={EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs},
author={Zhang, Yuhao and Du, Yuhao and Dai, Zhanchen and others},
booktitle={Proceedings of ICLR 2026},
year={2026},
url={https://arxiv.org/abs/XXXX.XXXX}
}