SecoustiCodec: Cross-Modal Aligned Streaming Single-Codebook Speech Codec

Resources

Model Overview

SecoustiCodec is a streaming speech codec that achieves good speech reconstruction at ultra-low bitrates (0.27-1 kbps). The model introduces several innovations:

🧠 Cross-modal alignment: Aligns text and speech in a joint multimodal frame-level space
🔍 Semantic-paralinguistic disentanglement: Separates linguistic content from speaker characteristics
⚡ Streaming support: Supports real-time processing
📊 Efficient quantization: A VAE+FSQ approach addresses token distribution problems
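
To make the quantization item above more concrete, here is a minimal sketch of finite scalar quantization (FSQ) with a single implied codebook. It is not SecoustiCodec's implementation: the function names, the level configuration `[5, 5, 5]`, and the 50 frames-per-second rate are all hypothetical values chosen for illustration, and the VAE part is omitted. Each latent dimension is bounded with `tanh` and rounded to a small integer grid; the per-dimension codes are then packed into one token index, so the codebook size is the product of the levels.

```python
import math

def fsq_quantize(z, levels):
    """Snap each latent value to one of levels[i] integer codes (0..levels[i]-1)."""
    codes = []
    for value, n_levels in zip(z, levels):
        half = (n_levels - 1) / 2
        # tanh bounds the value to (-1, 1); scaling maps it to (-half, half).
        bounded = math.tanh(value) * half
        # Shift to (0, n_levels - 1) and round to the nearest grid point.
        codes.append(int(round(bounded + half)))
    return codes

def codes_to_token(codes, levels):
    """Pack per-dimension codes into a single token index (mixed radix)."""
    token = 0
    for code, n_levels in zip(codes, levels):
        token = token * n_levels + code
    return token

def token_to_codes(token, levels):
    """Inverse of codes_to_token."""
    codes = []
    for n_levels in reversed(levels):
        codes.append(token % n_levels)
        token //= n_levels
    return list(reversed(codes))

def bitrate_bps(levels, frames_per_second):
    """Bitrate of a single-codebook token stream: frame rate times bits per token."""
    codebook_size = math.prod(levels)
    return frames_per_second * math.log2(codebook_size)

# Hypothetical configuration, for illustration only.
levels = [5, 5, 5]                      # implied codebook size: 5 * 5 * 5 = 125
codes = fsq_quantize([0.0, 10.0, -10.0], levels)
token = codes_to_token(codes, levels)
rate = bitrate_bps(levels, frames_per_second=50)
```

Because every grid point is a valid code, there is no learned codebook to look up, and the bitrate follows directly from the frame rate and the number of levels.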

Architecture Overview

Acknowledgments

We used HiFiGAN for efficient waveform generation.
We referred to MIMICodec for our implementation.

Citation

@article{qiang2025secousticodec,
  title={SecoustiCodec: Cross-Modal Aligned Streaming Single-Codebook Speech Codec},
  author={Chunyu Qiang and Haoyu Wang and Cheng Gong and Tianrui Wang and Ruibo Fu and Tao Wang and Ruilong Chen and Jiangyan Yi and Zhengqi Wen and Chen Zhang and Longbiao Wang and Jianwu Dang and Jianhua Tao},
  journal={arXiv preprint arXiv:2508.02849},
  year={2025}
}