SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec

Model Overview

SecoustiCodec is a streaming speech codec that reconstructs speech at ultra-low bitrates (0.27-1 kbps) with a single codebook. The model introduces several innovations:

  • 🧠 Cross-modal alignment: Aligns text and speech in a joint multimodal frame-level space
  • 🔍 Semantic-paralinguistic disentanglement: Separates linguistic content from speaker characteristics
  • ⚡ Streaming support: Real-time processing capabilities
  • 📊 Efficient quantization: VAE+FSQ approach solves token distribution problems
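
To make the quantization bullet concrete, here is a minimal NumPy sketch of finite scalar quantization (FSQ) with a single implicit codebook. It is not the SecoustiCodec implementation: the level counts in `LEVELS`, the function names, and the 50 Hz frame rate in the usage note are hypothetical, and the VAE stage and the straight-through gradient used in training are omitted.

```python
import numpy as np

# Hypothetical per-dimension level counts (odd, so rounding is symmetric around 0).
LEVELS = np.array([5, 5, 5, 5])


def fsq_quantize(z, levels=LEVELS):
    """Bound each latent dimension with tanh, then round to the nearest integer level."""
    half = (levels - 1) / 2.0
    bounded = np.tanh(z) * half                 # values in (-half, half)
    return np.round(bounded).astype(np.int64)   # integers in [-half, half]


def codes_to_index(q, levels=LEVELS):
    """Pack per-dimension codes into one token id (mixed-radix encoding)."""
    half = (levels - 1) // 2
    digits = q + half                           # shift to [0, L-1]
    bases = np.concatenate(([1], np.cumprod(levels[:-1])))
    return (digits * bases).sum(axis=-1)


def index_to_codes(idx, levels=LEVELS):
    """Invert codes_to_index: recover per-dimension codes from token ids."""
    half = (levels - 1) // 2
    bases = np.concatenate(([1], np.cumprod(levels[:-1])))
    digits = (np.asarray(idx)[..., None] // bases) % levels
    return digits - half


def bitrate_bps(frame_rate_hz, levels=LEVELS):
    """Bits per second when one token is emitted per frame."""
    return frame_rate_hz * float(np.log2(np.prod(levels)))
```

With these assumed levels the implicit codebook has 5^4 = 625 entries, so each frame costs log2(625) bits; calling `bitrate_bps(50)` gives the bitrate for an assumed 50 frames per second. Because every combination of levels is a valid code, there is no learned codebook that can collapse, which is the usual motivation for FSQ.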

Architecture Overview

[Figure: Model Architecture]

Acknowledgments

  • We use HiFiGAN for efficient waveform generation.
  • Our implementation draws on MIMICodec.

Citation

@article{qiang2025secousticodec,
  title={SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec},
  author={Chunyu Qiang and Haoyu Wang and Cheng Gong and Tianrui Wang and Ruibo Fu and Tao Wang and Ruilong Chen and Jiangyan Yi and Zhengqi Wen and Chen Zhang and Longbiao Wang and Jianwu Dang and Jianhua Tao},
  journal={arXiv preprint arXiv:2508.02849},
  year={2025}
}