SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec
Resources
Model Overview
SecoustiCodec is a streaming speech codec that achieves good speech reconstruction quality at ultra-low bitrates (0.27-1 kbps). The model introduces several innovations:
- Cross-modal alignment: Aligns text and speech in a joint multimodal frame-level space
- Semantic-paralinguistic disentanglement: Separates linguistic content from speaker characteristics
- Streaming support: Real-time processing capabilities
- Efficient quantization: A VAE+FSQ approach addresses token distribution problems
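To make the quantization bullet more concrete, here is a minimal sketch of generic finite scalar quantization (FSQ) and of how a single-codebook bitrate follows from frame rate and codebook size. It is not SecoustiCodec's implementation: the function names, the level configuration `[5, 5, 5]`, and the 25 Hz frame rate are hypothetical values chosen for illustration, and the VAE stage and the straight-through gradient used in training are omitted.

```python
import math
import numpy as np

def fsq_quantize(z, levels):
    """Generic FSQ sketch: round each bounded channel to a fixed set of levels.

    z      : array of shape (..., D), unbounded latent values
    levels : D odd integers (hypothetical config), levels per channel
    """
    levels = np.asarray(levels)
    half = (levels - 1) // 2                  # 5 levels -> codes in {-2, ..., 2}
    bounded = np.tanh(z) * half               # squash each channel into (-half, half)
    codes = np.round(bounded).astype(int)     # nearest integer level per channel
    digits = codes + half                     # shift to {0, ..., L-1}
    # Mixed-radix encoding turns the D digits into one token index,
    # so the implicit codebook has prod(levels) entries.
    bases = np.concatenate(([1], np.cumprod(levels[:-1])))
    index = (digits * bases).sum(axis=-1)
    return codes, index

def bitrate_bps(frame_rate_hz, levels):
    """Bits per second for one token per frame from a single codebook."""
    return frame_rate_hz * math.log2(math.prod(levels))

# Assumed example values, not taken from the model card.
latents = np.array([[0.0, 0.0, 0.0], [10.0, 10.0, 10.0]])
codes, index = fsq_quantize(latents, [5, 5, 5])
rate = bitrate_bps(25, [5, 5, 5])
```

Because every code comes from a fixed grid rather than a learned codebook, there are no codebook vectors that can go unused during training, which is the usual motivation for FSQ over vector quantization.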
Architecture Overview
Acknowledgments
Citation
@article{qiang2025secousticodec,
title={SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec},
author={Chunyu Qiang and Haoyu Wang and Cheng Gong and Tianrui Wang and Ruibo Fu and Tao Wang and Ruilong Chen and Jiangyan Yi and Zhengqi Wen and Chen Zhang and Longbiao Wang and Jianwu Dang and Jianhua Tao},
journal={arXiv preprint arXiv:2508.02849},
year={2025}
}