Speech Detokenizer
Our detokenizer is built on the F5-TTS framework, with two improvements.
The DiT module is replaced by a DiT variant with cross-attention, similar to the detokenizer of GLM-4-Voice.
A chunk-based streaming inference algorithm is developed, which allows the model to generate speech of any length.
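To illustrate the general idea of chunk-based streaming, the following is a minimal sketch and not the algorithm used here, whose details are not given in this section. It assumes the token sequence is split into fixed-size chunks, that each chunk is synthesized together with a few preceding tokens as left context, and that the audio for the context is discarded. The names stream_detokenize and toy_synthesize, and the parameters chunk_size, context, and samples_per_token, are hypothetical; toy_synthesize is a stand-in for the real model.

```python
from typing import Callable, Iterator, List, Sequence


def stream_detokenize(
    tokens: Sequence[int],
    synthesize: Callable[[Sequence[int]], List[float]],
    chunk_size: int = 4,         # assumed number of new tokens per chunk
    context: int = 2,            # assumed number of left-context tokens
    samples_per_token: int = 3,  # assumed fixed audio samples per token
) -> Iterator[List[float]]:
    """Yield audio chunk by chunk so memory use does not grow with length."""
    for start in range(0, len(tokens), chunk_size):
        # Include a few already-emitted tokens so the chunk has context.
        ctx_start = max(0, start - context)
        window = tokens[ctx_start:start + chunk_size]
        audio = synthesize(window)
        # Drop the samples that belong to the context tokens; they were
        # already emitted with the previous chunk.
        skip = (start - ctx_start) * samples_per_token
        yield audio[skip:]


def toy_synthesize(window: Sequence[int]) -> List[float]:
    """Stand-in for the model: each token becomes 3 identical samples."""
    return [float(t) for t in window for _ in range(3)]


if __name__ == "__main__":
    tokens = list(range(10))
    for i, chunk in enumerate(stream_detokenize(tokens, toy_synthesize)):
        print(f"chunk {i}: {len(chunk)} samples")
```

Because each call only sees a bounded window of tokens, the cost per chunk is constant regardless of the total sequence length. A real system would likely also need to smooth the boundaries between chunks, which this sketch omits.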