Speech Detokenizer

Our detokenizer is built on the F5-TTS framework, with two modifications:

  1. The DiT module has been replaced with a DiT variant that uses cross-attention, similar to the detokenizer of GLM-4-Voice.

  2. A chunk-based streaming inference algorithm allows the model to generate speech of any length.
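
The chunk-based idea can be illustrated with a minimal sketch. This is not the released implementation: the function `stream_detokenize`, the callback `synthesize_chunk`, and the parameters `chunk_size` and `context_size` are all hypothetical names chosen for illustration, and the assumption that each chunk is conditioned on a short window of preceding tokens is ours. It only shows why processing a fixed-size chunk at a time keeps per-step cost bounded regardless of total length.

```python
from typing import Callable, Iterator, List, Sequence


def stream_detokenize(
    tokens: Sequence[int],
    synthesize_chunk: Callable[[Sequence[int], Sequence[int]], List[float]],
    chunk_size: int = 4,
    context_size: int = 2,
) -> Iterator[List[float]]:
    """Yield audio one chunk at a time (illustrative sketch, not the real API).

    `synthesize_chunk(context, chunk)` is a hypothetical stand-in for the
    model: it receives a few preceding tokens as context plus the tokens of
    the current chunk, and returns audio samples for the current chunk only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if context_size < 0:
        raise ValueError("context_size must be non-negative")

    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        # Assumption: a short look-back window keeps chunk boundaries smooth.
        context = tokens[max(0, start - context_size):start]
        yield synthesize_chunk(context, chunk)


def dummy_synthesize(context: Sequence[int], chunk: Sequence[int]) -> List[float]:
    # Placeholder "model": emits one sample per token and ignores the context.
    return [float(t) for t in chunk]


tokens = list(range(10))
chunks = list(stream_detokenize(tokens, dummy_synthesize))
audio = [sample for chunk in chunks for sample in chunk]
```

Because each call sees at most `chunk_size + context_size` tokens, the work per step does not grow with the length of the input, and audio can be played back as soon as the first chunk is ready.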

The detokenizer in this release was trained on approximately 6,000 hours of Chinese and English data, including WenetSpeech4TTS (both the Premium and Standard subsets), LibriTTS, and other datasets.