Russian Vosk TTS model
Version 0.9
Metrics:
- CER: 0.6
- FAD: 0.810
- UTMOS: 3.290
- Speaker Similarity: 0.875
- xRT CPU: 0.35
- xRT GPU: 0.06
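The xRT figures are most naturally read as a real-time factor: synthesis time divided by the duration of the audio produced, so lower is faster. That reading is an assumption, as the card does not define the metric. Under it, a minimal sketch of the arithmetic looks like this (the function names are hypothetical and not part of any Vosk API):

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by the duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def estimated_synthesis_seconds(xrt: float, audio_seconds: float) -> float:
    """Estimate how long it takes to synthesize audio of the given length."""
    return xrt * audio_seconds


# Assumed reading of the reported figures: 10 s of speech on CPU vs. GPU.
cpu_seconds = estimated_synthesis_seconds(0.35, 10.0)
gpu_seconds = estimated_synthesis_seconds(0.06, 10.0)
```

With this interpretation, both values below 1.0 mean the model synthesizes speech faster than it plays back.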
License: Apache 2.0
Changelog:
- ASR alignment
- No encoder; only a duration predictor
- Slightly narrower predictor width (160) to match the DiT hidden vector
- Diffusion loss is scaled so that it does not dominate the duration loss
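The last changelog item describes weighting one loss term against another. A minimal sketch of that idea, assuming the two terms are combined as a weighted sum, is shown below. The function name and the scale value of 0.1 are hypothetical; the actual combination and scale used in training are not stated here.

```python
def combined_loss(duration_loss: float,
                  diffusion_loss: float,
                  diffusion_scale: float = 0.1) -> float:
    """Weighted sum of the two training losses.

    diffusion_scale is an illustrative value, not the one used by the model.
    A scale below 1.0 shrinks the diffusion term so that the duration term
    still contributes meaningfully to the total.
    """
    return duration_loss + diffusion_scale * diffusion_loss


# Unscaled, the diffusion term would make up most of the total.
unscaled = combined_loss(0.5, 20.0, diffusion_scale=1.0)
# Scaled, the two terms are of comparable magnitude.
scaled = combined_loss(0.5, 20.0, diffusion_scale=0.1)
```

In a real training loop the same expression would be applied to tensors rather than floats, and the scale would typically be tuned so that neither term's gradient overwhelms the other.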