Text2Token (T2U)
The Text2Token module is a Transformer-based translation model. It takes a phoneme sequence as input; the G2P module converts raw text into phonemes. The Text2Token model in this release was trained with fairseq on approximately 380k hours of paired speech-text data.
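
To make the input format concrete, here is a minimal sketch of the text-to-phoneme step that precedes Text2Token. It is a toy illustration only: `TOY_LEXICON`, `toy_g2p`, and `to_source_line` are hypothetical names, the two-word lexicon stands in for the real G2P module, and the space-separated output format is an assumption about what a translation-style model reads, not the released interface.

```python
# Toy illustration only: the lexicon and function names below are hypothetical
# and are not part of the released G2P or Text2Token code.
TOY_LEXICON = {
    "hello": ["HH", "AH0", "L", "OW1"],
    "world": ["W", "ER1", "L", "D"],
}


def toy_g2p(text):
    """Map each known word to ARPAbet-style phonemes; unknown words become <unk>."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(TOY_LEXICON.get(word, ["<unk>"]))
    return phonemes


def to_source_line(phonemes):
    """Join phonemes with single spaces (assumed whitespace-tokenized source format)."""
    return " ".join(phonemes)


source_line = to_source_line(toy_g2p("Hello world"))
print(source_line)  # HH AH0 L OW1 W ER1 L D
```

In the actual pipeline, the phoneme sequence would come from the G2P module and be passed to the Text2Token model as the source side of the translation.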