Spaces: Soul-AILab/SoulX-Podcast-1.7B-Dialect (Running on Zero)

Apply for community grant: Academic project (gpu and storage)

by tiamojames - opened Oct 30

Soul-AILab org Oct 30

Recent advances in text-to-speech (TTS) synthesis have significantly improved
speech expressiveness and naturalness. However, most existing systems are tailored
for single-speaker synthesis and fall short in generating coherent multi-speaker
conversational speech. This technical report presents SoulX-Podcast, a system
designed for podcast-style multi-turn, multi-speaker dialogic speech generation,
while also achieving state-of-the-art performance in conventional TTS tasks. To
meet the higher naturalness demands of multi-turn spoken dialogue, SoulX-Podcast
integrates a range of paralinguistic controls and supports both Mandarin and
English, as well as several Chinese dialects, including Sichuanese, Henanese, and
Cantonese, enabling more personalized podcast-style speech generation.
Experimental results demonstrate that SoulX-Podcast can continuously produce over
90 minutes of conversation with stable speaker timbre and smooth speaker
transitions. Moreover, speakers exhibit contextually adaptive prosody, reflecting
natural rhythm and intonation changes as dialogues progress. Across multiple
evaluation metrics, SoulX-Podcast achieves state-of-the-art performance in both
monologue TTS and multi-turn conversational speech synthesis.

hysts

Oct 30

Hi @tiamojames, we've assigned ZeroGPU to this Space. Please check the compatibility and usage sections of this page so your Space can run on ZeroGPU.
If you can, we ask that you upgrade to Enterprise to enjoy higher ZeroGPU quota and other features like Dev Mode, Private Storage, and more: hf.co/enterprise
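
For anyone following up on the ZeroGPU note above, here is a minimal sketch of how a GPU-bound function is typically marked in a Space, using the `spaces.GPU` decorator from the `spaces` package. The `synthesize` function and its body are hypothetical placeholders and are not taken from the SoulX-Podcast code; the import fallback is only an assumption added so the sketch also runs where `spaces` is not installed.

```python
# Sketch only: `synthesize` is a hypothetical placeholder, not the
# actual SoulX-Podcast inference code.
try:
    import spaces  # Hugging Face helper package used on ZeroGPU Spaces

    gpu = spaces.GPU
except ImportError:
    # Assumed fallback: a no-op decorator so the sketch runs without `spaces`.
    def gpu(fn):
        return fn


@gpu
def synthesize(text: str) -> str:
    """Stand-in for a GPU-bound call; a real Space would run the model here."""
    return f"[audio for: {text}]"


if __name__ == "__main__":
    print(synthesize("Hello from the podcast demo."))
```

In a real Space, the decorated function would be the one that actually touches the GPU (for example the model's generation call), so that a GPU is only requested while that function runs.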
