Zonos is excellent, but there's one major issue that makes its use impractical -- the "uh" (and "um") sounds weren't labelled in the dataset
#7 opened by jattoedaltni
I've spoken to a few others about this, so I know it's not just me or the specific input .wavs. I think the training data didn't label the "uh" and "um" in people's speech, and as a result, stray "uh" sounds appear extremely consistently in the outputs.
(The model doesn't produce a normal, standalone "uh" either. Since it was trained on audio where "uh" occurred without a label, it attaches the sound to many words at random, especially those starting with vowels. The result sounds like an extra syllable fused directly onto the word, which makes the word partly unrecognizable.)
It's important that those "uh" and "um" sounds are removed, or at least labelled, for the next iteration.
Other than that it's almost perfect!
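
For what it's worth, here is a rough sketch of how clips with unlabelled fillers might be flagged. I don't know how the Zonos training data was actually prepared, so everything here is an assumption: it presumes each clip has both its dataset label and a second, verbatim transcript that keeps disfluencies (for example from a speech recognizer configured to do so). The function name `unlabelled_fillers` and both inputs are hypothetical.

```python
import re
from collections import Counter

# Filler words to look for; extend as needed (assumption: only these two matter).
FILLERS = {"uh", "um"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def unlabelled_fillers(label_text, verbatim_text):
    """Count fillers that appear in the verbatim transcript more often than in the label.

    Returns a dict mapping each filler to the number of occurrences
    that the dataset label is missing.
    """
    label_counts = Counter(t for t in tokenize(label_text) if t in FILLERS)
    verbatim_counts = Counter(t for t in tokenize(verbatim_text) if t in FILLERS)
    return {
        word: verbatim_counts[word] - label_counts[word]
        for word in FILLERS
        if verbatim_counts[word] > label_counts[word]
    }


if __name__ == "__main__":
    # Hypothetical clip: the label omits fillers that were actually spoken.
    print(unlabelled_fillers("I think it works", "I think uh it um uh works"))
```

Clips where this returns a non-empty result could then either be dropped or have their labels updated to include the fillers.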