SNAC output quality with high-pitch speech

#23
by matthen - opened

This is more of an issue with the SNAC model itself.

I'm finding that for some high-pitched speech, the decoded output sounds like the voice is cracking. Below I have attached an example waveform, the result of a SNAC encode/decode round trip, and a screenshot of the spectrograms.

Note that this is not using Orpheus; I'm just testing the SNAC model as a vocoder. But when I fine-tune Orpheus on waveforms like this, it ends up producing the same kind of artefact.

Has anyone seen this before? And are there any ideas for how to improve performance, maybe some pre-processing on the waveform that could help?
I can also try removing training examples whose f0 exceeds a certain threshold...
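
For reference, here is a rough sketch of what that f0 filter could look like. It uses a plain autocorrelation pitch estimate in NumPy rather than a proper pitch tracker, and the names `estimate_f0` and `keep_example`, the 350 Hz cutoff, the 0.3 voicing threshold, and the frame sizes are all placeholders I made up for illustration, not values tuned for SNAC or Orpheus.

```python
import numpy as np

def estimate_f0(wave, sr, fmin=60.0, fmax=800.0, frame_len=2048, hop=512):
    """Per-frame f0 in Hz via autocorrelation; NaN where a frame looks unvoiced."""
    lag_min = int(sr / fmax)   # shortest period we search for
    lag_max = int(sr / fmin)   # longest period we search for
    f0 = []
    for start in range(0, len(wave) - frame_len + 1, hop):
        frame = wave[start:start + frame_len].astype(np.float64)
        frame = frame - frame.mean()
        if np.dot(frame, frame) < 1e-8:      # silent frame
            f0.append(np.nan)
            continue
        # Autocorrelation at non-negative lags, normalised so ac[0] == 1.
        ac = np.correlate(frame, frame, mode="full")[frame_len - 1:]
        ac = ac / ac[0]
        lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
        # Assumed voicing threshold: weak peaks are treated as unvoiced.
        f0.append(sr / lag if ac[lag] > 0.3 else np.nan)
    return np.array(f0)

def keep_example(wave, sr, max_median_f0=350.0):
    """Return False for clips whose median voiced f0 exceeds the (hypothetical) cutoff."""
    f0 = estimate_f0(wave, sr)
    voiced = f0[~np.isnan(f0)]
    if voiced.size == 0:
        return True  # nothing voiced detected, so nothing to reject on
    return float(np.median(voiced)) <= max_median_f0
```

In practice a dedicated pitch tracker would probably be more robust on real speech; the point is only that the filter reduces to a per-clip statistic on f0 plus a cutoff that would have to be chosen by listening to the decoded output.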

Thanks!

input file:

encoded then decoded with hubertsiuzdak/snac_24khz:

spectrograms (I circled the part in the decoded output where the formants look disconnected):
image.png
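
To locate regions like the circled one without eyeballing the plot, one option is a frame-by-frame distance between the log-magnitude spectrograms of the input and the decoded audio. This is only a sketch: `log_spectrogram` and `framewise_spectral_distance` are names I made up, the FFT size and hop are arbitrary, and it assumes the two signals share a sample rate and are roughly time-aligned (they are simply trimmed to the shorter length).

```python
import numpy as np

def log_spectrogram(wave, n_fft=1024, hop=256):
    """Log-magnitude STFT in dB, shape (n_frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))
    return 20.0 * np.log10(mag + 1e-6)  # small floor avoids log(0)

def framewise_spectral_distance(ref, dec, n_fft=1024, hop=256):
    """RMS difference in dB between the two log spectrograms, one value per frame."""
    n = min(len(ref), len(dec))  # trim to a common length
    a = log_spectrogram(ref[:n], n_fft, hop)
    b = log_spectrogram(dec[:n], n_fft, hop)
    return np.sqrt(np.mean((a - b) ** 2, axis=1))
```

Frames where the distance spikes are candidates for the cracking artefact; multiplying the frame index by `hop` and dividing by the sample rate gives the approximate time in seconds.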
