
[STATUS] Jan 12 Forecast

#36
by hexgrad - opened

Jan 12: My intent is to supersede v0.19 with a better Kokoro model that dominates in every respect. To do this, I plan to continue training the unreleased v0.23 checkpoint on a richer data mix.

  • If successful, you should expect the next-gen Kokoro model to ship with more voices and languages, also under an Apache 2.0 license, with a similar 82M parameter architecture.
  • If unsuccessful, it would most likely be because the model does not converge, i.e. loss does not go down. That could be because of data quality issues, architecture limitations, overfitting on old data, underfitting on new data, etc. Rollbacks and model collapse are not unheard of in ML, but fingers crossed they do not happen here, or, if they do, that I can address such issues as they come up.
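To illustrate the "loss does not go down" failure mode mentioned above, here is a toy convergence check of the kind one might run during training. This is purely illustrative and not Kokoro's actual training code; the function name and thresholds are made up for the example.

```python
# Toy divergence check: flag a run whose smoothed loss stops improving.
# All names and thresholds here are hypothetical, for illustration only.

def is_converging(losses, window=5, min_improvement=1e-3):
    """Compare the mean of the last `window` losses against the
    window before it; require at least `min_improvement` of progress."""
    if len(losses) < 2 * window:
        return True  # too early to judge
    recent = sum(losses[-window:]) / window
    prior = sum(losses[-2 * window:-window]) / window
    return (prior - recent) >= min_improvement

print(is_converging([3.0, 2.5, 2.1, 1.8, 1.6, 1.5, 1.4, 1.35, 1.3, 1.28]))   # steadily improving
print(is_converging([1.30, 1.31, 1.30, 1.32, 1.31, 1.30, 1.31, 1.30, 1.32, 1.31]))  # plateaued
```

In practice a failed run would show a plateau (or a rising curve) like the second case, which is the signal for the rollbacks mentioned above.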

Behind the scenes, slabs of data have been (and still are) coming in thanks to the outstanding community response to #21 and I am incredibly grateful for it. Some of these slabs are languages new to the model, which is exciting. Note that #21 is first-come-first-serve, and at some point I will not be able to airdrop your data into a GPU in the middle of a training run.

Most of my focus is now on organizing these slabs such that they can be dispatched to GPUs later. Training has not started yet, since data is still flowing in and much processing work remains. In the meantime, I may not be able to get to some of your questions, but please understand it is not without reason.

That's it for now, thanks everyone!


hexgrad pinned discussion

Keep up the amazing work! Kokoro is a godsend to those who have been waiting for a license-permissive high quality TTS model for so long.

This is inspiring work! You are singlehandedly changing the game. God bless and always follow your vision 🙏

Great to hear. Question: is there any chance that Kokoro will eventually be able to handle things like breath sounds, coughs, and those.... interruptions... that normal speech has?

Not only is Apache 2.0 a good call, but I'm a huge fan of the 82M param size! Amazing work!!! ❤️❤️❤️

ps: could we have the updated discord server link, it says no longer working.


@shub1 The discord server link works, someone else had this issue earlier and said "Its a firefox problem refusing to open Discord" so maybe try another browser or switch to mobile.

Do you plan to include an emotion option, i.e. have the AI voice talk angrily or happily, etc.?

Kokoro is absolute trash. I've already switched to MeloTTS, which can do my beloved 44.1kHz! Try it out yourself! It can run on CPU!


@yukiarimo Gave it a fair shot. Exact copy and paste into https://huggingface.co/spaces/mrfakename/MeloTTS

Man, release the model quickly.

LMAO. Pre-trained models don’t count. If you have a professional, real studio-quality dataset, it will sound indistinguishable from human (AND THE NORMAL SAMPLING RATE, LIKE REALLY?????? Yes, I know that most people are deaf, but it is really audible).

Plus, you said you generated the dataset with ElevenLabs and other TTSs? Well, it definitely sounds very robotic and non-human!
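On the sampling-rate point raised above: a 24 kHz signal can always be resampled to 44.1 kHz for playback, but resampling adds no frequency content above 12 kHz, so the dispute is really about what bandwidth the model was trained to produce. A minimal sketch of the conversion, assuming NumPy and SciPy are available:

```python
import numpy as np
from scipy.signal import resample_poly

sr_in, sr_out = 24_000, 44_100
t = np.arange(sr_in) / sr_in        # one second of audio at 24 kHz
x = np.sin(2 * np.pi * 440 * t)     # 440 Hz test tone

# 44100/24000 reduces to 147/80; resample_poly upsamples by 147 and
# decimates by 80, with an anti-aliasing filter applied in between.
y = resample_poly(x, 147, 80)
print(len(x), len(y))  # 24000 44100
```

The output plays back at 44.1 kHz, but its spectrum is still band-limited to the original Nyquist frequency of 12 kHz.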

@FariqF Yeah, or at least the encoder part. Really, who cares about your weights (basic users do not count)? It is not an LLM where it is just too expensive to "drop the Wikipedia and go;" it is a TTS!


It's so disrespectful to call someone's hard work “absolute trash”.
I like MeloTTS too, but I have to disagree with you.

That's what I call everything I don't like. For example, ElevenLabs: Good? Yes! Absolute trash? Yes! Why:

  1. Not open-source
  2. No real voice generation control
  3. Watermark. For everything
  4. Shitty ToS
  5. PVC verification
  6. No RVC-like conversion
  7. Can’t generate >5k at once
  8. Full WAV = X2 credits
  9. No training config / Limited data input
  10. Doesn’t sound natural enough

The decision to make a model open source ultimately rests with its owner or creator.

Yes, I know. Just added it to make the list 10.

  • It is not a big deal if you just share the architecture but not the actual weights

@yukiarimo The architecture was already open sourced by Li et al:

Although I did make slight modifications for training and inference, I do not claim that this is a novel architecture. You are more than welcome to train your own StyleTTS2 model using those resources above.

Beyond that, some unsolicited advice: it would serve you well in life to be a little more respectful to people who give you free things. If you were to go to your hypothetical mother-in-law's house, eat a cooked meal, then spit in it and scream "It's not good enough! Make me another!" I don't think that would play out well for you.


@yukiarimo If you can find an apache-2.0 licensed, "real studio-quality dataset" then lmk cause those kinds of datasets are not easy to find.


@yukiarimo If you have switched to MeloTTS, why are you complaining about Kokoro? If MeloTTS works better for you then use it lol. Especially since it can run on CPU as well.


Hello, hexgrad,
First, I want to congratulate you on the amazing work you’ve done and thank you for making the TTS model you’re sharing with us open source!
I’d like to kindly ask if it’s possible for you to share the settings you use to train your models and also share the modified code, if that’s possible?
I want to train a model that speaks Bulgarian because most existing models currently sound very robotic and unnatural. I have high-quality audio and will do my best to train the model as well as I can.
Thank you in advance!

@yukiarimo If you have switched to MeloTTS, why are you complaining about Kokoro? If MeloTTS works better for you then use it lol. Especially since it can run on CPU as well.

Because MeloTTS is 200M and this one is 82M, so if it is possible to make “equal sounding” TTSs, I would like to go with the one that generates faster and uses less memory (assuming 1:1 identical generations)

@yukiarimo If you can find an apache-2.0 licensed, "real studio-quality dataset" then lmk cause those kinds of datasets are not easy to find.

I didn't! We have a voice actress at our headquarters who records material for datasets. All datasets are human-spoken (even for LLMs; I've never used shared LLM-generated data)

@yukiarimo The architecture was already open sourced by Li et al. Although I did make slight modifications for training and inference, I do not claim that this is a novel architecture. You are more than welcome to train your own StyleTTS2 model using those resources above.

Yes, but:

  1. What about the unreleased encoder part? Where did it come from? Also StyleTTS2?
  2. StyleTTS2 is slow and needs a lot of data, unlike Kokoro and Melo, which can be trained with a few hours of audio!


Then why don't you train Melo?

Already in the process. Just trying to experiment with different stuff to see what’s better!

Love how you degrade someone's hard work to the level of "absolute trash", point out that you have already switched, now you come back to say that yes, even though it's absolute trash, you still experiment with it because it gives you faster results using less memory. All this gives you wonderful credibility and reputation of course. Might I suggest you reevaluate what (or who) is absolute trash here?

Yes, of course! Do you want to know what is absolute trash here? That’s right -> it’s ElevenLabs who are putting their fucking watermark on every single clip!


Then go complain to ElevenLabs. At the end of the day hexgrad is the dev of the model. He can do what he wants with it. The code for StyleTTS2 is available online and you can try to recreate what hex did.
