llama.cpp binary usage?
Hi, is there a way to use the GGUF model directly with an API call from llama-server? Or is llama-cpp-python the only option?
It will work in the next version of koboldcpp
There is support for version 0.2, though some features like speaker loading and multilingual support aren't implemented yet. You can process the text yourself for other languages and input it if needed. Version 0.3 isn't supported currently due to some changes in the prompt format.
https://github.com/ggerganov/llama.cpp/pull/11070 (server POC): `./build/bin/llama-server --tts-oute-default -c 4096 --path ./examples/server/public_tts`
https://github.com/ggerganov/llama.cpp/tree/master/examples/tts
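For anyone wanting to call it from code: here's a minimal sketch using llama-server's generic `/completion` endpoint with the prompt format shown later in this thread. The endpoint choice and the `outetts_prompt` helper are my assumptions, not part of the PR; the server POC may expose a dedicated TTS route instead, and the output codes still need WavTokenizer to become audio.

```python
# Minimal sketch: drive an OuteTTS GGUF through llama-server's generic
# /completion endpoint. The endpoint choice and this prompt helper are
# assumptions based on the format shown in this thread; the server POC
# in the PR above may expose a dedicated TTS route instead.
import requests

def outetts_prompt(text: str) -> str:
    # Join words with <|space|>, matching the examples later in this thread
    # (exact formatting may differ between model versions).
    words = "<|space|>".join(text.lower().split())
    return f"<|im_start|>\n<|text_start|>{words}<|text_end|>\n<|audio_start|>\n"

resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": outetts_prompt("hello my friend"),
        "n_predict": 2048,
        "stop": ["<|audio_end|>"],
    },
)
# The result is a string of vocoder codes (e.g. "hello<|t_0.41|><|123|>...")
# that still has to be decoded to a waveform with WavTokenizer.
print(resp.json()["content"])
```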
Btw @edwko, check out the upcoming koboldcpp release (due tomorrow): https://github.com/LostRuins/koboldcpp/blob/concedo_experimental/otherarch/tts_adapter.cpp
Here, besides the guide tokens I've already demonstrated, I am using a technique to synthesize artificial unique speakers instead of simple voice cloning. Tested and working with both the 0.2 and 0.3 GGUFs. It's a two-pass process that allows the creation of consistent speakers from a user-provided seed.
First pass:
- Create a simple prompt of about 5 words. In koboldcpp I use the neutral phrase "but that is what it is".
- It's processed to convert it into the model's input format:

  ```
  <|im_start|>
  <|text_start|>but<|space|>that<|space|>is<|space|>what<|space|>it<|space|>is<|space|>
  <|text_end|>
  <|audio_start|>
  ```
- "Creative" sampler settings are set, e.g. Top-K=20, Temp=1.2, this allows for the creation of a much wider variety of synthetic speakers. I did notice that 0.2 was strongly biased towards Male voices, while 0.3 was kind of split between Male and Female.
- Generate until `<|audio_end|>`. We use the guide token technique to ensure the correct words get generated, and cache the result as a unique "speaker" referenced by the provided seed (see the sketch after this list).
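Under the same assumptions as the earlier sketch (plain `/completion` endpoint, hypothetical helper names), the first pass could look roughly like this; guide-token enforcement is omitted, since it needs per-step logit control rather than a single completion call:

```python
import requests

NEUTRAL_PHRASE = "but that is what it is"

def create_speaker(seed: int, host: str = "http://localhost:8080") -> str:
    """Generate and cache a synthetic speaker's 'voice texture' for a seed."""
    resp = requests.post(f"{host}/completion", json={
        "prompt": outetts_prompt(NEUTRAL_PHRASE),  # helper from the sketch above
        "n_predict": 1024,
        "temperature": 1.2,  # "creative" settings: wide variety of speakers
        "top_k": 20,
        "seed": seed,        # same seed -> the same synthetic speaker
        "stop": ["<|audio_end|>"],
    })
    # Everything up to <|audio_end|> is this seed's voice texture; cache it.
    return resp.json()["content"]

speaker_texture = create_speaker(seed=1234)
```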
Second pass:
- "Reliable" sampler settings are used instead (Top-K=4, Temp=0.7), providing increased consistency.
- Now, when people wish to use this specific speaker, we prepend the "neutral phrase" before their prompt text, and then append the cached "speaker voice texture" after `<|audio_start|>`:

  ```
  <|im_start|>
  <|text_start|>but<|space|>that<|space|>is<|space|>what<|space|>it<|space|>is<|space|>hello<|space|>my<|space|>friend<|text_end|>
  <|audio_start|>
  but<|t_0.20|><|686|><|1288|><|1251|><|1428|>
  ...
  is<|t_0.76|><|1748|><|1422|><|276|><|1337|><|1322|><|1519|><|1779|><|1067|><|1724|><|891|><|1205|><|1419|><|1144|><|1667|><|591|><|1003|><|1543|><|566|><|1390|><|426|><|1824|><|182|><|1138|><|52|><|129|><|1056|><|155|><|1056|><|1298|><|919|><|155|><|125|><|500|><|1022|><|571|><|315|><|400|><|100|><|617|><|295|><|757|><|324|><|592|><|1298|><|1310|><|57|><|876|><|1175|><|1353|><|1770|><|1649|><|1828|><|1637|><|362|><|1744|><|884|><|1027|><|space|>
  ```
- The resulting vocoder tokens will carry the correct, consistent speaker characteristics across all outputs, letting us explore the model's latent space and obtain hundreds to thousands of unique but consistent speakers from simple numeric seeds (sketched below).
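Continuing that sketch, the second pass would just prepend the neutral phrase on the text side and prime the audio side with the cached texture (again an illustration under the earlier assumptions, not koboldcpp's actual implementation):

```python
def speak(text: str, speaker_texture: str, host: str = "http://localhost:8080") -> str:
    """Synthesize `text` in the cached speaker's voice."""
    words = "<|space|>".join(f"{NEUTRAL_PHRASE} {text}".lower().split())
    prompt = (
        "<|im_start|>\n"
        f"<|text_start|>{words}<|text_end|>\n"
        "<|audio_start|>\n"
        f"{speaker_texture}"  # prime generation with the speaker's tokens
    )
    resp = requests.post(f"{host}/completion", json={
        "prompt": prompt,
        "n_predict": 4096,
        "temperature": 0.7,  # "reliable" settings: consistency over variety
        "top_k": 4,
        "stop": ["<|audio_end|>"],
    })
    # The new codes render the user's words in the cached voice; the leading
    # neutral-phrase audio can be trimmed off before decoding.
    return resp.json()["content"]

print(speak("hello my friend", speaker_texture))
```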
@concedo I'll definitely check it out! The idea of creating unique speakers like this is really interesting; can't wait to test it. Do you plan to allow loading speakers from JSON at some point? That would also be useful, as we could then use existing cloned speakers. Since the speaker format is now standardized in the library and independent of model versions, it saves both the transcription and the WavTokenizer codes.
Btw, @Henk717 was asking if you had a voice-cloning guide for creating new JSON speaker files.
@Henk717 For speaker creation check out the guide in docs: https://github.com/edwko/OuteTTS/blob/main/docs/interface_v2_usage.md#speaker-creation-and-management