---
datasets:
- galsenai/anta_women_tts
language:
- wo
base_model:
- coqui/XTTS-v2
tags:
- nlp
- tts
- speech
---
|
|
|
# Wolof Text To Speech |
|
This is a text-to-speech model that generates a synthetic voice speaking `Wolof` from any textual input in that language. It is based on [XTTS-v2](https://huggingface.co/coqui/XTTS-v2) and was fine-tuned on [Wolof-TTS](https://huggingface.co/datasets/galsenai/anta_women_tts/) data cleaned by the [GalsenAI Lab](https://huggingface.co/galsenai).
|
|
|
## Checkpoint ID |
|
To download the model, you'll need the [gdown](https://github.com/wkentaro/gdown) utility, which is included in the [Git project](https://github.com/Galsenaicommunity/Wolof-TTS) dependencies, and the model ID given in the [checkpoint-id](checkpoint-id.yml) YAML file (see the `Files and versions` tab above).
|
Then, use the command below to download the model checkpoint: |
|
```sh
gdown <Checkpoint ID>
```
|
|
|
## Usage |
|
### Configurations |
|
Start by cloning the project: |
|
```sh
git clone https://github.com/Galsenaicommunity/Wolof-TTS.git
```
|
Then, install the dependencies: |
|
```sh
cd Wolof-TTS/notebooks/Models/xTTS\ v2
pip install -r requirements.txt
```
|
> `IMPORTANT`: You don't need to install the TTS library; [a modified version](https://github.com/anhnh2002/XTTSv2-Finetuning-for-New-Languages/tree/main) is already included in the project's Git repository.
|
You can now download the model checkpoint with `gdown` as described above, then unzip it:
|
```sh
unzip galsenai-xtts-wo-checkpoints.zip && rm galsenai-xtts-wo-checkpoints.zip
```
|
> Note: the model checkpoint is over 7 GB in size.
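
After unzipping, the loading code in the next section expects a layout along these lines (only the files referenced on this card are shown; the archive may contain additional files):

```
galsenai-xtts-wo-checkpoints/
├── Anta_GPT_XTTS_Wo/
│   ├── best_model_89250.pth
│   └── config.json
├── XTTS_v2.0_original_model_files/
│   └── vocab.json
└── anta_sample.wav
```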
|
|
|
### Model Loading |
|
```py |
|
import os

import torch
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Paths to the unzipped checkpoint (see the layout above)
root_path = "../../../../galsenai-xtts-wo-checkpoints/"
checkpoint_path = os.path.join(root_path, "Anta_GPT_XTTS_Wo")
model_path = "best_model_89250.pth"

device = "cuda:0" if torch.cuda.is_available() else "cpu"

xtts_checkpoint = os.path.join(checkpoint_path, model_path)
xtts_config = os.path.join(checkpoint_path, "config.json")
xtts_vocab = os.path.join(root_path, "XTTS_v2.0_original_model_files", "vocab.json")

# Load the fine-tuned model
config = XttsConfig()
config.load_json(xtts_config)
XTTS_MODEL = Xtts.init_from_config(config)
XTTS_MODEL.load_checkpoint(
    config,
    checkpoint_path=xtts_checkpoint,
    vocab_path=xtts_vocab,
    use_deepspeed=False,
)
XTTS_MODEL.to(device)

print("Model loaded successfully!")
|
``` |
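
If loading fails, a quick sanity check (assuming the layout shown earlier) is to confirm that every file the loader needs is in place:

```py
# Optional sanity check: confirm the checkpoint, config and vocab files exist
for path in (xtts_checkpoint, xtts_config, xtts_vocab):
    assert os.path.exists(path), f"Missing file: {path}"
```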
|
|
|
### Model Inference |
|
XTTS can clone a voice from a sample as short as 6 seconds. Here, an audio sample from the training set is used as the `reference`, and therefore as the output voice of the TTS.
|
You can change it to any voice you wish, as long as you comply with data protection regulations. |
|
> Any use contrary to Senegalese law is strictly forbidden, and GalsenAI accepts no liability in such cases. |
|
> By using this model, you agree to comply with Senegalese law and not to use it in any way that could abuse or harm anyone.
|
```py |
|
from IPython.display import Audio

# Reference audio: the voice the TTS will clone.
# You can replace it with any recording of at least 6 s duration.
reference = os.path.join(root_path, "anta_sample.wav")
Audio(reference)
|
``` |
|
Synthetic voice generation from a `text` input:
|
```py |
|
text = "Màngi tuddu Aadama, di baat bii waa Galsen A.I defar ngir wax ak yéen ci wolof!"

# Compute the speaker conditioning latents from the reference audio
gpt_cond_latent, speaker_embedding = XTTS_MODEL.get_conditioning_latents(
    audio_path=[reference],
    gpt_cond_len=XTTS_MODEL.config.gpt_cond_len,
    max_ref_length=XTTS_MODEL.config.max_ref_len,
    sound_norm_refs=XTTS_MODEL.config.sound_norm_refs,
)

# Synthesize the text with the cloned voice
result = XTTS_MODEL.inference(
    text=text.lower(),
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    do_sample=False,
    speed=1.06,
    language="wo",
    enable_text_splitting=True,
)
|
``` |
|
The `inference` call returns a dictionary whose `wav` entry holds the synthesized waveform. You can then export the output audio:
|
```py |
|
import soundfile as sf

# `result["wav"]` is the synthesized waveform as a NumPy array;
# the model's output sample rate is stored in its config (24 kHz by default)
generated_audio = "generated_audio.wav"
sf.write(generated_audio, result["wav"], XTTS_MODEL.config.audio.output_sample_rate)
|
``` |
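
In a notebook, you can listen to the exported file directly:

```py
from IPython.display import Audio

Audio(generated_audio)
```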
|
A notebook enabling you to test the model quickly is available [at this link](https://colab.research.google.com/drive/1AAhAtWyFjGpLGWrXaeK04BWc1BlIkNBf?usp=sharing).
|
|
|
## LIMITATIONS |
|
The model was trained on the [cleaned Wolof-TTS data](https://huggingface.co/datasets/galsenai/anta_women_tts/), which contains pauses made during recording. This behavior carries over to the final model, and pauses may occur at random during inference.
|
To mitigate this, you can use the `removesilence.py` script included in the repository to strip those silences.
|
```py |
|
from removesilence import detect_silence, remove_silence

# Identify silent segments in the generated audio
lst = detect_silence(generated_audio)
print(lst)

# Strip them and write the cleaned file
output_audio = "audio_without_silence.wav"
remove_silence(generated_audio, lst, output_audio)
|
``` |
|
As the dataset contains almost no French or English terms, the model has difficulty correctly synthesizing [code-mixed](https://en.wikipedia.org/wiki/Code-mixing) text; the same goes for numbers, as illustrated in the sketch below.
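
As a partial workaround for numbers, you can spell digits out before synthesis. The sketch below is illustrative only: the `spell_out_digits` helper is not part of the repository, and the Wolof spellings in `DIGITS` should be verified by a native speaker.

```py
import re

# Illustrative digit-to-word mapping -- spellings to be verified
DIGITS = {"0": "tus", "1": "benn", "2": "ñaar", "3": "ñett", "4": "ñeent",
          "5": "juróom", "6": "juróom-benn", "7": "juróom-ñaar",
          "8": "juróom-ñett", "9": "juróom-ñeent"}

def spell_out_digits(text: str) -> str:
    # Replace each run of digits with spelled-out digits, one by one;
    # real number words would require proper Wolof number grammar.
    return re.sub(r"\d+", lambda m: " ".join(DIGITS[d] for d in m.group(0)), text)

print(spell_out_digits("Am na 3 xale."))  # -> Am na ñett xale.
```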
|
|
|
## ACKNOWLEDGEMENT |
|
This work was made possible thanks to the computational support of [Caytu Robotics](https://caytu.com/). |
|
GalsenAI disclaims all liability for any use of this voice synthesizer that contravenes personal data protection regulations or any law in force in Senegal.
|
__Please mention GalsenAI in all source code, repositories, and communications when using this tool.__
|
|
|
If you have any questions, please contact us at `contact[at]galsen[dot]ai`. |
|
|
|
## CREDITS |
|
* The [raw data](https://huggingface.co/datasets/galsenai/wolof_tts) has been organised and made available by [Alwaly](https://huggingface.co/Alwaly). |
|
* The [training notebook](https://github.com/Galsenaicommunity/Wolof-TTS/blob/main/notebooks/Models/xTTS%20v2/xTTS_v2_fine_tunnig_on_single_wolof_tts_dataset.ipynb) was set up by [Mouhamed Sarr (Loloskii)](https://github.com/mohaskii). |
|
* The model training on [GCP](https://cloud.google.com/) (`A100 40GB`), the silence-removal script (based on [this article](https://onkar-patil.medium.com/how-to-remove-silence-from-an-audio-using-python-50fd2c00557d)), and the demo notebook were carried out by [Derguene](https://huggingface.co/derguene).