---
datasets:
- galsenai/anta_women_tts
language:
- wo
base_model:
- coqui/XTTS-v2
tags:
- nlp
- tts
- speech
---
|
|
|
# Wolof Text To Speech |
|
This is a text-to-speech model that generates a synthetic voice speaking `Wolof` from any textual input in that language. It is based on [XTTS-v2](https://huggingface.co/coqui/XTTS-v2) and was fine-tuned on [Wolof-TTS](https://huggingface.co/datasets/galsenai/anta_women_tts/) data cleaned by the [GalsenAI Lab](https://huggingface.co/galsenai).
|
|
|
## Checkpoint ID |
|
To download the model, you'll need the [gdown](https://github.com/wkentaro/gdown) utility, which is included in the [Git project](https://github.com/Galsenaicommunity/Wolof-TTS) dependencies, and the model ID given in the [checkpoint-id](checkpoint-id.yml) YAML file (see the `Files and versions` tab above).
|
Then, use the command below to download the model checkpoint: |
|
```sh
gdown <Checkpoint ID>
```
|
|
|
## Usage |
|
### Configurations |
|
Start by cloning the project: |
|
```sh
git clone https://github.com/Galsenaicommunity/Wolof-TTS.git
```
|
Then, install the dependencies: |
|
```sh
cd Wolof-TTS/notebooks/Models/xTTS\ v2
pip install -r requirements.txt
```
|
> `IMPORTANT`: You don't need to install the TTS library; [a modified version](https://github.com/anhnh2002/XTTSv2-Finetuning-for-New-Languages/tree/main) is already included in the project's Git repository.
|
You can now download the model checkpoint with `gdown` as described above, then unzip it:
|
```sh
unzip galsenai-xtts-wo-checkpoints.zip && rm galsenai-xtts-wo-checkpoints.zip
```
|
> Note: the model checkpoint is over 7 GB in size.
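
After unzipping, the loading code in the next section expects a layout along these lines (only the files referenced on this card are shown; the archive may contain additional files):

```
galsenai-xtts-wo-checkpoints/
├── Anta_GPT_XTTS_Wo/
│   ├── best_model_89250.pth
│   └── config.json
├── XTTS_v2.0_original_model_files/
│   └── vocab.json
└── anta_sample.wav
```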
|
|
|
### Model Loading |
|
```py |
|
import os

import torch
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Paths to the unzipped checkpoint (see the layout above)
root_path = "../../../../galsenai-xtts-wo-checkpoints/"
checkpoint_path = os.path.join(root_path, "Anta_GPT_XTTS_Wo")
model_path = "best_model_89250.pth"

device = "cuda:0" if torch.cuda.is_available() else "cpu"

xtts_checkpoint = os.path.join(checkpoint_path, model_path)
xtts_config = os.path.join(checkpoint_path, "config.json")
xtts_vocab = os.path.join(root_path, "XTTS_v2.0_original_model_files", "vocab.json")

# Load the fine-tuned model
config = XttsConfig()
config.load_json(xtts_config)
XTTS_MODEL = Xtts.init_from_config(config)
XTTS_MODEL.load_checkpoint(
    config,
    checkpoint_path=xtts_checkpoint,
    vocab_path=xtts_vocab,
    use_deepspeed=False,
)
XTTS_MODEL.to(device)

print("Model loaded successfully!")
|
``` |
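
If loading fails, a quick sanity check (assuming the layout shown earlier) is to confirm that every file the loader needs is in place:

```py
# Optional sanity check: confirm the checkpoint, config and vocab files exist
for path in (xtts_checkpoint, xtts_config, xtts_vocab):
    assert os.path.exists(path), f"Missing file: {path}"
```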
|
|
|
### Model Inference |
|
XTTS can clone a voice from a sample as short as 6 seconds. Here, an audio sample from the training set is used as the `reference`, and therefore as the output voice of the TTS.
|
You can change it to any voice you wish, as long as you comply with data protection regulations. |
|
> Any use contrary to Senegalese law is strictly forbidden, and GalsenAI accepts no liability in such cases. |
|
> By using this model, you agree to comply with Senegalese law and not to use it in any way that could abuse or harm anyone.
|
```py |
|
from IPython.display import Audio

# Reference audio: the voice the TTS will clone.
# You can replace it with any recording of at least 6 s duration.
reference = os.path.join(root_path, "anta_sample.wav")
Audio(reference)
|
``` |
|
Synthetic voice generation from a `text` input:
|
```py |
|
text = "Màngi tuddu Aadama, di baat bii waa Galsen A.I defar ngir wax ak yéen ci wolof!"

# Compute the speaker conditioning latents from the reference audio
gpt_cond_latent, speaker_embedding = XTTS_MODEL.get_conditioning_latents(
    audio_path=[reference],
    gpt_cond_len=XTTS_MODEL.config.gpt_cond_len,
    max_ref_length=XTTS_MODEL.config.max_ref_len,
    sound_norm_refs=XTTS_MODEL.config.sound_norm_refs,
)

# Synthesize the text with the cloned voice
result = XTTS_MODEL.inference(
    text=text.lower(),
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    do_sample=False,
    speed=1.06,
    language="wo",
    enable_text_splitting=True,
)
|
``` |
|
The `inference` call returns a dictionary whose `wav` entry holds the synthesized waveform. You can then export the output audio:
|
```py |
|
import soundfile as sf

# `result["wav"]` is the synthesized waveform as a NumPy array;
# the model's output sample rate is stored in its config (24 kHz by default)
generated_audio = "generated_audio.wav"
sf.write(generated_audio, result["wav"], XTTS_MODEL.config.audio.output_sample_rate)
|
``` |
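
In a notebook, you can listen to the exported file directly:

```py
from IPython.display import Audio

Audio(generated_audio)
```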
|
A notebook enabling you to test the model quickly is available [at this link](https://colab.research.google.com/drive/1AAhAtWyFjGpLGWrXaeK04BWc1BlIkNBf?usp=sharing).
|
|
|
## LIMITATIONS |
|
The model was trained on the [cleaned Wolof-TTS data](https://huggingface.co/datasets/galsenai/anta_women_tts/), which contains pauses made during recording. This behavior carries over to the final model, and pauses may occur at random during inference.
|
To mitigate this, you can use the `removesilence.py` script included in the repository to strip those silences.
|
```py |
|
from removesilence import detect_silence, remove_silence

# Identify silent segments in the generated audio
lst = detect_silence(generated_audio)
print(lst)

# Strip them and write the cleaned file
output_audio = "audio_without_silence.wav"
remove_silence(generated_audio, lst, output_audio)
|
``` |
|
As the dataset contains almost no French or English terms, the model has difficulty correctly synthesizing [code-mixed](https://en.wikipedia.org/wiki/Code-mixing) text; the same goes for numbers, as illustrated in the sketch below.
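
As a partial workaround for numbers, you can spell digits out before synthesis. The sketch below is illustrative only: the `spell_out_digits` helper is not part of the repository, and the Wolof spellings in `DIGITS` should be verified by a native speaker.

```py
import re

# Illustrative digit-to-word mapping -- spellings to be verified
DIGITS = {"0": "tus", "1": "benn", "2": "ñaar", "3": "ñett", "4": "ñeent",
          "5": "juróom", "6": "juróom-benn", "7": "juróom-ñaar",
          "8": "juróom-ñett", "9": "juróom-ñeent"}

def spell_out_digits(text: str) -> str:
    # Replace each run of digits with spelled-out digits, one by one;
    # real number words would require proper Wolof number grammar.
    return re.sub(r"\d+", lambda m: " ".join(DIGITS[d] for d in m.group(0)), text)

print(spell_out_digits("Am na 3 xale."))  # -> Am na ñett xale.
```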
|
|
|
## ACKNOWLEDGEMENT |
|
This work was made possible thanks to the computational support of [Caytu Robotics](https://caytu.com/). |
|
GalsenAI disclaims all liability for any use of this voice synthesizer that contravenes personal data protection regulations or any law in force in Senegal.
|
__Please mention GalsenAI in all source code, repositories, and communications when using this tool.__
|
|
|
If you have any questions, please contact us at `contact[at]galsen[dot]ai`. |
|
|
|
## CREDITS |
|
* The [raw data](https://huggingface.co/datasets/galsenai/wolof_tts) has been organised and made available by [Alwaly](https://huggingface.co/Alwaly). |
|
* The [training notebook](https://github.com/Galsenaicommunity/Wolof-TTS/blob/main/notebooks/Models/xTTS%20v2/xTTS_v2_fine_tunnig_on_single_wolof_tts_dataset.ipynb) was set up by [Mouhamed Sarr (Loloskii)](https://github.com/mohaskii). |
|
* The model training on [GCP](https://cloud.google.com/) (`A100 40GB`), the silence-removal script (based on [this article](https://onkar-patil.medium.com/how-to-remove-silence-from-an-audio-using-python-50fd2c00557d)), and the demo notebook were carried out by [Derguene](https://huggingface.co/derguene).