|
---
base_model:
- parler-tts/parler-tts-mini-v1
datasets:
- amphion/Emilia-Dataset
language:
- en
library_name: transformers
license: cc-by-nc-sa-4.0
pipeline_tag: text-to-speech
---
|
|
|
# Parler-TTS Mini v1 ft. ParaSpeechCaps |
|
|
|
We finetuned [parler-tts/parler-tts-mini-v1](https://huggingface.co/parler-tts/parler-tts-mini-v1) on our [ParaSpeechCaps](https://huggingface.co/datasets/ajd12342/paraspeechcaps) dataset to create a TTS model that generates speech with control over rich styles (pitch, rhythm, clarity, emotion, etc.) via a textual style prompt (e.g. '*A male speaker's speech is distinguished by a slurred articulation, delivered at a measured pace in a clear environment.*').
|
|
|
ParaSpeechCaps (PSC) is our large-scale dataset that provides rich style annotations for speech utterances, covering 59 style tags that span both speaker-level intrinsic tags and utterance-level situational tags. It consists of a human-annotated subset, ParaSpeechCaps-Base, and a large automatically-annotated subset, ParaSpeechCaps-Scaled. Our novel pipeline, which combines off-the-shelf text and speech embedders, classifiers, and an audio language model, allows us to automatically scale rich tag annotations across such a wide variety of style tags for the first time.
|
|
|
Please take a look at our [paper](https://arxiv.org/abs/2503.04713), our [codebase](https://github.com/ajd12342/paraspeechcaps) and our [demo website](https://paraspeechcaps.github.io/) for more information. |
|
|
|
**License:** [CC BY-NC SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) |
|
|
|
|
|
## Usage |
|
|
|
### Installation |
|
This repository has been tested with Python 3.11 (`conda create -n paraspeechcaps python=3.11`); other recent versions will likely work as well.
|
```sh
git clone https://github.com/ajd12342/paraspeechcaps.git
cd paraspeechcaps/model/parler-tts
pip install -e .[train]
```
|
|
|
### Running Inference |
|
```py
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_name = "ajd12342/parler-tts-mini-v1-paraspeechcaps"
guidance_scale = 1.5

model = ParlerTTSForConditionalGeneration.from_pretrained(model_name).to(device)
# One tokenizer for the style description and one (left-padded) for the transcription
description_tokenizer = AutoTokenizer.from_pretrained(model_name)
transcription_tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")

# The description controls how the speech should sound; the transcription is what is said
input_description = "In a clear environment, a male voice speaks with a sad tone.".replace('\n', ' ').rstrip()
input_transcription = "Was that your landlord?".replace('\n', ' ').rstrip()

input_description_tokenized = description_tokenizer(input_description, return_tensors="pt").to(model.device)
input_transcription_tokenized = transcription_tokenizer(input_transcription, return_tensors="pt").to(model.device)

# guidance_scale > 1 strengthens adherence to the style description
generation = model.generate(input_ids=input_description_tokenized.input_ids, prompt_input_ids=input_transcription_tokenized.input_ids, guidance_scale=guidance_scale)

audio_arr = generation.cpu().numpy().squeeze()
sf.write("output.wav", audio_arr, model.config.sampling_rate)
```
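
Because `transcription_tokenizer` is created with `padding_side="left"`, prompts can be padded for multi-sample generation. Below is a minimal batched sketch, assuming the standard Parler-TTS batch pattern of passing `attention_mask` and `prompt_attention_mask` to `generate`; the example descriptions and transcriptions are placeholders, and it reuses the objects defined above.

```py
# Hedged sketch: batched generation following the common Parler-TTS batch pattern;
# not taken verbatim from the ParaSpeechCaps codebase.
descriptions = [
    "In a clear environment, a male voice speaks with a sad tone.",
    "A female speaker delivers her words quickly and with high energy.",
]
transcriptions = ["Was that your landlord?", "I cannot believe we won the game!"]

# Right-pad descriptions, left-pad transcriptions (as configured above)
desc_inputs = description_tokenizer(descriptions, return_tensors="pt", padding=True).to(model.device)
prompt_inputs = transcription_tokenizer(transcriptions, return_tensors="pt", padding=True).to(model.device)

generation = model.generate(
    input_ids=desc_inputs.input_ids,
    attention_mask=desc_inputs.attention_mask,
    prompt_input_ids=prompt_inputs.input_ids,
    prompt_attention_mask=prompt_inputs.attention_mask,
    guidance_scale=guidance_scale,
)

# Note: shorter clips in a batch may end with padded silence
for i, audio in enumerate(generation.cpu().numpy()):
    sf.write(f"output_{i}.wav", audio.squeeze(), model.config.sampling_rate)
```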
|
|
|
For a full inference script with ASR-based selection via repeated sampling, along with other scripts, refer to our [codebase](https://github.com/ajd12342/paraspeechcaps); a rough illustration of the idea follows below.
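
The hedged sketch below is not the codebase's exact implementation: it draws several candidates (Parler-TTS samples by default, so repeated calls differ), transcribes each with an off-the-shelf ASR model, and keeps the candidate whose transcript best matches the intended transcription. The `openai/whisper-small` checkpoint, the candidate count, and the `difflib` similarity metric are illustrative assumptions; it reuses `model`, the tokenized inputs, and `guidance_scale` from the snippet above.

```py
# Hedged sketch of ASR-based selection via repeated sampling; reuses objects from the
# inference snippet above. ASR checkpoint and similarity metric are illustrative choices.
import difflib
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small", device=device)

best_audio, best_score = None, -1.0
for _ in range(5):  # the number of candidates is a tunable assumption
    candidate = model.generate(
        input_ids=input_description_tokenized.input_ids,
        prompt_input_ids=input_transcription_tokenized.input_ids,
        guidance_scale=guidance_scale,
    )
    audio = candidate.cpu().numpy().squeeze()
    # Transcribe the candidate and score it against the intended transcription
    hypothesis = asr({"array": audio, "sampling_rate": model.config.sampling_rate})["text"]
    score = difflib.SequenceMatcher(None, hypothesis.strip().lower(), input_transcription.lower()).ratio()
    if score > best_score:
        best_audio, best_score = audio, score

sf.write("output_best.wav", best_audio, model.config.sampling_rate)
```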
|
|
|
## Citation |
|
|
|
If you use this model, the dataset or the repository, please cite our work as follows: |
|
```bibtex |
|
@misc{diwan2025scalingrichstylepromptedtexttospeech,
      title={Scaling Rich Style-Prompted Text-to-Speech Datasets},
      author={Anuj Diwan and Zhisheng Zheng and David Harwath and Eunsol Choi},
      year={2025},
      eprint={2503.04713},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2503.04713},
}
|
``` |
|
|