---
base_model:
- parler-tts/parler-tts-mini-v1
datasets:
- amphion/Emilia-Dataset
language:
- en
library_name: transformers
license: cc-by-nc-sa-4.0
pipeline_tag: text-to-speech
---
# Parler-TTS Mini v1 ft. ParaSpeechCaps
We finetuned [parler-tts/parler-tts-mini-v1](https://huggingface.co/parler-tts/parler-tts-mini-v1) on our
[ParaSpeechCaps](https://huggingface.co/datasets/ajd12342/paraspeechcaps) dataset
to create a TTS model whose rich speaking styles (pitch, rhythm, clarity, emotion, etc.) can be controlled
with a textual style prompt (e.g. '*A male speaker's speech is distinguished by a slurred articulation, delivered at a measured pace in a clear environment.*').
ParaSpeechCaps (PSC) is our large-scale dataset of rich style annotations for speech utterances,
covering 59 style tags that span both speaker-level intrinsic tags and utterance-level situational tags.
It consists of a human-annotated subset, ParaSpeechCaps-Base, and a large automatically annotated subset, ParaSpeechCaps-Scaled.
Our novel annotation pipeline, which combines off-the-shelf text and speech embedders, classifiers, and an audio language model,
is the first to automatically scale rich style tag annotations to such a wide variety of tags.
Please take a look at our [paper](https://arxiv.org/abs/2503.04713), our [codebase](https://github.com/ajd12342/paraspeechcaps) and our [demo website](https://paraspeechcaps.github.io/) for more information.
**License:** [CC BY-NC SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)
## Usage
### Installation
This repository has been tested with Python 3.11 (`conda create -n paraspeechcaps python=3.11`), but other recent versions should also work.
```sh
git clone https://github.com/ajd12342/paraspeechcaps.git
cd paraspeechcaps/model/parler-tts
pip install -e ".[train]"
```
### Running Inference
```py
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model_name = "ajd12342/parler-tts-mini-v1-paraspeechcaps"
guidance_scale = 1.5

model = ParlerTTSForConditionalGeneration.from_pretrained(model_name).to(device)
# The style description and the transcription use separate tokenizers;
# the transcription tokenizer pads on the left.
description_tokenizer = AutoTokenizer.from_pretrained(model_name)
transcription_tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")

# A style prompt describing how the speech should sound, and the text to speak.
# The replace/rstrip normalizes whitespace in case you paste multi-line strings.
input_description = "In a clear environment, a male voice speaks with a sad tone.".replace('\n', ' ').rstrip()
input_transcription = "Was that your landlord?".replace('\n', ' ').rstrip()

input_description_tokenized = description_tokenizer(input_description, return_tensors="pt").to(model.device)
input_transcription_tokenized = transcription_tokenizer(input_transcription, return_tensors="pt").to(model.device)

# Generate audio conditioned on the style description, using classifier-free guidance.
generation = model.generate(
    input_ids=input_description_tokenized.input_ids,
    prompt_input_ids=input_transcription_tokenized.input_ids,
    guidance_scale=guidance_scale,
)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("output.wav", audio_arr, model.config.sampling_rate)
```
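Because the style conditioning is free-form text, changing the speaking style only requires changing the description string. The prompts below are illustrative examples we made up for this card, not descriptions taken from ParaSpeechCaps; they reuse the model, tokenizers, and transcription from the script above:

```py
# Illustrative style prompts (examples only, not drawn from the dataset).
descriptions = [
    "A female speaker with a high-pitched voice delivers her words quickly and with enthusiasm.",
    "In a noisy environment, a male speaker talks slowly with a deep, gravelly voice.",
]
for i, description in enumerate(descriptions):
    description_ids = description_tokenizer(description, return_tensors="pt").to(model.device)
    generation = model.generate(
        input_ids=description_ids.input_ids,
        prompt_input_ids=input_transcription_tokenized.input_ids,
        guidance_scale=guidance_scale,
    )
    sf.write(f"output_{i}.wav", generation.cpu().numpy().squeeze(), model.config.sampling_rate)
```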
For a full inference script that includes ASR-based selection via repeated sampling, as well as other scripts, refer to our [codebase](https://github.com/ajd12342/paraspeechcaps).
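As a rough illustration of that selection step (a minimal sketch, not the exact script from the codebase: the choice of `openai/whisper-large-v3` as the ASR model and of `jiwer` for word error rate are our assumptions, and `jiwer` and `torchaudio` may need to be installed separately), you can sample several candidates and keep the one whose transcript is closest to the requested text:

```py
from transformers import pipeline
import jiwer

# ASR model used to check how intelligible each candidate is (illustrative choice).
asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3", device=device)

best_audio, best_wer = None, float("inf")
for _ in range(3):  # the number of candidates is a tunable choice
    # Parler-TTS samples by default, so repeated calls yield different candidates.
    generation = model.generate(
        input_ids=input_description_tokenized.input_ids,
        prompt_input_ids=input_transcription_tokenized.input_ids,
        guidance_scale=guidance_scale,
    )
    candidate = generation.cpu().numpy().squeeze()
    transcript = asr({"raw": candidate, "sampling_rate": model.config.sampling_rate})["text"]
    wer = jiwer.wer(input_transcription.lower(), transcript.lower())
    if wer < best_wer:
        best_audio, best_wer = candidate, wer

sf.write("output_best.wav", best_audio, model.config.sampling_rate)
```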
## Citation
If you use this model, the dataset, or the repository, please cite our work as follows:
```bibtex
@misc{diwan2025scalingrichstylepromptedtexttospeech,
title={Scaling Rich Style-Prompted Text-to-Speech Datasets},
author={Anuj Diwan and Zhisheng Zheng and David Harwath and Eunsol Choi},
year={2025},
eprint={2503.04713},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2503.04713},
}
```