---
base_model:
- parler-tts/parler-tts-mini-v1
datasets:
- amphion/Emilia-Dataset
language:
- en
library_name: transformers
license: cc-by-nc-sa-4.0
pipeline_tag: text-to-speech
---

# Parler-TTS Mini v1 ft. ParaSpeechCaps

We fine-tuned [parler-tts/parler-tts-mini-v1](https://huggingface.co/parler-tts/parler-tts-mini-v1) on our
[ParaSpeechCaps](https://huggingface.co/datasets/ajd12342/paraspeechcaps) dataset
to create a TTS model that can generate speech with rich, controllable styles (pitch, rhythm, clarity, emotion, etc.)
specified by a textual style prompt (e.g. '*A male speaker's speech is distinguished by a slurred articulation, delivered at a measured pace in a clear environment.*').

ParaSpeechCaps (PSC) is our large-scale dataset that provides rich style annotations for speech utterances,
covering 59 style tags that span both speaker-level intrinsic styles and utterance-level situational styles.
It consists of a human-annotated subset, ParaSpeechCaps-Base, and a large automatically annotated subset, ParaSpeechCaps-Scaled.
Our novel annotation pipeline, which combines off-the-shelf text and speech embedders, classifiers and an audio language model, allows us to automatically scale rich tag annotations
across such a wide variety of style tags for the first time.
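If you would like to inspect the ParaSpeechCaps annotations directly, the minimal sketch below uses the `datasets` library. Since the configuration and split names are not documented on this card, it discovers them at runtime rather than assuming them (the dataset may also require accepting its terms on the Hub first).

```py
from datasets import get_dataset_config_names, get_dataset_split_names, load_dataset

repo = "ajd12342/paraspeechcaps"

# Discover the available configurations and splits rather than hard-coding them.
configs = get_dataset_config_names(repo)
print(configs)
splits = get_dataset_split_names(repo, configs[0])
print(splits)

# Load one split and look at a single annotated example.
dataset = load_dataset(repo, configs[0], split=splits[0])
print(dataset[0])
```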

Please take a look at our [paper](https://arxiv.org/abs/2503.04713), our [codebase](https://github.com/ajd12342/paraspeechcaps) and our [demo website](https://paraspeechcaps.github.io/) for more information.

**License:** [CC BY-NC SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)


## Usage

### Installation
This repository has been tested with Python 3.11 (`conda create -n paraspeechcaps python=3.11`), but most other versions should probably work.
```sh
git clone https://github.com/ajd12342/paraspeechcaps.git
cd paraspeechcaps/model/parler-tts
pip install -e .[train]
```

### Running Inference
```py
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_name = "ajd12342/parler-tts-mini-v1-paraspeechcaps"
guidance_scale = 1.5

# Load the model and two tokenizers: one for the style description and one
# (left-padded) for the transcription that should be spoken.
model = ParlerTTSForConditionalGeneration.from_pretrained(model_name).to(device)
description_tokenizer = AutoTokenizer.from_pretrained(model_name)
transcription_tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")

# Style prompt (how to speak) and transcription (what to say).
input_description = "In a clear environment, a male voice speaks with a sad tone.".replace('\n', ' ').rstrip()
input_transcription = "Was that your landlord?".replace('\n', ' ').rstrip()

input_description_tokenized = description_tokenizer(input_description, return_tensors="pt").to(model.device)
input_transcription_tokenized = transcription_tokenizer(input_transcription, return_tensors="pt").to(model.device)

# Generate audio conditioned on the style description and transcription,
# using classifier-free guidance.
generation = model.generate(input_ids=input_description_tokenized.input_ids, prompt_input_ids=input_transcription_tokenized.input_ids, guidance_scale=guidance_scale)

# Decode to a waveform and save it to disk.
audio_arr = generation.cpu().numpy().squeeze()
sf.write("output.wav", audio_arr, model.config.sampling_rate)
```
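The style prompt is the primary control surface: keeping the transcription fixed and changing only the description changes attributes such as pitch, emotion and pace. A minimal sketch reusing the objects defined above (the example descriptions are illustrative, not drawn from the dataset):

```py
# Generate the same sentence in several different styles by swapping the style prompt.
descriptions = [
    "A female speaker delivers her words slowly, in a whispered, sad tone.",
    "A male speaker talks quickly and enthusiastically in a clear environment.",
]

for i, description in enumerate(descriptions):
    description_tokenized = description_tokenizer(description, return_tensors="pt").to(model.device)
    generation = model.generate(
        input_ids=description_tokenized.input_ids,
        prompt_input_ids=input_transcription_tokenized.input_ids,
        guidance_scale=guidance_scale,
    )
    sf.write(f"output_{i}.wav", generation.cpu().numpy().squeeze(), model.config.sampling_rate)
```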

For a full inference script that performs ASR-based selection via repeated sampling, along with other scripts, refer to our [codebase](https://github.com/ajd12342/paraspeechcaps).
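The exact selection procedure lives in the codebase; the sketch below only illustrates the general idea, assuming Whisper (via the `transformers` ASR pipeline) as the recognizer and word error rate (via `jiwer`) as the selection criterion; neither choice is specified on this card. It reuses the objects from the inference snippet above.

```py
import jiwer
from transformers import pipeline

# Hypothetical ASR-based selection: sample several generations and keep the one
# whose ASR transcript is closest to the intended transcription.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small.en", device=device)

num_samples = 5
best_audio, best_wer = None, float("inf")
for _ in range(num_samples):
    generation = model.generate(
        input_ids=input_description_tokenized.input_ids,
        prompt_input_ids=input_transcription_tokenized.input_ids,
        guidance_scale=guidance_scale,
        do_sample=True,  # sampling makes repeated draws differ
    )
    audio_arr = generation.cpu().numpy().squeeze()
    transcript = asr({"raw": audio_arr, "sampling_rate": model.config.sampling_rate})["text"]
    wer = jiwer.wer(input_transcription.lower(), transcript.lower())
    if wer < best_wer:
        best_audio, best_wer = audio_arr, wer

sf.write("output_best.wav", best_audio, model.config.sampling_rate)
```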

## Citation

If you use this model, the dataset or the repository, please cite our work as follows:
```bibtex
@misc{diwan2025scalingrichstylepromptedtexttospeech,
      title={Scaling Rich Style-Prompted Text-to-Speech Datasets}, 
      author={Anuj Diwan and Zhisheng Zheng and David Harwath and Eunsol Choi},
      year={2025},
      eprint={2503.04713},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2503.04713}, 
}
```