	---
	base_model:
	- parler-tts/parler-tts-mini-v1
	datasets:
	- amphion/Emilia-Dataset
	language:
	- en
	library_name: transformers
	license: cc-by-nc-sa-4.0
	pipeline_tag: text-to-speech
	---

	# Parler-TTS Mini v1 ft. ParaSpeechCaps

	We finetuned [parler-tts/parler-tts-mini-v1](https://huggingface.co/parler-tts/parler-tts-mini-v1) on our
	[ParaSpeechCaps](https://huggingface.co/datasets/ajd12342/paraspeechcaps) dataset
	to create a TTS model that generates speech in rich styles (pitch, rhythm, clarity, emotion, etc.)
	controlled by a textual style prompt (for example, "A male speaker's speech is distinguished by a slurred articulation, delivered at a measured pace in a clear environment.").

	ParaSpeechCaps (PSC) is our large-scale dataset that provides rich style annotations for speech utterances.
	It supports 59 style tags, covering both speaker-level intrinsic tags and utterance-level situational tags.
	It consists of a human-annotated subset, ParaSpeechCaps-Base, and a large automatically annotated subset, ParaSpeechCaps-Scaled.
	Our novel pipeline, which combines off-the-shelf text and speech embedders, classifiers, and an audio language model, allows us to automatically scale rich tag annotations
	to such a wide variety of style tags for the first time.

	Please take a look at our [paper](https://arxiv.org/abs/2503.04713), our [codebase](https://github.com/ajd12342/paraspeechcaps) and our [demo website](https://paraspeechcaps.github.io/) for more information.

	License: [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)


	## Usage

	### Installation
	This repository has been tested with Python 3.11 (`conda create -n paraspeechcaps python=3.11`). Other recent Python 3 versions are likely to work but have not been tested.
	```sh
	git clone https://github.com/ajd12342/paraspeechcaps.git
	cd paraspeechcaps/model/parler-tts
	pip install -e ".[train]"
	```
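
After installing, a quick import check can confirm that the environment is ready before loading the model. The sketch below is a hypothetical helper, not part of the repository; it assumes that the four modules imported in the inference example (`torch`, `transformers`, `parler_tts`, `soundfile`) are the ones that matter, and whether the `[train]` extra installs all of them is not stated here.

```python
import importlib.util

# Top-level modules imported by the inference example in this README.
REQUIRED_MODULES = ("torch", "transformers", "parler_tts", "soundfile")


def missing_modules(names=REQUIRED_MODULES):
    """Return the names in `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All required modules are importable.")
```

If a module is reported missing, installing it with `pip` inside the same environment is the usual remedy.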

	### Running Inference
	```py
	import torch
	from parler_tts import ParlerTTSForConditionalGeneration
	from transformers import AutoTokenizer
	import soundfile as sf

	device = "cuda:0" if torch.cuda.is_available() else "cpu"
	model_name = "ajd12342/parler-tts-mini-v1-paraspeechcaps"
	guidance_scale = 1.5

	model = ParlerTTSForConditionalGeneration.from_pretrained(model_name).to(device)
	description_tokenizer = AutoTokenizer.from_pretrained(model_name)
	transcription_tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")

	input_description = "In a clear environment, a male voice speaks with a sad tone.".replace('\n', ' ').rstrip()
	input_transcription = "Was that your landlord?".replace('\n', ' ').rstrip()

	input_description_tokenized = description_tokenizer(input_description, return_tensors="pt").to(model.device)
	input_transcription_tokenized = transcription_tokenizer(input_transcription, return_tensors="pt").to(model.device)

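	# The style description is passed as input_ids; the text to be spoken is passed as prompt_input_ids.
	generation = model.generate(
	    input_ids=input_description_tokenized.input_ids,
	    prompt_input_ids=input_transcription_tokenized.input_ids,
	    guidance_scale=guidance_scale,
	)

	# Move the generated waveform to the CPU and save it as a WAV file at the model's sampling rate.
	audio_arr = generation.cpu().numpy().squeeze()
	sf.write("output.wav", audio_arr, model.config.sampling_rate)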
	```
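
The `guidance_scale = 1.5` setting above is easiest to interpret against the standard classifier-free guidance formula. The sketch below assumes that formulation: the final logits are the unconditional logits plus the scaled difference between conditional and unconditional logits. `apply_guidance` is a hypothetical helper written for illustration, not a function from `parler_tts`, and this README does not confirm how the model applies the scale internally.

```python
from typing import List, Sequence


def apply_guidance(
    cond_logits: Sequence[float],
    uncond_logits: Sequence[float],
    guidance_scale: float,
) -> List[float]:
    """Combine logits as uncond + scale * (cond - uncond), element by element."""
    if len(cond_logits) != len(uncond_logits):
        raise ValueError("conditional and unconditional logits must have the same length")
    return [u + guidance_scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


# Toy values: a scale of 1.0 reproduces the conditional logits, while a larger
# scale pushes each logit further from the unconditional value.
conditional = [2.0, 0.0, -1.0]
unconditional = [1.0, 0.0, 1.0]
print(apply_guidance(conditional, unconditional, 1.0))
print(apply_guidance(conditional, unconditional, 1.5))
```

Under this assumption, values above 1 weight the style description more heavily, which is worth keeping in mind when experimenting with the setting.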

	For a full inference script that includes ASR-based selection via repeated sampling, as well as other scripts, refer to our [codebase](https://github.com/ajd12342/paraspeechcaps).
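
To illustrate the idea behind ASR-based selection via repeated sampling, here is a minimal sketch: generate several candidates, transcribe each one, and keep the candidate whose transcript has the lowest word error rate against the intended text. This is an assumption about the general technique, not the script from the codebase; `word_error_rate`, `select_best_candidate`, and the `transcribe` callable (a stand-in for whatever ASR model is used) are hypothetical names.

```python
import string
from typing import Callable, Sequence, Tuple, TypeVar

Candidate = TypeVar("Candidate")

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def _words(text: str) -> list:
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    return text.lower().translate(_PUNCT_TABLE).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = _words(reference), _words(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    # Row-by-row dynamic-programming Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


def select_best_candidate(
    target_text: str,
    candidates: Sequence[Candidate],
    transcribe: Callable[[Candidate], str],
) -> Tuple[int, float]:
    """Return (index, WER) of the candidate whose transcript best matches target_text."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    scores = [word_error_rate(target_text, transcribe(c)) for c in candidates]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

In practice, `candidates` would be the audio arrays returned by repeated `model.generate` calls with sampling enabled, and `transcribe` would wrap an ASR model of your choice.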

	## Citation

	If you use this model, the dataset or the repository, please cite our work as follows:
	```bibtex
	@misc{diwan2025scalingrichstylepromptedtexttospeech,
	title={Scaling Rich Style-Prompted Text-to-Speech Datasets},
	author={Anuj Diwan and Zhisheng Zheng and David Harwath and Eunsol Choi},
	year={2025},
	eprint={2503.04713},
	archivePrefix={arXiv},
	primaryClass={eess.AS},
	url={https://arxiv.org/abs/2503.04713},
	}
	```