ParlerVoice

Professional Text-to-Speech by VoicingAI R&D Labs

ParlerVoice is a text-to-speech model fine-tuned from parler-tts/parler-tts-mini-v1.1 for finer expressive control and more consistent speaker identity. It is trained on 650+ hours of curated proprietary audio and generates speech at 24kHz.

✨ Key Features

🏆 Extensive Training Data: Fine-tuned on 650+ hours of carefully curated, high-quality proprietary audio data (dataset release coming soon!)
👥 Comprehensive Speaker Library: 85 distinct speaker identities with consistent, recognizable voices across different accents and demographics
🎭 Advanced Expressiveness: Precise control over tone, emotion, pitch, pace, style, reverb, and background noise through natural language descriptions
🔬 Technical Architecture: Advanced two-tokenizer system enabling both prompt-based and description-based generation
🌍 Multi-Accent Support: Coverage for American, British, Australian, Canadian, South African, Italian, and Irish accents

Technical Specifications

Base Model: parler-tts/parler-tts-mini-v1.1
Training Data: 650+ hours of curated proprietary audio (dataset release coming soon - stay tuned!)
Architecture: Two-tokenizer flow for enhanced control and consistency
Output Quality: 24kHz high-fidelity audio generation

📈 Technical Performance

Our technical evaluation demonstrates strong performance across key metrics:

🏆 Performance Benchmarks: Achieved 95.2% speaker similarity consistency across different emotional states and 4.7/5.0 naturalness score in comprehensive human evaluations
🔬 Architecture Studies: Analysis showed the two-tokenizer approach provides improved expressive control compared to single-tokenizer baselines
⚖️ Comparative Analysis: Offers competitive inference speed while maintaining high audio quality at 24kHz resolution
🌍 Dataset Quality: The 650+ hour curated proprietary dataset supports 85 distinct voice identities across 7 accent categories (public release coming soon!)

📊 View Full Technical Report & Audio Samples

🛠 Installation

# Install base dependencies
pip install git+https://github.com/huggingface/parler-tts.git

# Install ParlerVoice (for advanced features and presets)
# Run this from a clone of the ParlerVoice GitHub repository, which provides requirements.txt
pip install -r requirements.txt

💻 Usage

Quick Start with Transformers API

import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the model
model = ParlerTTSForConditionalGeneration.from_pretrained("TieIncred/ParlerVoice").to(device)
prompt_tokenizer = AutoTokenizer.from_pretrained("TieIncred/ParlerVoice")
description_tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder._name_or_path)

prompt = "Hey, how are you doing today?"
description = (
    "Connor conveys a neutral mood through a professional and controlled delivery. "
    "He speaks with a slightly low pitch, adding subtle weight to his delivery. "
    "His pace is moderate, keeping the speech easy to follow. "
    "His voice is slightly expressive, with subtle emotional inflections. "
    "The recording is exceptionally clean and close-sounding."
)

desc_inputs = description_tokenizer(description, return_tensors="pt").to(device)
prompt_inputs = prompt_tokenizer(prompt, return_tensors="pt").to(device)

gen = model.generate(
    input_ids=desc_inputs.input_ids,
    attention_mask=desc_inputs.attention_mask,
    prompt_input_ids=prompt_inputs.input_ids,
    prompt_attention_mask=prompt_inputs.attention_mask,
)

audio_arr = gen.cpu().numpy().squeeze()
sf.write("parlervoice_out.wav", audio_arr, model.config.sampling_rate)
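If `soundfile` is not available, the generated waveform can also be saved with Python's standard `wave` module. The following is a minimal sketch, assuming a mono waveform of float samples in the range [-1, 1]; `write_wav_pcm16` is a hypothetical helper name and is not part of ParlerVoice or Parler-TTS.

```python
import struct
import wave


def write_wav_pcm16(path, samples, sampling_rate):
    """Write mono float samples (expected range [-1, 1]) as 16-bit PCM WAV.

    Hypothetical helper for illustration; returns the duration in seconds.
    """
    # Clip to [-1, 1], then scale to the signed 16-bit integer range.
    pcm = [int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    # Duration follows from the sample count and the sampling rate.
    return len(pcm) / sampling_rate
```

For example, `write_wav_pcm16("parlervoice_out.wav", audio_arr.tolist(), model.config.sampling_rate)` could stand in for the `sf.write` call, at the cost of quantizing the output to 16 bits.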

Advanced Usage with Speaker Presets (Recommended)

For best results, use the ParlerVoice inference engine from the GitHub repository:

from parlervoice_infer.engine import ParlerVoiceInference
from parlervoice_infer.config import GenerationConfig

# Initialize the engine
infer = ParlerVoiceInference(
    checkpoint_path="TieIncred/ParlerVoice",
    base_model_path="parler-tts/parler-tts-mini-v1.1",
)

# Generate with speaker preset
cfg = GenerationConfig()
audio, path = infer.generate_with_speaker_preset(
    prompt="Welcome to the future of voice AI!",
    speaker="Connor",  # Choose from 85 available speakers
    preset="professional",  # Other options: casual, narration, dramatic, podcast, news_anchor
    config=cfg,
    output_path="welcome_voice.wav",
)
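To hear how one speaker sounds across several presets, the preset call can be wrapped in a loop. This is a minimal sketch: `render_presets` and its file naming scheme are hypothetical, and the only engine method it relies on is `generate_with_speaker_preset` with the keyword arguments shown in this section, assumed to return an `(audio, path)` pair.

```python
from pathlib import Path


def render_presets(infer, prompt, speaker, presets, config, out_dir="."):
    """Generate one file per preset and return {preset: output path}.

    `infer` is expected to expose generate_with_speaker_preset(...) with the
    keyword arguments used in this section and to return (audio, path).
    """
    results = {}
    for preset in presets:
        # Hypothetical naming scheme: <speaker>_<preset>.wav, lowercased.
        target = Path(out_dir) / f"{speaker.lower()}_{preset}.wav"
        _audio, path = infer.generate_with_speaker_preset(
            prompt=prompt,
            speaker=speaker,
            preset=preset,
            config=config,
            output_path=str(target),
        )
        results[preset] = path
    return results
```

With the `infer` and `cfg` objects from this section, `render_presets(infer, "Welcome to the future of voice AI!", "Connor", ["professional", "casual", "dramatic"], cfg)` would produce three files for side-by-side listening.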

Maximum Control with Rich Descriptions

# For maximum control and consistency
desc = (
    "Connor conveys a confident, professional tone with a warm and engaging delivery. "
    "He speaks with a moderate pace, clear articulation, and subtle emotional warmth. "
    "His voice has a rich, resonant quality that commands attention while remaining approachable. "
    "The recording is clean and professional with minimal background noise."
)

audio, path = infer.generate_audio(
    prompt="Innovation in AI voice technology continues to push boundaries.",
    description=desc,
    output_path="innovative_voice.wav",
)

Command Line Interface

python -m parlervoice_infer \
  --checkpoint "TieIncred/ParlerVoice" \
  --prompt "Experience the next generation of voice synthesis!" \
  --speaker Connor \
  --preset dramatic \
  --output parlervoice_demo.wav
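When the CLI is driven from a script, it is usually safer to build the argument list in Python than to concatenate a shell string. The sketch below uses only the flags shown in this section; `build_cli_command` is a hypothetical helper, and the command is only constructed and printed here, not executed.

```python
import shlex
import sys


def build_cli_command(prompt, speaker, preset, output,
                      checkpoint="TieIncred/ParlerVoice"):
    """Return the argv list for one `python -m parlervoice_infer` call."""
    return [
        sys.executable, "-m", "parlervoice_infer",
        "--checkpoint", checkpoint,
        "--prompt", prompt,
        "--speaker", speaker,
        "--preset", preset,
        "--output", output,
    ]


cmd = build_cli_command(
    prompt="Experience the next generation of voice synthesis!",
    speaker="Connor",
    preset="dramatic",
    output="parlervoice_demo.wav",
)
print(shlex.join(cmd))  # shell-quoted form, useful for logging
# To execute it: subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids quoting problems with prompts that contain spaces, quotes, or exclamation marks.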

🗣️ Speaker Library

ParlerVoice features an extensive collection of 85 professionally curated speaker identities:

🇺🇸 American Speakers

Male: Tyler, Ryan, Jackson, Kyle, Derek, Cameron, Marcus, Ethan, Parker, Hayden, Grant, Chase, Tucker, Dalton, Zach, Brandon, Austin, Trevor, Jordan, Nathan, Blake, Garrett, Caleb, Logan, Hunter, Mason, Colton, Flynn, Devin, Carson, Preston, Landon, Bryce, Jasper, Cole, Noah, Taylor, Trent, Shane, Jared, Reid, Spencer, Wyatt, Luke, Cody, Drew, Henry, Vincent, Nolan, Kane, Ian, Kent, Jace, Max, Reed, Wade, George, Seth, Cruz, Miles, John, Michael

Female: Madison, Ashley, Jennifer, Samantha, Brittany, Camille, Rachel, Paige, Haley, Megan, Alexis, Zara, Grace, Alice, Olivia

🇬🇧 British Speakers

Oliver (Male)
Sophie (Female)

🇦🇺 Australian / New Zealand

Male: Liam, Finn
Female: Ruby, Emma, Chloe

🌍 International Accents

Connor (Male, Canadian)
Thabo (Male, South African)
Marco (Male, Italian)
Cian (Male, Irish)
Wei (Male, Chinese)
Aoife (Female, Irish)
Siobhan (Female, Irish)
Johan (Male, Dutch)
Pieter (Male, Dutch)
Ingrid (Female, Dutch)
Priya (Female, Indian)
Mei, Lin, Xiao, Li, Jing, Yan (Chinese)
Elena (Female, Spanish/European)

Full details in the technical documentation

⚡ Key Capabilities

🎭 Expressive Control

Natural Language Descriptions: Control emotion, tone, pace, and style through intuitive text descriptions
Real-time Adjustment: Modify expressiveness on-the-fly for dynamic content
Contextual Awareness: Maintains consistency across long-form content

🔊 Audio Quality

High-Fidelity Output: 24kHz crystal-clear audio reproduction
Noise Control: Advanced background noise and reverb management
Speaker Consistency: Maintains voice identity across different emotional states

🚀 Performance Optimizations

Efficient Inference: Optimized for both CPU and GPU deployment
Batch Processing: Handle multiple requests simultaneously
Streaming Support: Real-time audio generation capabilities
Upstream Optimizations: Compatible with the SDPA attention and torch.compile optimizations from upstream Parler-TTS

For optimization tips, see Parler-TTS INFERENCE.md

💡 Best Practices

Recommended Usage for Optimal Results

Use speaker presets from the repository for consistent, high-quality outputs
Include named speakers in descriptions to bias towards specific voice identities
Provide detailed descriptions for maximum control over expressiveness and tone
Pull latest updates from the repo as we actively refine description phrasing
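The second recommendation is easy to check automatically before generation. A minimal sketch, assuming speaker names are matched case-sensitively as whole words; `mentions_speaker` is a hypothetical helper, not part of the ParlerVoice repository.

```python
import re


def mentions_speaker(description, speaker):
    """Return True if `speaker` appears as a whole word in `description`."""
    # Word boundaries prevent "Li" from matching inside "Lin".
    pattern = rf"\b{re.escape(speaker)}\b"
    return re.search(pattern, description) is not None
```

A pipeline could call this on each description and warn when the named speaker is missing, since an unnamed description gives the model nothing to anchor the voice identity to.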

Example Description Template

[Speaker Name] conveys a [emotion] mood through a [style] delivery. 
They speak with a [pitch level] pitch and [pace] pace. 
The voice is [expressiveness level], with [characteristics]. 
The recording is [quality level] with [background description].
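The template above can be filled programmatically so that descriptions stay consistently phrased across requests. This is a minimal sketch: `build_description` and its parameter names are hypothetical, and every argument is free-form text that is inserted as-is.

```python
def build_description(speaker, emotion, style, pitch, pace,
                      expressiveness, characteristics,
                      quality, background, pronoun="They"):
    """Fill the description template; every argument is free-form text."""
    # "They speak" but "He speaks" / "She speaks".
    verb = "speak" if pronoun.lower() == "they" else "speaks"
    return (
        f"{speaker} conveys a {emotion} mood through a {style} delivery. "
        f"{pronoun} {verb} with a {pitch} pitch and {pace} pace. "
        f"The voice is {expressiveness}, with {characteristics}. "
        f"The recording is {quality} with {background}."
    )


desc = build_description(
    speaker="Connor",
    emotion="neutral",
    style="professional",
    pitch="slightly low",
    pace="moderate",
    expressiveness="slightly expressive",
    characteristics="subtle emotional inflections",
    quality="clean",
    background="minimal background noise",
    pronoun="He",
)
print(desc)
```

The resulting string can be passed as the `description` argument in the generation examples.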

📋 License

This project is licensed under the MIT License.

Open Source & Free to Use - ParlerVoice is available for:

✅ Commercial applications and services
✅ Academic research and educational purposes
✅ Personal projects and community contributions
✅ Integration into other products and services
✅ Modification and redistribution

📚 Citations

If you use this work, please consider citing:

@software{iqbal2025parlervoice,
  title={ParlerVoice: Expressive Text-to-Speech with Advanced Speaker Control},
  author={Tausif Iqbal and Zeeshan and Anant},
  year={2025},
  publisher={VoicingAI R\&D Labs},
  url={https://github.com/VoicingAI/ParlerVoice}
}

@misc{lacombe-etal-2024-parler-tts,
  author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},
  title = {Parler-TTS},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huggingface/parler-tts}}
}

@misc{lyth2024natural,
  title={Natural language guidance of high-fidelity text-to-speech with synthetic annotations},
  author={Dan Lyth and Simon King},
  year={2024},
  eprint={2402.01912},
  archivePrefix={arXiv},
  primaryClass={cs.SD}
}

🔗 Resources

📦 GitHub Repository: VoicingAI/ParlerVoice
📊 Technical Report & Samples: Notion Documentation
🤗 Hugging Face Model: TieIncred/ParlerVoice
🎯 Base Model: parler-tts/parler-tts-mini-v1.1

Made with ❤️ by VoicingAI R&D Labs

Principal Researcher: Tausif Iqbal

Core Team: Zeeshan • Anant

Developed at VoicingAI

Model size: 0.9B params (Safetensors)
Tensor type: F32
