ParlerVoice

Professional Text-to-Speech by VoicingAI R&D Labs

License: MIT · Python 3.8+ · Hugging Face

ParlerVoice is a text-to-speech model fine-tuned from parler-tts/parler-tts-mini-v1.1 for finer expressive control and more consistent speaker identity. Trained on 650+ hours of curated proprietary audio, it synthesizes 24kHz speech from a text prompt plus a natural-language description of the voice.


✨ Key Features

  • 🏆 Extensive Training Data: Fine-tuned on 650+ hours of carefully curated, high-quality proprietary audio data (dataset release coming soon!)
  • 👥 Comprehensive Speaker Library: 85 distinct speaker identities with consistent, recognizable voices across different accents and demographics
  • 🎭 Advanced Expressiveness: Precise control over tone, emotion, pitch, pace, style, reverb, and background noise through natural language descriptions
  • 🔬 Technical Architecture: Advanced two-tokenizer system enabling both prompt-based and description-based generation
  • 🌍 Multi-Accent Support: Coverage for American, British, Australian, Canadian, South African, Italian, and Irish accents

Technical Specifications

  • Base Model: parler-tts/parler-tts-mini-v1.1
  • Training Data: 650+ hours of curated proprietary audio (dataset release coming soon - stay tuned!)
  • Architecture: Two-tokenizer flow for enhanced control and consistency
  • Output Quality: 24kHz high-fidelity audio generation

📈 Technical Performance

Our technical evaluation demonstrates strong performance across key metrics:

  1. 🏆 Performance Benchmarks: Achieved 95.2% speaker similarity consistency across different emotional states and 4.7/5.0 naturalness score in comprehensive human evaluations

  2. 🔬 Architecture Studies: Analysis showed the two-tokenizer approach provides improved expressive control compared to single-tokenizer baselines

  3. ⚖️ Comparative Analysis: Offers competitive inference speed while maintaining high audio quality at 24kHz resolution

  4. 🌍 Dataset Quality: The 650+ hour curated proprietary dataset supports 85 distinct voice identities across 7 accent categories (public release coming soon!)

📊 View Full Technical Report & Audio Samples


🛠 Installation

# Install base dependencies
pip install git+https://github.com/huggingface/parler-tts.git

# Install ParlerVoice extras for advanced features and presets
# (run from a clone of the ParlerVoice repository, where requirements.txt lives)
pip install -r requirements.txt

💻 Usage

Quick Start with Transformers API

import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the model
model = ParlerTTSForConditionalGeneration.from_pretrained("TieIncred/ParlerVoice").to(device)
prompt_tokenizer = AutoTokenizer.from_pretrained("TieIncred/ParlerVoice")
description_tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder._name_or_path)

prompt = "Hey, how are you doing today?"
description = (
    "Connor conveys a neutral mood through a professional and controlled delivery. "
    "He speaks with a slightly low pitch, adding subtle weight to his delivery. "
    "His pace is moderate, keeping the speech easy to follow. "
    "His voice is slightly expressive, with subtle emotional inflections. "
    "The recording is exceptionally clean and close-sounding."
)

desc_inputs = description_tokenizer(description, return_tensors="pt").to(device)
prompt_inputs = prompt_tokenizer(prompt, return_tensors="pt").to(device)

gen = model.generate(
    input_ids=desc_inputs.input_ids,
    attention_mask=desc_inputs.attention_mask,
    prompt_input_ids=prompt_inputs.input_ids,
    prompt_attention_mask=prompt_inputs.attention_mask,
)

audio_arr = gen.cpu().numpy().squeeze()
sf.write("parlervoice_out.wav", audio_arr, model.config.sampling_rate)
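
If soundfile is not available, the same array can be saved with the standard library alone. The following is a minimal sketch, not part of ParlerVoice: `write_pcm16_wav` is a hypothetical helper, and it assumes the generated audio is mono with float samples in the range [-1.0, 1.0] at the 24kHz rate stated above.

```python
import struct
import wave


def write_pcm16_wav(path, samples, sample_rate=24000):
    """Write mono float samples as a 16-bit PCM WAV file.

    Assumption: samples lie in [-1.0, 1.0]; values outside are clipped.
    Returns the clip duration in seconds.
    """
    frames = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))
        # Scale to the signed 16-bit range and pack little-endian.
        frames += struct.pack("<h", int(round(clipped * 32767)))

    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)          # mono
        wav.setsampwidth(2)          # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))

    return (len(frames) // 2) / sample_rate
```

With the variables from the quick start, this would be called as `write_pcm16_wav("parlervoice_out.wav", audio_arr, model.config.sampling_rate)`. The per-sample loop is slow for long clips, so soundfile remains the better choice when it is installed.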

Advanced Usage with Speaker Presets (Recommended)

For best results, use the ParlerVoice inference engine from the GitHub repository:

from parlervoice_infer.engine import ParlerVoiceInference
from parlervoice_infer.config import GenerationConfig

# Initialize the engine
infer = ParlerVoiceInference(
    checkpoint_path="TieIncred/ParlerVoice",
    base_model_path="parler-tts/parler-tts-mini-v1.1",
)

# Generate with speaker preset
cfg = GenerationConfig()
audio, path = infer.generate_with_speaker_preset(
    prompt="Welcome to the future of voice AI!",
    speaker="Connor",  # Choose from 85 available speakers
    preset="professional",  # Other listed presets: casual, narration, dramatic, podcast, news_anchor
    config=cfg,
    output_path="welcome_voice.wav",
)
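
To compare how one speaker sounds across presets, the call above can be wrapped in a loop. This is a sketch under stated assumptions: `render_presets` and `DOCUMENTED_PRESETS` are hypothetical names, the preset list is only what this README mentions, and the engine is assumed to return `(audio, path)` exactly as in the example above.

```python
from pathlib import Path

# Preset names mentioned in this README; the engine may accept others.
DOCUMENTED_PRESETS = (
    "professional", "casual", "narration", "dramatic", "podcast", "news_anchor",
)


def render_presets(infer, prompt, speaker, config, out_dir,
                   presets=DOCUMENTED_PRESETS):
    """Generate one file per preset and return {preset: output path}.

    `infer` is any object exposing generate_with_speaker_preset with the
    keyword arguments used in the example above.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    paths = {}
    for preset in presets:
        target = out_dir / f"{speaker.lower()}_{preset}.wav"
        _audio, path = infer.generate_with_speaker_preset(
            prompt=prompt,
            speaker=speaker,
            preset=preset,
            config=config,
            output_path=str(target),
        )
        paths[preset] = path
    return paths
```

For example, `render_presets(infer, "Welcome to the future of voice AI!", "Connor", cfg, "preset_demo")` would write one WAV file per preset into `preset_demo/`.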

Maximum Control with Rich Descriptions

# For maximum control and consistency
desc = (
    "Connor conveys a confident, professional tone with a warm and engaging delivery. "
    "He speaks with a moderate pace, clear articulation, and subtle emotional warmth. "
    "His voice has a rich, resonant quality that commands attention while remaining approachable. "
    "The recording is clean and professional with minimal background noise."
)

audio, path = infer.generate_audio(
    prompt="Innovation in AI voice technology continues to push boundaries.",
    description=desc,
    output_path="innovative_voice.wav",
)

Command Line Interface

python -m parlervoice_infer \
  --checkpoint "TieIncred/ParlerVoice" \
  --prompt "Experience the next generation of voice synthesis!" \
  --speaker Connor \
  --preset dramatic \
  --output parlervoice_demo.wav
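
The same command can be driven from a script. The sketch below only assembles the argument list from the flags shown above; `build_cli_args` is a hypothetical helper, and no flags beyond those five are assumed to exist.

```python
import shlex
import sys


def build_cli_args(prompt, speaker, preset, output,
                   checkpoint="TieIncred/ParlerVoice"):
    """Return the argv list for `python -m parlervoice_infer`.

    Only the flags shown in this README are used.
    """
    return [
        sys.executable, "-m", "parlervoice_infer",
        "--checkpoint", checkpoint,
        "--prompt", prompt,
        "--speaker", speaker,
        "--preset", preset,
        "--output", output,
    ]


# Print the command as it would be typed in a shell.
args = build_cli_args(
    "Experience the next generation of voice synthesis!",
    "Connor", "dramatic", "parlervoice_demo.wav",
)
print(shlex.join(args))
```

Passing the list to `subprocess.run(args, check=True)` runs the command without shell quoting issues in the prompt text.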

🗣️ Speaker Library

ParlerVoice features an extensive collection of 85 professionally curated speaker identities:

🇺🇸 American Speakers

Male: Tyler, Ryan, Jackson, Kyle, Derek, Cameron, Marcus, Ethan, Parker, Hayden, Grant, Chase, Tucker, Dalton, Zach, Brandon, Austin, Trevor, Jordan, Nathan, Blake, Garrett, Caleb, Logan, Hunter, Mason, Colton, Flynn, Devin, Carson, Preston, Landon, Bryce, Jasper, Cole, Noah, Taylor, Trent, Shane, Jared, Reid, Spencer, Wyatt, Luke, Cody, Drew, Henry, Vincent, Nolan, Kane, Ian, Kent, Jace, Max, Reed, Wade, George, Seth, Cruz, Miles, John, Michael

Female: Madison, Ashley, Jennifer, Samantha, Brittany, Camille, Rachel, Paige, Haley, Megan, Alexis, Zara, Grace, Alice, Olivia

🇬🇧 British Speakers

  • Oliver (Male)
  • Sophie (Female)

🇦🇺 Australian / New Zealand

  • Male: Liam, Finn
  • Female: Ruby, Emma, Chloe

🌍 International Accents

  • Connor (Male, Canadian)
  • Thabo (Male, South African)
  • Marco (Male, Italian)
  • Cian (Male, Irish)
  • Wei (Male, Chinese)
  • Aoife (Female, Irish)
  • Siobhan (Female, Irish)
  • Johan (Male, Dutch)
  • Pieter (Male, Dutch)
  • Ingrid (Female, Dutch)
  • Priya (Female, Indian)
  • Mei, Lin, Xiao, Li, Jing, Yan (Chinese)
  • Elena (Female, Spanish/European)

Full details in the technical documentation
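
When picking a voice programmatically, a small lookup table can help. The sketch below is illustrative only: `SPEAKERS` holds a subset of the names listed above, and `find_speakers` is a hypothetical helper rather than part of the ParlerVoice package.

```python
# Subset of the speaker list above: name -> (gender, accent).
SPEAKERS = {
    "Oliver": ("Male", "British"),
    "Sophie": ("Female", "British"),
    "Connor": ("Male", "Canadian"),
    "Thabo": ("Male", "South African"),
    "Marco": ("Male", "Italian"),
    "Cian": ("Male", "Irish"),
    "Aoife": ("Female", "Irish"),
    "Siobhan": ("Female", "Irish"),
}


def find_speakers(accent=None, gender=None):
    """Return sorted speaker names matching the given accent and/or gender."""
    matches = []
    for name, (spk_gender, spk_accent) in SPEAKERS.items():
        if accent is not None and spk_accent.lower() != accent.lower():
            continue
        if gender is not None and spk_gender.lower() != gender.lower():
            continue
        matches.append(name)
    return sorted(matches)


print(find_speakers(accent="Irish"))
print(find_speakers(accent="British", gender="Female"))
```

The returned names can be passed as the `speaker` argument in the preset example or written into a description.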


⚑ Key Capabilities

🎭 Expressive Control

  • Natural Language Descriptions: Control emotion, tone, pace, and style through intuitive text descriptions
  • Real-time Adjustment: Modify expressiveness on-the-fly for dynamic content
  • Contextual Awareness: Maintains consistency across long-form content

🔊 Audio Quality

  • High-Fidelity Output: 24kHz crystal-clear audio reproduction
  • Noise Control: Advanced background noise and reverb management
  • Speaker Consistency: Maintains voice identity across different emotional states

🚀 Performance Optimizations

  • Efficient Inference: Optimized for both CPU and GPU deployment
  • Batch Processing: Handle multiple requests simultaneously
  • Streaming Support: Real-time audio generation capabilities
  • Compatible with SDPA and compile optimizations from upstream Parler-TTS

For optimization tips, see Parler-TTS INFERENCE.md


💡 Best Practices

Recommended Usage for Optimal Results

  • Use speaker presets from the repository for consistent, high-quality outputs
  • Include named speakers in descriptions to bias towards specific voice identities
  • Provide detailed descriptions for maximum control over expressiveness and tone
  • Pull latest updates from the repo as we actively refine description phrasing

Example Description Template

[Speaker Name] conveys a [emotion] mood through a [style] delivery. 
They speak with a [pitch level] pitch and [pace] pace. 
The voice is [expressiveness level], with [characteristics]. 
The recording is [quality level] with [background description].
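
The template can be filled in code so that descriptions stay consistently phrased. This is a minimal sketch: `build_description` and its parameter names are hypothetical, chosen to mirror the bracketed slots above.

```python
def build_description(speaker, emotion, style, pitch, pace,
                      expressiveness, characteristics, quality, background,
                      pronoun="They"):
    """Fill the description template above and return it as one string."""
    # "They speak" vs. "He speaks" / "She speaks".
    verb = "speak" if pronoun.lower() == "they" else "speaks"
    sentences = [
        f"{speaker} conveys a {emotion} mood through a {style} delivery.",
        f"{pronoun} {verb} with a {pitch} pitch and {pace} pace.",
        f"The voice is {expressiveness}, with {characteristics}.",
        f"The recording is {quality} with {background}.",
    ]
    return " ".join(sentences)


print(build_description(
    speaker="Connor",
    emotion="neutral",
    style="professional",
    pitch="slightly low",
    pace="moderate",
    expressiveness="slightly expressive",
    characteristics="subtle emotional inflections",
    quality="very clean",
    background="no background noise",
    pronoun="He",
))
```

The resulting string can be used as the `description` in any of the generation examples above.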

📋 License

This project is licensed under the MIT License.

Open Source & Free to Use - ParlerVoice is available for:

  • ✅ Commercial applications and services
  • ✅ Academic research and educational purposes
  • ✅ Personal projects and community contributions
  • ✅ Integration into other products and services
  • ✅ Modification and redistribution

📚 Citations

If you use this work, please consider citing:

@software{iqbal2025parlervoice,
  title={ParlerVoice: Expressive Text-to-Speech with Advanced Speaker Control},
  author={Tausif Iqbal and Zeeshan and Anant},
  year={2025},
  publisher={VoicingAI R\&D Labs},
  url={https://github.com/VoicingAI/ParlerVoice}
}

@misc{lacombe-etal-2024-parler-tts,
  author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},
  title = {Parler-TTS},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huggingface/parler-tts}}
}

@misc{lyth2024natural,
  title={Natural language guidance of high-fidelity text-to-speech with synthetic annotations},
  author={Dan Lyth and Simon King},
  year={2024},
  eprint={2402.01912},
  archivePrefix={arXiv},
  primaryClass={cs.SD}
}


Made with ❤️ by VoicingAI R&D Labs

Principal Researcher: Tausif Iqbal

Core Team: Zeeshan • Anant

Developed at VoicingAI
