SpeechT5 Fine-tuned for Telugu Text-to-Speech

This model is a fine-tuned version of microsoft/speecht5_tts specifically adapted for Telugu text-to-speech synthesis. The model was trained on the IndicTTS Telugu dataset and includes custom preprocessing pipelines for handling Telugu phonemes and morphologically rich text structures.

Model Description

This Telugu TTS model is built upon Microsoft's SpeechT5 architecture, which combines the strengths of both speech and text processing through a unified encoder-decoder framework. The model has been specifically fine-tuned to handle Telugu language characteristics including:

  • Complex phoneme structures unique to Telugu
  • Morphologically rich text patterns
  • Custom transliteration and phoneme conversion pipelines
  • Optimized token mappings for Telugu script

The fine-tuning process involved extensive exploratory data analysis (EDA) to identify and address phoneme imbalances in the Telugu dataset, resulting in improved model robustness and accuracy.
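The imbalance check described above can be approximated with a simple frequency count. The following is a minimal sketch, not the author's EDA code: it counts Unicode code points as a rough proxy for phonemes, and the helper names, the `min_share` threshold, and the three-sentence corpus are all hypothetical.

```python
from collections import Counter


def char_frequencies(sentences):
    """Count how often each non-whitespace character occurs across all sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ch for ch in sentence if not ch.isspace())
    return counts


def rare_chars(counts, min_share=0.01):
    """Return the characters whose share of all counted characters is below min_share."""
    total = sum(counts.values())
    return sorted(ch for ch, n in counts.items() if n / total < min_share)


# Hypothetical mini-corpus; a real analysis would run over the dataset transcripts.
corpus = ["నమస్కారం", "మీరు ఎలా ఉన్నారు", "ధన్యవాదాలు"]
counts = char_frequencies(corpus)
print(counts.most_common(5))
print(rare_chars(counts, min_share=0.05))
```

Characters that fall below the threshold are candidates for closer inspection, for example to check whether the training set contains enough utterances that use them.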

Intended Uses & Limitations

Intended Uses

  • Telugu Text-to-Speech: Convert Telugu text to natural-sounding speech
  • Accessibility Applications: Assist visually impaired users with Telugu content
  • Educational Tools: Language learning applications for Telugu
  • Content Creation: Generate Telugu voiceovers for multimedia content
  • Research: Academic research in Indian language speech synthesis

Limitations

  • Language Scope: Optimized specifically for Telugu; may not perform well on other languages
  • Data Dependency: Performance quality depends on the diversity of the training dataset
  • Computational Requirements: Requires significant computational resources for inference
  • Accent Variations: May not capture all regional Telugu accent variations
  • Technical Text: May struggle with technical terms, foreign words, or mixed-language content

Training and Evaluation Data

The model was trained on the IndicTTS Telugu dataset containing 8,576 high-quality Telugu audio samples. The dataset preprocessing included:

  • Custom Transliteration Pipeline: Developed specifically for Telugu script to phoneme conversion
  • Phoneme Balancing: EDA-driven approach to address phoneme distribution imbalances
  • Token Mapping Optimization: Refined mappings between Telugu characters and model tokens
  • Quality Filtering: Ensured high-quality audio-text pairs for training

The dataset represents a diverse range of Telugu speakers and content types, providing a solid foundation for general-purpose Telugu TTS applications.
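The token mapping step in the preprocessing list can be illustrated with a vocabulary coverage check. This is a sketch under stated assumptions: the card does not publish its mapping, so the function names, the toy vocabulary, and the three-entry romanization table below are hypothetical and only show the general shape of such a step.

```python
def unsupported_chars(texts, vocab):
    """Return the sorted characters that appear in texts but are missing from vocab."""
    seen = set()
    for text in texts:
        seen.update(text)
    return sorted(seen - set(vocab))


def apply_replacements(text, replacements):
    """Rewrite text using (source, target) pairs, applied in the given order."""
    for src, dst in replacements:
        text = text.replace(src, dst)
    return text


# Hypothetical values for illustration only; a real vocabulary would come from
# the tokenizer and a real table would cover the full Telugu script.
toy_vocab = set("abcdefghijklmnopqrstuvwxyz ")
toy_replacements = [("క", "ka"), ("మ", "ma"), ("ల", "la")]

print(unsupported_chars(["కమల"], toy_vocab))
print(apply_replacements("కమల", toy_replacements))
```

Running the coverage check before and after applying the replacements is one way to confirm that no character reaches the tokenizer unmapped.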

Training Procedure

Training Hyperparameters

The model was trained with the following configuration optimized for Telugu language characteristics:

  • Learning Rate: 0.001
  • Training Batch Size: 4
  • Evaluation Batch Size: 2
  • Seed: 42
  • Gradient Accumulation Steps: 8
  • Total Training Batch Size: 32
  • Optimizer: AdamW with betas=(0.9, 0.999), epsilon=1e-08
  • Learning Rate Scheduler: Linear with 100 warmup steps
  • Total Training Steps: 1,000
  • Mixed Precision: Native AMP for efficient training
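To make the batch size and scheduler settings concrete, the sketch below recomputes the effective batch size and the learning rate at a few steps. It assumes a single training device (consistent with 4 × 8 = 32) and a schedule that warms up linearly from zero and then decays linearly to zero at the last step; the function name is hypothetical and this is not the training script itself.

```python
def linear_schedule_lr(step, base_lr=1e-3, warmup_steps=100, total_steps=1000):
    """Learning rate at an optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


per_device_batch = 4
grad_accum_steps = 8
# Samples contributing to each optimizer update (assumes one device)
effective_batch = per_device_batch * grad_accum_steps

print("effective batch size:", effective_batch)
for step in (0, 50, 100, 550, 1000):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the learning rate peaks at 0.001 at step 100 and is back at half that value by step 550.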

Training Results

Validation loss decreased from 0.6689 at step 100 to 0.4496 at step 1,000. After a temporary rise to 0.7610 at step 200, it fell at every subsequent evaluation:

Training Loss   Epoch    Step   Validation Loss
0.7785          0.4145   100    0.6689
0.8247          0.8290   200    0.7610
0.6961          1.2404   300    0.6406
0.6305          1.6549   400    0.5726
0.5784          2.0663   500    0.5422
0.5582          2.4808   600    0.5184
0.5399          2.8953   700    0.4992
0.5132          3.3067   800    0.4786
0.4903          3.7212   900    0.4617
0.4774          4.1326   1000   0.4496
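The figures in the table can be checked with a few lines of arithmetic. The sketch below copies the rows as reported and derives the relative drop in validation loss, the evaluations at which validation loss rose, and an estimate of optimizer steps per epoch; the variable names are illustrative.

```python
# (step, epoch, training_loss, validation_loss), copied from the results table
rows = [
    (100, 0.4145, 0.7785, 0.6689),
    (200, 0.8290, 0.8247, 0.7610),
    (300, 1.2404, 0.6961, 0.6406),
    (400, 1.6549, 0.6305, 0.5726),
    (500, 2.0663, 0.5784, 0.5422),
    (600, 2.4808, 0.5582, 0.5184),
    (700, 2.8953, 0.5399, 0.4992),
    (800, 3.3067, 0.5132, 0.4786),
    (900, 3.7212, 0.4903, 0.4617),
    (1000, 4.1326, 0.4774, 0.4496),
]

val = [r[3] for r in rows]

# Steps at which validation loss was higher than at the previous evaluation
rises = [rows[i][0] for i in range(1, len(rows)) if val[i] > val[i - 1]]

# Relative reduction between the first and last evaluation
reduction = (val[0] - val[-1]) / val[0]

# Evaluation with the lowest validation loss
best_step, best_val = min(((r[0], r[3]) for r in rows), key=lambda p: p[1])

# Optimizer steps per epoch, estimated from the final row
steps_per_epoch = rows[-1][0] / rows[-1][1]

print("rises at steps:", rises)            # -> [200]
print(f"reduction: {reduction:.1%}")       # -> 32.8%
print("best step:", best_step, best_val)   # -> 1000 0.4496
print(f"steps per epoch: {steps_per_epoch:.0f}")
```

The estimate comes to about 242 optimizer steps per epoch, which at 32 samples per step is roughly 7,700 training samples per epoch.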

Usage

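from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
import torch

# Load the processor, the fine-tuned model, and the HiFi-GAN vocoder
processor = SpeechT5Processor.from_pretrained("your-username/speecht5_finetuned_telugu_charan")
model = SpeechT5ForTextToSpeech.from_pretrained("your-username/speecht5_finetuned_telugu_charan")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Prepare Telugu text input
text = "మీ Telugu వాక్యం ఇక్కడ రాయండి"  # Write your Telugu sentence here
inputs = processor(text=text, return_tensors="pt")

# generate_speech raises an error without a speaker embedding of shape (1, 512).
# Placeholder only (assumption): replace the zeros with an x-vector for the target voice.
speaker_embeddings = torch.zeros((1, 512))

# Generate speech; with a vocoder the result is a 1-D waveform tensor at 16 kHz
# (without one, generate_speech returns a mel spectrogram instead)
with torch.no_grad():
    speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)

# The speech output can be saved or played as audio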
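Saving the output needs only the Python standard library. The sketch below assumes the waveform is available as a sequence of float samples in the range [-1.0, 1.0] at 16 kHz, the rate used by the SpeechT5 HiFi-GAN vocoder; `write_wav` is a hypothetical helper and the sine tone merely stands in for model output.

```python
import math
import struct
import wave


def write_wav(path, samples, sample_rate=16000):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))  # guard against out-of-range values
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))


# Stand-in for model output: half a second of a 440 Hz tone
tone = [0.3 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(8000)]
write_wav("example.wav", tone)
```

For a PyTorch waveform tensor, `write_wav("speech.wav", speech.tolist())` would pass the samples in the expected form.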

Model Performance

  • Final Validation Loss: 0.4496
  • Training Convergence: Validation loss fell at every evaluation from step 300 through the final step (1,000)
  • Phoneme Accuracy: Improved through custom EDA and token mapping optimization
  • Speech Quality: Natural-sounding Telugu speech generation

Technical Details

Framework Versions

  • Transformers: 4.47.0
  • PyTorch: 2.5.1+cu121
  • Datasets: 3.3.1
  • Tokenizers: 0.21.0

Model Architecture

  • Base Model: Microsoft SpeechT5 TTS
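  • Encoder: Transformer encoder fed by a text pre-net; the SpeechT5 backbone design is shared across speech and text tasks
  • Decoder: Autoregressive decoder that predicts mel-spectrogram frames
  • Vocoder: Compatible with HiFi-GAN, which converts the predicted spectrogram into a waveform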

Citation

If you use this model in your research, please cite:

@misc{speecht5_telugu_charan,
  title={SpeechT5 Fine-tuned for Telugu Text-to-Speech},
  author={Rama Charan Pisupati},
  year={2025},
  howpublished={\url{https://huggingface.co/your-username/speecht5_finetuned_telugu_charan}}
}

Acknowledgments

  • Microsoft Research for the original SpeechT5 architecture
  • IndicTTS team for providing the Telugu dataset
  • Hugging Face for the transformers library and model hosting platform

Contact

For questions or issues related to this model, please contact:
