SpeechT5 Fine-tuned for Telugu Text-to-Speech
This model is a fine-tuned version of microsoft/speecht5_tts specifically adapted for Telugu text-to-speech synthesis. The model was trained on the IndicTTS Telugu dataset and includes custom preprocessing pipelines for handling Telugu phonemes and morphologically rich text structures.
Model Description
This Telugu TTS model is built upon Microsoft's SpeechT5 architecture, which combines the strengths of both speech and text processing through a unified encoder-decoder framework. The model has been specifically fine-tuned to handle Telugu language characteristics including:
- Complex phoneme structures unique to Telugu
- Morphologically rich text patterns
- Custom transliteration and phoneme conversion pipelines
- Optimized token mappings for Telugu script
The fine-tuning process involved extensive exploratory data analysis (EDA) to identify and address phoneme imbalances in the Telugu dataset, resulting in improved model robustness and accuracy.
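The card does not publish the transliteration pipeline itself. As a rough illustration of what a Telugu-script-to-Latin conversion step can look like, the sketch below handles a small subset of the script: a few consonants, vowel signs, independent vowels, and the virama. The tables and the `transliterate` function are hypothetical and are not the mapping used to train this model.

```python
# Hypothetical, partial mapping tables for illustration only;
# this is NOT the mapping used to train the model.
CONSONANTS = {
    "\u0c15": "k",  # TELUGU LETTER KA
    "\u0c17": "g",  # TELUGU LETTER GA
    "\u0c24": "t",  # TELUGU LETTER TA
    "\u0c28": "n",  # TELUGU LETTER NA
    "\u0c2e": "m",  # TELUGU LETTER MA
    "\u0c30": "r",  # TELUGU LETTER RA
    "\u0c32": "l",  # TELUGU LETTER LA
}
VOWEL_SIGNS = {
    "\u0c3e": "aa",  # TELUGU VOWEL SIGN AA
    "\u0c3f": "i",   # TELUGU VOWEL SIGN I
    "\u0c41": "u",   # TELUGU VOWEL SIGN U
    "\u0c46": "e",   # TELUGU VOWEL SIGN E
    "\u0c4a": "o",   # TELUGU VOWEL SIGN O
}
INDEPENDENT_VOWELS = {
    "\u0c05": "a",  # TELUGU LETTER A
    "\u0c07": "i",  # TELUGU LETTER I
    "\u0c09": "u",  # TELUGU LETTER U
}
VIRAMA = "\u0c4d"  # suppresses the consonant's inherent vowel


def transliterate(text: str) -> str:
    """Romanize the subset of Telugu covered by the tables above."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            nxt = text[i + 1] if i + 1 < len(text) else ""
            if nxt in VOWEL_SIGNS:
                out.append(VOWEL_SIGNS[nxt])  # explicit vowel sign replaces "a"
                i += 1
            elif nxt == VIRAMA:
                i += 1  # bare consonant, no vowel
            else:
                out.append("a")  # inherent vowel
        elif ch in INDEPENDENT_VOWELS:
            out.append(INDEPENDENT_VOWELS[ch])
        else:
            out.append(ch)  # pass unknown characters through unchanged
        i += 1
    return "".join(out)


# The word for "Telugu", spelled TA + E, LA + U, GA + U
print(transliterate("\u0c24\u0c46\u0c32\u0c41\u0c17\u0c41"))  # prints "telugu"
```

A production pipeline would need the full consonant and vowel inventory, the anusvara, and conjunct handling, none of which this sketch attempts.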
Intended Uses & Limitations
Intended Uses
- Telugu Text-to-Speech: Convert Telugu text to natural-sounding speech
- Accessibility Applications: Assist visually impaired users with Telugu content
- Educational Tools: Language learning applications for Telugu
- Content Creation: Generate Telugu voiceovers for multimedia content
- Research: Academic research in Indian language speech synthesis
Limitations
- Language Scope: Optimized specifically for Telugu; may not perform well on other languages
- Data Dependency: Performance quality depends on the diversity of the training dataset
- Computational Requirements: Requires significant computational resources for inference
- Accent Variations: May not capture all regional Telugu accent variations
- Technical Text: May struggle with technical terms, foreign words, or mixed-language content
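One way to anticipate the mixed-language limitation is to flag input tokens that contain letters outside the Telugu Unicode block (U+0C00 to U+0C7F) before synthesis. The sketch below does only that; the helper names are assumptions made for illustration and are not part of the model's API.

```python
def is_telugu_char(ch: str) -> bool:
    """True if the character lies in the Telugu Unicode block."""
    return "\u0c00" <= ch <= "\u0c7f"


def non_telugu_tokens(text: str) -> list[str]:
    """Return whitespace-separated tokens containing non-Telugu letters."""
    flagged = []
    for token in text.split():
        # Only alphabetic characters are checked, so digits and
        # punctuation do not cause a token to be flagged.
        letters = [ch for ch in token if ch.isalpha()]
        if any(not is_telugu_char(ch) for ch in letters):
            flagged.append(token)
    return flagged
```

Flagged tokens could then be transliterated, spelled out, or reviewed by hand before being passed to the model.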
Training and Evaluation Data
The model was trained on the IndicTTS Telugu dataset containing 8,576 high-quality Telugu audio samples. The dataset preprocessing included:
- Custom Transliteration Pipeline: Developed specifically for Telugu script to phoneme conversion
- Phoneme Balancing: EDA-driven approach to address phoneme distribution imbalances
- Token Mapping Optimization: Refined mappings between Telugu characters and model tokens
- Quality Filtering: Ensured high-quality audio-text pairs for training
The dataset represents a diverse range of Telugu speakers and content types, providing a solid foundation for general-purpose Telugu TTS applications.
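The card does not describe how the phoneme imbalance was measured. A simple first pass, sketched below, counts how often each character occurs across the transcripts and lists the ones that fall below a chosen share of the total. The function name, the `min_share` threshold, and the sample transcripts are illustrative assumptions, and character counts are only a proxy for phoneme counts.

```python
from collections import Counter


def rare_symbols(transcripts, min_share=0.01):
    """Return characters whose share of all non-space characters is below min_share.

    The result is sorted from rarest to most common.
    """
    counts = Counter(
        ch for line in transcripts for ch in line if not ch.isspace()
    )
    total = sum(counts.values())
    rare = [ch for ch, n in counts.items() if n / total < min_share]
    return sorted(rare, key=lambda ch: counts[ch])


# Placeholder transcripts; real use would pass the dataset's text column.
print(rare_symbols(["aaaa", "aaab"], min_share=0.2))  # prints ['b']
```

Symbols that appear in this list are candidates for oversampling, extra data collection, or merging with a closely related token.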
Training Procedure
Training Hyperparameters
The model was trained with the following configuration optimized for Telugu language characteristics:
- Learning Rate: 0.001
- Training Batch Size: 4
- Evaluation Batch Size: 2
- Seed: 42
- Gradient Accumulation Steps: 8
- Total Training Batch Size: 32
- Optimizer: AdamW with betas=(0.9, 0.999), epsilon=1e-08
- Learning Rate Scheduler: Linear with 100 warmup steps
- Total Training Steps: 1,000
- Mixed Precision: Native AMP for efficient training
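Two of the values above follow from the others. The total batch size is the per-device batch size times the gradient accumulation steps (4 × 8 = 32, assuming a single device). A linear scheduler with warmup ramps the learning rate from 0 to its peak over the warmup steps, then decays it linearly to 0 at the final step. The sketch below reproduces that arithmetic in plain Python; the function name is illustrative, and the code mirrors the usual linear-warmup formula rather than calling the training library.

```python
PEAK_LR = 1e-3
WARMUP_STEPS = 100
TOTAL_STEPS = 1000
PER_DEVICE_BATCH = 4
GRAD_ACCUM_STEPS = 8

# Samples contributing to each optimizer update (single-device assumption).
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS


def linear_lr(step: int) -> float:
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 50, 100, 550, 1000):
    print(step, linear_lr(step))
```

Under these settings the learning rate peaks at step 100 and has dropped to half its peak by step 550.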
Training Results
Validation loss fell from 0.6689 at step 100 to 0.4496 at step 1,000. It rose once, to 0.7610 at step 200, and decreased at every evaluation after that:
| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| 0.7785 | 0.4145 | 100 | 0.6689 |
| 0.8247 | 0.8290 | 200 | 0.7610 |
| 0.6961 | 1.2404 | 300 | 0.6406 |
| 0.6305 | 1.6549 | 400 | 0.5726 |
| 0.5784 | 2.0663 | 500 | 0.5422 |
| 0.5582 | 2.4808 | 600 | 0.5184 |
| 0.5399 | 2.8953 | 700 | 0.4992 |
| 0.5132 | 3.3067 | 800 | 0.4786 |
| 0.4903 | 3.7212 | 900 | 0.4617 |
| 0.4774 | 4.1326 | 1000 | 0.4496 |
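The figures in the table can be checked with a few lines of Python. The sketch below hard-codes the validation losses from the table, finds the step with the lowest loss, computes the relative reduction from the first to the last evaluation, and lists the steps where validation loss rose. Variable names are illustrative.

```python
# Validation loss by training step, copied from the table above.
val_loss = {
    100: 0.6689, 200: 0.7610, 300: 0.6406, 400: 0.5726, 500: 0.5422,
    600: 0.5184, 700: 0.4992, 800: 0.4786, 900: 0.4617, 1000: 0.4496,
}

steps = sorted(val_loss)
best_step = min(val_loss, key=val_loss.get)

first, last = val_loss[steps[0]], val_loss[steps[-1]]
relative_reduction = (first - last) / first

# Steps whose validation loss is higher than at the previous evaluation.
regressions = [
    cur for prev, cur in zip(steps, steps[1:]) if val_loss[cur] > val_loss[prev]
]

print(f"best step: {best_step}")
print(f"relative reduction: {relative_reduction:.1%}")
print(f"regressions at steps: {regressions}")
```

The lowest validation loss is at the final step, the reduction from step 100 to step 1,000 is about 33%, and step 200 is the only evaluation where the loss increased.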
Usage
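from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
import torch
# Load the processor, the fine-tuned model, and the HiFi-GAN vocoder
processor = SpeechT5Processor.from_pretrained("your-username/speecht5_finetuned_telugu_charan")
model = SpeechT5ForTextToSpeech.from_pretrained("your-username/speecht5_finetuned_telugu_charan")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
# Prepare Telugu text input
text = "మీ Telugu వాక్యం ఇక్కడ రాయండి"  # Write your Telugu sentence here
inputs = processor(text=text, return_tensors="pt")
# SpeechT5 needs a speaker embedding of shape (1, 512); it cannot be None.
# The zero vector is only a placeholder: replace it with an x-vector for the target speaker.
speaker_embeddings = torch.zeros((1, 512))
# Generate speech; passing the vocoder returns a waveform instead of a mel spectrogram
with torch.no_grad():
    speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
# speech is a 1-D waveform tensor sampled at 16 kHz that can be saved or played as audio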
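To write the result to disk without extra dependencies, the sketch below converts a sequence of floating-point samples in the range [-1, 1] to 16-bit PCM and saves it with Python's standard `wave` module. It assumes `speech` is a 1-D waveform at 16 kHz, which is what SpeechT5 produces when a HiFi-GAN vocoder is passed to `generate_speech`. The `save_wav` helper is a hypothetical name, not part of the transformers library.

```python
import array
import sys
import wave


def save_wav(samples, path, sample_rate=16000):
    """Write an iterable of float samples in [-1, 1] as a mono 16-bit WAV file."""
    # Clip each sample to [-1, 1] and scale to the signed 16-bit range.
    pcm = array.array(
        "h",
        (int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples),
    )
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())


# Example, assuming `speech` is the tensor returned by generate_speech:
# save_wav(speech.tolist(), "telugu_output.wav")
```

Converting the tensor with `tolist()` keeps the helper free of any PyTorch or NumPy dependency, at the cost of some speed on long clips.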
Model Performance
- Final Validation Loss: 0.4496
- Training Convergence: Validation loss decreased at every evaluation from step 300 onward and was still falling at step 1,000
- Phoneme Accuracy: Improved through custom EDA and token mapping optimization
- Speech Quality: Natural-sounding Telugu speech generation
Technical Details
Framework Versions
- Transformers: 4.47.0
- PyTorch: 2.5.1+cu121
- Datasets: 3.3.1
- Tokenizers: 0.21.0
Model Architecture
- Base Model: Microsoft SpeechT5 TTS
- Encoder: Multi-modal encoder handling both text and speech
- Decoder: Autoregressive decoder for speech generation
- Vocoder: Compatible with HiFi-GAN for high-quality audio output
Citation
If you use this model in your research, please cite:
@misc{speecht5_telugu_charan,
title={SpeechT5 Fine-tuned for Telugu Text-to-Speech},
author={Rama Charan Pisupati},
year={2025},
howpublished={\url{https://huggingface.co/your-username/speecht5_finetuned_telugu_charan}}
}
Acknowledgments
- Microsoft Research for the original SpeechT5 architecture
- IndicTTS team for providing the Telugu dataset
- Hugging Face for the transformers library and model hosting platform
Contact
For questions or issues related to this model, please contact:
- GitHub: Epik-Whale463