SpeechT5 Fine-tuned for Telugu Text-to-Speech

This model is a fine-tuned version of microsoft/speecht5_tts specifically adapted for Telugu text-to-speech synthesis. The model was trained on the IndicTTS Telugu dataset and includes custom preprocessing pipelines for handling Telugu phonemes and morphologically rich text structures.

Model Description

This Telugu TTS model is built upon Microsoft's SpeechT5 architecture, which combines the strengths of both speech and text processing through a unified encoder-decoder framework. The model has been specifically fine-tuned to handle Telugu language characteristics including:

Complex phoneme structures unique to Telugu
Morphologically rich text patterns
Custom transliteration and phoneme conversion pipelines
Optimized token mappings for Telugu script

The fine-tuning process involved exploratory data analysis (EDA) to identify and address phoneme imbalances in the Telugu dataset, with the goal of improving robustness and pronunciation accuracy.

Intended Uses & Limitations

Intended Uses

Telugu Text-to-Speech: Convert Telugu text to natural-sounding speech
Accessibility Applications: Assist visually impaired users with Telugu content
Educational Tools: Language learning applications for Telugu
Content Creation: Generate Telugu voiceovers for multimedia content
Research: Academic research in Indian language speech synthesis

Limitations

Language Scope: Optimized specifically for Telugu; may not perform well on other languages
Data Dependency: Performance quality depends on the diversity of the training dataset
Computational Requirements: Requires significant computational resources for inference
Accent Variations: May not capture all regional Telugu accent variations
Technical Text: May struggle with technical terms, foreign words, or mixed-language content

Training and Evaluation Data

The model was trained on the IndicTTS Telugu dataset containing 8,576 high-quality Telugu audio samples. The dataset preprocessing included:

Custom Transliteration Pipeline: Developed specifically for Telugu script to phoneme conversion
Phoneme Balancing: EDA-driven approach to address phoneme distribution imbalances
Token Mapping Optimization: Refined mappings between Telugu characters and model tokens
Quality Filtering: Ensured high-quality audio-text pairs for training

The dataset represents a diverse range of Telugu speakers and content types, providing a solid foundation for general-purpose Telugu TTS applications.
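
The preprocessing code itself is not published with this card. As a rough illustration only, the sketch below shows one way a character-level EDA could be done: it counts Telugu code points (Unicode block U+0C00 to U+0C7F) across a list of transcripts and reports which of them are absent from a tokenizer vocabulary. The function names, the sample transcripts, and the toy vocabulary are hypothetical and are not taken from the original training pipeline.

```python
from collections import Counter

TELUGU_START, TELUGU_END = 0x0C00, 0x0C7F  # Unicode block for Telugu script


def telugu_char_counts(transcripts):
    """Count every Telugu code point across the transcripts."""
    counts = Counter()
    for text in transcripts:
        counts.update(ch for ch in text if TELUGU_START <= ord(ch) <= TELUGU_END)
    return counts


def missing_from_vocab(counts, vocab):
    """Return the counted characters that the vocabulary lacks, sorted."""
    return sorted(ch for ch in counts if ch not in vocab)


if __name__ == "__main__":
    # Hypothetical sample transcripts, not drawn from IndicTTS
    transcripts = ["నమస్కారం", "తెలుగు భాష"]
    counts = telugu_char_counts(transcripts)

    # Rare characters are candidates for rebalancing or additional data
    for ch, n in counts.most_common():
        print(f"U+{ord(ch):04X} {n}")

    # Toy vocabulary (assumption); a real check would use the tokenizer's vocabulary
    toy_vocab = set("abcdefghijklmnopqrstuvwxyz ")
    missing = missing_from_vocab(counts, toy_vocab)
    print([f"U+{ord(ch):04X}" for ch in missing])
```

In a real run, the vocabulary could be taken from the processor's tokenizer (for example via `processor.tokenizer.get_vocab()`), and any characters reported as missing would need a transliteration or token mapping before training.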

Training Procedure

Training Hyperparameters

The model was trained with the following configuration optimized for Telugu language characteristics:

Learning Rate: 0.001
Training Batch Size: 4
Evaluation Batch Size: 2
Seed: 42
Gradient Accumulation Steps: 8
Total Training Batch Size: 32
Optimizer: AdamW with betas=(0.9, 0.999), epsilon=1e-08
Learning Rate Scheduler: Linear with 100 warmup steps
Total Training Steps: 1,000
Mixed Precision: Native AMP for efficient training
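
To make these numbers concrete, the following sketch recomputes the total training batch size and evaluates a linear warmup-then-decay learning-rate schedule at a few steps. It is plain Python and follows the usual linear-with-warmup rule (ramp from 0 to the peak rate over the warmup steps, then decay linearly to 0 at the final step). The helper names are illustrative, and the single-device setting is an assumption inferred from 4 x 8 = 32, not something the card states.

```python
PEAK_LR = 1e-3
WARMUP_STEPS = 100
TOTAL_STEPS = 1000
PER_DEVICE_BATCH = 4
GRAD_ACCUM_STEPS = 8
NUM_DEVICES = 1  # assumption: 4 * 8 = 32 matches the reported total only on one device


def effective_batch_size():
    """Samples contributing to each optimizer update."""
    return PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_DEVICES


def linear_lr(step):
    """Learning rate at a given optimizer step under linear warmup and linear decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / max(1, TOTAL_STEPS - WARMUP_STEPS)


if __name__ == "__main__":
    print("effective batch size:", effective_batch_size())
    for step in (0, 50, 100, 550, 1000):
        print(step, linear_lr(step))
```

Under these assumptions the rate peaks at 0.001 at step 100 and has fallen to half of that by step 550, so most of the 1,000 steps run below the nominal learning rate.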

Training Results

Validation loss fell from 0.6689 at step 100 to 0.4496 at step 1,000. After a temporary rise to 0.7610 at step 200, it decreased at every subsequent checkpoint:

Training Loss	Epoch	Step	Validation Loss
0.7785	0.4145	100	0.6689
0.8247	0.8290	200	0.7610
0.6961	1.2404	300	0.6406
0.6305	1.6549	400	0.5726
0.5784	2.0663	500	0.5422
0.5582	2.4808	600	0.5184
0.5399	2.8953	700	0.4992
0.5132	3.3067	800	0.4786
0.4903	3.7212	900	0.4617
0.4774	4.1326	1000	0.4496
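
As a quick sanity check on the table, the sketch below recomputes the relative drop in validation loss, lists any checkpoints where validation loss rose, and estimates the number of optimizer steps per epoch. The values are copied from the table; the variable names are illustrative, and the steps-per-epoch figure is only an estimate derived from the final row.

```python
# (step, epoch, training_loss, validation_loss), copied from the results table
rows = [
    (100, 0.4145, 0.7785, 0.6689),
    (200, 0.8290, 0.8247, 0.7610),
    (300, 1.2404, 0.6961, 0.6406),
    (400, 1.6549, 0.6305, 0.5726),
    (500, 2.0663, 0.5784, 0.5422),
    (600, 2.4808, 0.5582, 0.5184),
    (700, 2.8953, 0.5399, 0.4992),
    (800, 3.3067, 0.5132, 0.4786),
    (900, 3.7212, 0.4903, 0.4617),
    (1000, 4.1326, 0.4774, 0.4496),
]

val = [r[3] for r in rows]

# Relative reduction between the first and last checkpoints
rel_reduction = (val[0] - val[-1]) / val[0]

# Steps at which validation loss was higher than at the previous checkpoint
regressions = [rows[i][0] for i in range(1, len(rows)) if val[i] > val[i - 1]]

# Rough optimizer steps per epoch, from the final row
steps_per_epoch = rows[-1][0] / rows[-1][1]

if __name__ == "__main__":
    print(f"relative validation-loss reduction: {rel_reduction:.1%}")
    print("checkpoints with a rise in validation loss:", regressions)
    print(f"approx. steps per epoch: {steps_per_epoch:.1f}")
```

The roughly 242 steps per epoch, multiplied by the total batch size of 32, imply a training split of about 7,700 samples, which would be consistent with holding out part of the 8,576-sample dataset for evaluation; the card does not state the split, so treat this as an inference.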

Usage

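from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
import soundfile as sf  # extra dependency, used only to write the WAV file
import torch

# Load the model, processor, and HiFi-GAN vocoder
processor = SpeechT5Processor.from_pretrained("your-username/speecht5_finetuned_telugu_charan")
model = SpeechT5ForTextToSpeech.from_pretrained("your-username/speecht5_finetuned_telugu_charan")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Prepare Telugu text input
text = "మీ Telugu వాక్యం ఇక్కడ రాయండి"  # Write your Telugu sentence here
inputs = processor(text=text, return_tensors="pt")

# SpeechT5 conditions on a speaker x-vector of shape (1, 512).
# Placeholder (assumption): zeros run but do not reproduce the training voice;
# substitute an x-vector extracted from the target speaker.
speaker_embeddings = torch.zeros((1, 512))

# Generate speech; without a vocoder, generate_speech returns a mel spectrogram, not audio
with torch.no_grad():
    speech = model.generate_speech(
        inputs["input_ids"], speaker_embeddings=speaker_embeddings, vocoder=vocoder
    )

# Save the waveform (SpeechT5 produces 16 kHz audio)
sf.write("telugu_speech.wav", speech.numpy(), samplerate=16000)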

Model Performance

Final Validation Loss: 0.4496
Training Convergence: Validation loss decreased steadily from step 300 through the final checkpoint at step 1,000
Phoneme Accuracy: Improved through custom EDA and token mapping optimization
Speech Quality: Natural-sounding Telugu speech generation

Technical Details

Framework Versions

Transformers: 4.47.0
PyTorch: 2.5.1+cu121
Datasets: 3.3.1
Tokenizers: 0.21.0

Model Architecture

Base Model: Microsoft SpeechT5 TTS
Encoder: Multi-modal encoder handling both text and speech
Decoder: Autoregressive decoder for speech generation
Vocoder: Compatible with HiFi-GAN for high-quality audio output

Citation

If you use this model in your research, please cite:

@misc{speecht5_telugu_charan,
  title={SpeechT5 Fine-tuned for Telugu Text-to-Speech},
  author={Rama Charan Pisupati},
  year={2025},
  howpublished={\url{https://huggingface.co/your-username/speecht5_finetuned_telugu_charan}}
}

Acknowledgments

Microsoft Research for the original SpeechT5 architecture
IndicTTS team for providing the Telugu dataset
Hugging Face for the transformers library and model hosting platform

Contact

For questions or issues related to this model, please contact:

Email: [email protected]
GitHub: Epik-Whale463
