Whisper Mini ta - RAGUNATH RAVI

This model is a fine-tuned version of openai/whisper-small on the whisperaudio dataset. It achieves the following results on the evaluation set:

  • Loss: 0.0905
  • WER: 18.7042

Model description

This is a Whisper Small model fine-tuned specifically for Tamil language automatic speech recognition (ASR). The model has been adapted to recognize and transcribe Tamil speech with higher accuracy than the base OpenAI Whisper model.

Fine-tuning focused on adapting the model to Tamil phonetics, vocabulary, and speech patterns using a custom web-scraped dataset. Whisper is a Transformer-based encoder-decoder model: the encoder processes log-Mel spectrogram features of the audio, and the decoder generates the transcription tokens.

This model is intended for Tamil speech recognition applications and achieves a Word Error Rate (WER) of 18.70% on the 3,000-sample evaluation set. No WER for the unmodified base model on this data is reported, so the size of the improvement over the base model is not quantified.

Intended uses & limitations

Intended uses

  • Transcription of Tamil speech in audio and video content
  • Voice command systems for Tamil speakers
  • Accessibility tools for Tamil-speaking users
  • Documentation of Tamil audio content
  • Subtitling and captioning services for Tamil media

Limitations

  • The model may struggle with heavily accented Tamil speech or regional dialects that were not well-represented in the training data
  • Performance may degrade with noisy audio inputs or low-quality recordings
  • The model might have difficulty with specialized terminology or domain-specific language not present in the training data
  • The model is specifically trained for Tamil and will not perform well on other languages

Training and evaluation data

The model was fine-tuned on a custom web-scraped dataset called "whisperaudio" (available at ragunath123/whisperaudio on Hugging Face). This dataset consists of Tamil speech audio paired with accurate transcriptions.

For training, 12,000 samples from the dataset were used, and a further 3,000 samples were used for evaluation. The audio was resampled from 48 kHz to 16 kHz, the sampling rate Whisper expects.

The dataset includes a diverse range of Tamil speech samples, which helps the model generalize across different speakers, accents, and content types.

Training procedure

Preprocessing

  • Audio files were resampled from their original sampling rate to 16 kHz
  • Log-Mel spectrograms were extracted as input features using the Whisper feature extractor
  • Text was tokenized using the Whisper tokenizer configured specifically for the Tamil language
  • Special care was taken to handle the tokenization of Tamil characters correctly
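
The preprocessing steps above can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the author's actual script: the column names `audio` and `sentence` and both helper names are hypothetical, and `processor` is assumed to be a `WhisperProcessor` loaded with `language="tamil"` and `task="transcribe"`.

```python
TARGET_SAMPLING_RATE = 16_000  # Whisper expects 16 kHz input


def resample_to_16k(dataset, audio_column="audio"):
    """Have `datasets` decode and resample audio to 16 kHz on access.

    `audio_column` is an assumed column name; adjust it to the dataset schema.
    """
    # Imported lazily so the rest of this sketch runs without `datasets`.
    from datasets import Audio

    return dataset.cast_column(
        audio_column, Audio(sampling_rate=TARGET_SAMPLING_RATE)
    )


def prepare_example(example, processor, text_column="sentence"):
    """Turn one raw example into model inputs and labels.

    `processor` is assumed to be a WhisperProcessor, for example
    WhisperProcessor.from_pretrained(
        "openai/whisper-small", language="tamil", task="transcribe")
    `text_column` is a hypothetical name for the transcription column.
    """
    audio = example["audio"]
    # Log-Mel spectrogram features from the Whisper feature extractor.
    features = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    )
    example["input_features"] = features.input_features[0]
    # Token IDs of the Tamil transcription, used as training labels.
    example["labels"] = processor.tokenizer(example[text_column]).input_ids
    return example
```

With a real dataset, `prepare_example` would typically be applied through `dataset.map(...)` after the audio column has been cast to 16 kHz.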

Framework versions

  • Transformers 4.40.2
  • PyTorch 2.7.0+cu126
  • Datasets 3.5.1
  • Tokenizers 0.19.1

Training hyperparameters

The following hyperparameters were used during training:

  • Learning rate: 1e-05
  • Train batch size: 32
  • Evaluation batch size: 16
  • Seed: 42
  • Optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • Learning rate scheduler: Linear with warmup
  • Learning rate scheduler warmup steps: 500
  • Total training steps: 4000
  • Mixed precision training: Native AMP
  • Gradient accumulation steps: 1
  • Evaluation strategy: Steps (every 1000 steps)
  • Gradient checkpointing: Enabled
  • FP16 training: Enabled
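
For readers who want to reproduce a similar run, the hyperparameters above map approximately onto `Seq2SeqTrainingArguments` keyword arguments. This is a sketch, not the author's configuration file: the output path is hypothetical, the batch sizes are assumed to be per-device values on a single device, and the argument names follow Transformers 4.40 (later releases rename `evaluation_strategy` to `eval_strategy`).

```python
# Keyword arguments mirroring the hyperparameters listed above.
training_kwargs = {
    "output_dir": "./whisper-small-ta",   # hypothetical output path
    "per_device_train_batch_size": 32,    # assumes single-device training
    "per_device_eval_batch_size": 16,
    "gradient_accumulation_steps": 1,
    "learning_rate": 1e-5,
    "lr_scheduler_type": "linear",
    "warmup_steps": 500,
    "max_steps": 4000,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "seed": 42,
    "fp16": True,                         # native AMP; requires a CUDA device
    "gradient_checkpointing": True,
    "evaluation_strategy": "steps",
    "eval_steps": 1000,
}


def build_training_args(**overrides):
    """Create Seq2SeqTrainingArguments from the settings above."""
    # Imported lazily so the dictionary can be inspected without transformers.
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(**{**training_kwargs, **overrides})
```

On a machine without a GPU, `build_training_args(fp16=False)` avoids the mixed-precision requirement.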

Training results

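| Training Loss | Epoch   | Step | Validation Loss | WER (%) |
|---------------|---------|------|-----------------|---------|
| 0.0585        | 2.6667  | 1000 | 0.0872          | 20.2050 |
| 0.0123        | 5.3333  | 2000 | 0.0905          | 18.7042 |
| 0.0047        | 8.0     | 3000 | 0.1033          | 18.7719 |
| 0.0015        | 10.6667 | 4000 | 0.1116          | 18.8828 |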

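The model reached its lowest WER, 18.7042%, at step 2000 (epoch 5.33); this is the checkpoint whose results are reported at the top of this card. Validation loss was lowest at step 1000 (0.0872) and rose afterwards while training loss kept falling, which suggests the model began to overfit in the later steps.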
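
The epoch values reported in the training results follow from the step count, the train batch size, and the number of training samples. The short check below assumes single-device training, so that one optimizer step consumes exactly one batch of 32 samples.

```python
TRAIN_SAMPLES = 12_000   # training samples reported in this card
TRAIN_BATCH_SIZE = 32    # train batch size
GRAD_ACCUM_STEPS = 1     # gradient accumulation steps


def epochs_at_step(step):
    """Return how many passes over the training set `step` updates represent."""
    samples_seen = step * TRAIN_BATCH_SIZE * GRAD_ACCUM_STEPS
    return samples_seen / TRAIN_SAMPLES


for step in (1000, 2000, 3000, 4000):
    print(step, round(epochs_at_step(step), 4))
# 1000 2.6667
# 2000 5.3333
# 3000 8.0
# 4000 10.6667
```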

Model configuration

The model was configured specifically for Tamil language transcription:

  • Language set to "tamil"
  • Task set to "transcribe"
  • Forced decoder IDs were set to None to allow the model more flexibility in generation
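
One way these settings could be applied is sketched below. The helper name is hypothetical, and whether the author cleared the forced decoder IDs on `model.config`, on `model.generation_config`, or on both is not stated in this card; the sketch clears both as an assumption.

```python
def configure_for_tamil(model):
    """Apply the language, task, and decoder-ID settings described above.

    `model` is assumed to be a WhisperForConditionalGeneration instance,
    or any object exposing `config` and `generation_config` attributes.
    """
    model.generation_config.language = "tamil"
    model.generation_config.task = "transcribe"
    # Clear any forced decoder IDs, as described in this card.
    model.generation_config.forced_decoder_ids = None
    model.config.forced_decoder_ids = None
    return model
```

A typical call would be `configure_for_tamil(WhisperForConditionalGeneration.from_pretrained("openai/whisper-small"))` before training starts.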

Evaluation

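The model was evaluated using the Word Error Rate (WER) metric: the number of word substitutions, deletions, and insertions needed to turn the model output into the reference transcript, divided by the number of words in the reference. The final model achieved a WER of 18.70%, or roughly 19 word-level errors per 100 reference words. Because insertions also count as errors, this figure does not translate exactly into 81.3% of words being transcribed correctly.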

The evaluation was performed on a held-out test set of 3,000 samples from the whisperaudio dataset, ensuring a fair assessment of the model's performance on unseen data.
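
To make the metric concrete, here is a minimal pure-Python WER computation for a single reference/hypothesis pair, using word-level edit distance. It is an illustration of the standard definition only; the evaluation code actually used for this model is not shown in this card, and the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Return WER = (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("a b c d", "a x c d"))  # one substitution in four words
print(word_error_rate("a b", "a b c d e"))    # three insertions, two words
```

The second call returns a value above 1.0, which shows why WER is an error ratio rather than the complement of word accuracy. Multiplying by 100 gives the percentage form used in this card.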

Usage

from transformers import pipeline
import torch

# Load the model
asr = pipeline(
    "automatic-speech-recognition",
    model="ragunath-ravi/whisper-mini-ta",
    device=0 if torch.cuda.is_available() else "cpu"
)

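# Transcribe audio. Language and task are generation options, so they are
# passed through generate_kwargs rather than as direct pipeline arguments.
result = asr(
    "path_to_audio_file.wav",
    generate_kwargs={"language": "ta", "task": "transcribe"},
)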
print(result["text"])