Summary

titu_stt_bn_conformer_large is a Conformer-based model trained on ~4.4K hours of non-telephony open datasets.

Usage

This model can be used to transcribe Bangla audio, or as a pre-trained model for fine-tuning on custom datasets with the NeMo framework.

Installation

To install NeMo, see the NeMo documentation.

pip install -q 'nemo_toolkit[asr]'

Inference

Download test_bn_conformer.wav

# pip install -q 'nemo_toolkit[asr]'

import nemo.collections.asr as nemo_asr

# Load the pretrained model from the Hugging Face Hub
asr_model = nemo_asr.models.ASRModel.from_pretrained("hishab/titu_stt_bn_conformer_large")

audio_file = "test_bn_conformer.wav"
transcriptions = asr_model.transcribe([audio_file])

print(transcriptions)
# ['আজ সরকারি ছুটির দিন দেশের সব শিক্ষা প্রতিষ্ঠান সহ সরকারি আধাসরকারি স্বায়ত্তশাসিত প্রতিষ্ঠান ও ভবনে জাতীয় পতাকা অর্ধনমিত ও কালো পতাকা উত্তোলন করা হয়েছে']
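NeMo ASR models typically expect 16 kHz mono WAV input. As a minimal sketch (the helper name `check_wav` is our own, not part of NeMo), you can inspect a file with the standard library before transcribing and resample it first if it does not match:

```python
import wave

def check_wav(path, expected_rate=16000):
    """Return (sample_rate, channels, ok) for a WAV file.

    ok is True when the file is mono at the expected rate, i.e. it can be
    fed to the model as-is without resampling or downmixing.
    """
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
    ok = (rate == expected_rate and channels == 1)
    return rate, channels, ok
```

If `ok` is False, convert the file (e.g. with ffmpeg or librosa) before calling `transcribe`.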

Training Datasets

| Dataset | Hours |
| --- | --- |
| Google - OpenSLR 37 (BN_bd + BN_in) | 4.33 |
| Google - OpenSLR 53 | 214.90 |
| MadASR - 23 (Train + Dev) | 853.79 |
| Shrutilipi | 440.33 |
| Kaggle Comp. Dataset (Macro) (Train + Validation) | 1180.33 |
| IndicSUPERB - Kathbath (Train + Validation) | 107.55 |
| Mozilla - CV (Train + Dev + Validated + Invalidated + Other) | 1315.72 |
| Google - Fleurs (Train + Dev) | 10.42 |
| Vasha-Bichita | 62.46 |
| SUBAK.KO (Train + Validation) | 216.61 |
| IndicTTS (Train + Validation) | 20.08 |
| Total | 4426.53 |

Training Details

For training, we selected ~4.4K hours of data from the aforementioned datasets. We took nvidia/stt_en_conformer_ctc_large as the base model and fine-tuned it on 4x H100 GPUs. Additional training parameters:

epochs = 100
batch_size = 64
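Fine-tuning with NeMo expects the training data as a JSON-lines manifest, where each line is an object with `audio_filepath`, `duration`, and `text` fields. A minimal sketch of building such a manifest (the file paths, durations, and transcripts below are placeholders, not the actual training data):

```python
import json

# Hypothetical (audio path, duration in seconds, transcript) records
records = [
    ("clips/utt_0001.wav", 3.2, "প্রথম বাক্য"),
    ("clips/utt_0002.wav", 4.7, "দ্বিতীয় বাক্য"),
]

# NeMo manifests are JSON-lines: one JSON object per line
with open("train_manifest.json", "w", encoding="utf-8") as f:
    for path, dur, text in records:
        entry = {"audio_filepath": path, "duration": dur, "text": text}
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```

The resulting manifest path is then passed to the model's train/validation data config before calling the trainer.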

Evaluation

For evaluation, we used the test sets (with transcripts) of the aforementioned datasets; details are below. For each dataset we report WER/CER in percent. We evaluated our ASR model on 9 different datasets and obtained an average WER of 7.9%.

Columns are test sets only, with duration in hours; cells are WER (or WER/CER) in percent, and "-" means not evaluated.

| Model | Google - Fleurs (2.6) | IndicTTS (0.15) | Kathbath clean (5.01) | Kathbath noisy (5.01) | Kathbath combined (15.48) | Mozilla - Common Voice (16.57) | SUBAK.KO (19.97) | All data combined (54.79) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Azure STT | 24.3 | 15.2 | 13.6 | 15.1 | - | 14.6 | - | - |
| Facebook MMS | 34.4 | - | - | - | - | 48 | - | - |
| Google STT | 19.4 | 18.3 | 14.3 | 16.7 | - | 20.8 | - | - |
| IndicWav2vec | 18.3 | 15 | 12.2 | 16.2 | - | 20.2 | - | - |
| IndicWhisper | 11.4 | 7.6 | 10.3 | 12 | - | 15 | - | - |
| ODD-Speech | 29.5 | - | - | - | - | 23.6 | - | - |
| OpenAI Whisper Large v3 | 50 | - | - | - | - | 40.3 | - | - |
| titu_stt_bn_fastconformer | 33.53/6.84 | 30.94/6.12 | 27.05/5.83 | 31.87/7.52 | 29.27/6.61 | 42.7/11.44 | 24.25/6.79 | 30.72/7.94 |
| titu_stt_bn_fastconformer_large_od | 10.97/2.08 | 10.48/1.63 | 9.39/1.71 | 11.37/2.29 | 10.3/1.98 | 6.21/1.1 | 6.84/1.81 | 7.81/1.68 |
| titu_stt_bn_conformer_large | 10.46/1.86 | 8.74/1.54 | 8.95/1.62 | 10.89/2.2 | 9.84/1.89 | 6.53/1.16 | 7.21/2.01 | 7.9/1.75 |
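WER here is the standard word-level edit distance divided by the reference length. A minimal sketch of how such a score can be computed (this is a generic implementation, not the exact evaluation script used for the numbers above):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

CER is computed the same way at the character level instead of the word level.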