## Summary
`titu_stt_bn_conformer_large` is a Conformer-based model trained on ~4.4K hours of non-telephony open datasets.
## Usage
This model can be used to transcribe Bangla audio, or as a pre-trained checkpoint for fine-tuning on custom datasets with the NeMo framework.
## Installation
To install NeMo, see the NeMo documentation.

```shell
pip install -q 'nemo_toolkit[asr]'
```
## Inferencing
Download `test_bn_conformer.wav`.

```python
# pip install -q 'nemo_toolkit[asr]'
import nemo.collections.asr as nemo_asr

# Load the pretrained model from the Hugging Face Hub
asr_model = nemo_asr.models.ASRModel.from_pretrained("hishab/titu_stt_bn_conformer_large")

audio_file = "test_bn_conformer.wav"
transcriptions = asr_model.transcribe([audio_file])
print(transcriptions)
# ['আজ সরকারি ছুটির দিন দেশের সব শিক্ষা প্রতিষ্ঠান সহ সরকারি আধাসরকারি স্বায়ত্তশাসিত প্রতিষ্ঠান ও ভবনে জাতীয় পতাকা অর্ধনমিত ও কালো পতাকা উত্তোলন করা হয়েছে']
```
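For fine-tuning, NeMo expects training data as a JSON-lines manifest: one JSON object per utterance with `audio_filepath`, `duration`, and `text` fields. A minimal sketch of building such a manifest follows; the file names and transcripts are placeholders, not part of this release:

```python
import json

# Hypothetical utterances: (audio path, duration in seconds, transcript).
# Replace these with your own Bangla audio clips and reference texts.
utterances = [
    ("clips/utt_0001.wav", 3.42, "transcript one"),
    ("clips/utt_0002.wav", 5.10, "transcript two"),
]

# NeMo manifests are JSON Lines: one JSON object per line
with open("train_manifest.json", "w", encoding="utf-8") as f:
    for path, dur, text in utterances:
        entry = {"audio_filepath": path, "duration": dur, "text": text}
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```

The manifest path can then be supplied to the model's training-data config (e.g. via `asr_model.setup_training_data(...)`) before fitting with a PyTorch Lightning trainer, as described in the NeMo ASR documentation.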
## Training Datasets

| Dataset | Hours |
|---|---|
| Google - OpenSLR 37 (BN_bd + BN_in) | 4.33 |
| Google - OpenSLR 53 | 214.9 |
| MadASR-23 (Train + Dev) | 853.79 |
| Shrutilipi | 440.33 |
| Kaggle Comp. Dataset (Macro) (Train + Validation) | 1180.33 |
| IndicSUPERB - Kathbath (Train + Validation) | 107.55 |
| Mozilla - Common Voice (Train + Dev + Validated + Invalidated + Other) | 1315.72 |
| Google - FLEURS (Train + Dev) | 10.42 |
| Vasha-Bichita | 62.46 |
| SUBAK.KO (Train + Validation) | 216.61 |
| IndicTTS (Train + Validation) | 20.08 |
| Total | 4426.53 |
## Training Details
For training, we selected ~4.4K hours of data from the aforementioned datasets. We took nvidia/stt_en_conformer_ctc_large as the base model and fine-tuned it on 4x H100 GPUs. Additional training parameters:

```
epochs = 100
batch_size = 64
```
## Evaluation
For evaluation, we used the test sets with available transcripts from the aforementioned datasets; details are below. For each dataset we report WER/CER in percent. We evaluated our ASR model on 9 different datasets and obtained an average WER of 7.9%. In the table, entries of the form `a/b` are WER/CER; single entries are WER only.
| Dataset [test set only] (duration in hours) | Google - FLEURS (2.6) | IndicTTS (0.15) | Kathbath clean (5.01) | Kathbath noisy (5.01) | Kathbath combined (15.48) | Mozilla - Common Voice (16.57) | SUBAK.KO (19.97) | All data combined (54.79) |
|---|---|---|---|---|---|---|---|---|
| Azure STT | 24.3 | 15.2 | 13.6 | 15.1 | - | 14.6 | - | - |
| Facebook MMS | 34.4 | - | - | - | - | 48 | - | - |
| Google STT | 19.4 | 18.3 | 14.3 | 16.7 | - | 20.8 | - | - |
| IndicWav2vec | 18.3 | 15 | 12.2 | 16.2 | - | 20.2 | - | - |
| IndicWhisper | 11.4 | 7.6 | 10.3 | 12 | - | 15 | - | - |
| ODD-Speech | 29.5 | - | - | - | - | 23.6 | - | - |
| OpenAI Whisper Large v3 | 50 | - | - | - | - | 40.3 | - | - |
| titu_stt_bn_fastconformer | 33.53/6.84 | 30.94/6.12 | 27.05/5.83 | 31.87/7.52 | 29.27/6.61 | 42.7/11.44 | 24.25/6.79 | 30.72/7.94 |
| titu_stt_bn_fastconformer_large_od | 10.97/2.08 | 10.48/1.63 | 9.39/1.71 | 11.37/2.29 | 10.3/1.98 | 6.21/1.1 | 6.84/1.81 | 7.81/1.68 |
| titu_stt_bn_conformer_large | 10.46/1.86 | 8.74/1.54 | 8.95/1.62 | 10.89/2.2 | 9.84/1.89 | 6.53/1.16 | 7.21/2.01 | 7.9/1.75 |
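For reference, WER and CER are both Levenshtein (edit) distance normalized by reference length, computed over words and characters respectively. The card does not state which scoring tool was used, so the self-contained sketch below is just one way to compute these metrics:

```python
def edit_distance(ref, hyp):
    """Classic dynamic-programming Levenshtein distance between two sequences."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[-1][-1]

def wer(reference, hypothesis):
    """Word error rate in percent: word-level edit distance / reference word count."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate in percent, here ignoring spaces."""
    ref_chars = reference.replace(" ", "")
    return 100.0 * edit_distance(ref_chars, hypothesis.replace(" ", "")) / len(ref_chars)

# 1 substitution + 1 deleted word over 6 reference words -> WER 33.33%
print(round(wer("the cat sat on the mat", "the cat sit on mat"), 2))  # 33.33
print(round(cer("the cat sat on the mat", "the cat sit on mat"), 2))  # 23.53
```

The same functions work on Bangla strings, since the distance is computed over Unicode tokens and characters.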