## Summary
`titu_stt_bn_conformer_large` is a Conformer-based model trained on ~4.4K hours of non-telephony open datasets.
## Usage
This model can be used to transcribe Bangla audio, or as a pre-trained checkpoint for fine-tuning on custom datasets with the NeMo framework.
## Installation
To install NeMo, see the NeMo documentation.

```shell
pip install -q 'nemo_toolkit[asr]'
```
## Inferencing
Download `test_bn_conformer.wav`.

```python
# pip install -q 'nemo_toolkit[asr]'
import nemo.collections.asr as nemo_asr

# Load the pretrained model from the Hugging Face Hub
asr_model = nemo_asr.models.ASRModel.from_pretrained("hishab/titu_stt_bn_conformer_large")

audio_file = "test_bn_conformer.wav"
transcriptions = asr_model.transcribe([audio_file])
print(transcriptions)
# ['আজ সরকারি ছুটির দিন দেশের সব শিক্ষা প্রতিষ্ঠান সহ সরকারি আধাসরকারি স্বায়ত্তশাসিত প্রতিষ্ঠান ও ভবনে জাতীয় পতাকা অর্ধনমিত ও কালো পতাকা উত্তোলন করা হয়েছে']
```
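For fine-tuning, NeMo expects training data as a JSON-lines manifest: one JSON object per utterance with `audio_filepath`, `duration`, and `text` fields. A minimal sketch of building such a manifest follows; the file names and transcripts are placeholders, not part of this release:

```python
import json

# Hypothetical utterances: (audio path, duration in seconds, transcript).
# Replace these with your own Bangla audio clips and reference texts.
utterances = [
    ("clips/utt_0001.wav", 3.42, "transcript one"),
    ("clips/utt_0002.wav", 5.10, "transcript two"),
]

# NeMo manifests are JSON Lines: one JSON object per line
with open("train_manifest.json", "w", encoding="utf-8") as f:
    for path, dur, text in utterances:
        entry = {"audio_filepath": path, "duration": dur, "text": text}
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```

The manifest path can then be supplied to the model's training-data config (e.g. via `asr_model.setup_training_data(...)`) before fitting with a PyTorch Lightning trainer, as described in the NeMo ASR documentation.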
## Training Datasets

| Dataset | Hours |
|---|---|
| Google - OpenSLR 37 (BN_bd + BN_in) | 4.33 |
| Google - OpenSLR 53 | 214.9 |
| MadASR-23 (Train + Dev) | 853.79 |
| Shrutilipi | 440.33 |
| Kaggle Comp. Dataset (Macro) (Train + Validation) | 1180.33 |
| IndicSUPERB - Kathbath (Train + Validation) | 107.55 |
| Mozilla - Common Voice (Train + Dev + Validated + Invalidated + Other) | 1315.72 |
| Google - FLEURS (Train + Dev) | 10.42 |
| Vasha-Bichita | 62.46 |
| SUBAK.KO (Train + Validation) | 216.61 |
| IndicTTS (Train + Validation) | 20.08 |
| Total | 4426.53 |
## Training Details
For training, we selected ~4.4K hours of data from the aforementioned datasets. We took nvidia/stt_en_conformer_ctc_large as the base model and fine-tuned it on 4x H100 GPUs. Additional training parameters:

```
epochs = 100
batch_size = 64
```
## Evaluation
For evaluation, we used the test sets with available transcripts from the aforementioned datasets; details are below. For each dataset we report WER/CER in percent. We evaluated our ASR model on 9 different datasets and obtained an average WER of 7.9%. In the table, entries of the form `a/b` are WER/CER; single entries are WER only.
| Dataset [test set only] (duration in hours) | Google - FLEURS (2.6) | IndicTTS (0.15) | Kathbath clean (5.01) | Kathbath noisy (5.01) | Kathbath combined (15.48) | Mozilla - Common Voice (16.57) | SUBAK.KO (19.97) | All data combined (54.79) |
|---|---|---|---|---|---|---|---|---|
| Azure STT | 24.3 | 15.2 | 13.6 | 15.1 | - | 14.6 | - | - |
| Facebook MMS | 34.4 | - | - | - | - | 48 | - | - |
| Google STT | 19.4 | 18.3 | 14.3 | 16.7 | - | 20.8 | - | - |
| IndicWav2vec | 18.3 | 15 | 12.2 | 16.2 | - | 20.2 | - | - |
| IndicWhisper | 11.4 | 7.6 | 10.3 | 12 | - | 15 | - | - |
| ODD-Speech | 29.5 | - | - | - | - | 23.6 | - | - |
| OpenAI Whisper Large v3 | 50 | - | - | - | - | 40.3 | - | - |
| titu_stt_bn_fastconformer | 33.53/6.84 | 30.94/6.12 | 27.05/5.83 | 31.87/7.52 | 29.27/6.61 | 42.7/11.44 | 24.25/6.79 | 30.72/7.94 |
| titu_stt_bn_fastconformer_large_od | 10.97/2.08 | 10.48/1.63 | 9.39/1.71 | 11.37/2.29 | 10.3/1.98 | 6.21/1.1 | 6.84/1.81 | 7.81/1.68 |
| titu_stt_bn_conformer_large | 10.46/1.86 | 8.74/1.54 | 8.95/1.62 | 10.89/2.2 | 9.84/1.89 | 6.53/1.16 | 7.21/2.01 | 7.9/1.75 |
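For reference, WER and CER are both Levenshtein (edit) distance normalized by reference length, computed over words and characters respectively. The card does not state which scoring tool was used, so the self-contained sketch below is just one way to compute these metrics:

```python
def edit_distance(ref, hyp):
    """Classic dynamic-programming Levenshtein distance between two sequences."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[-1][-1]

def wer(reference, hypothesis):
    """Word error rate in percent: word-level edit distance / reference word count."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate in percent, here ignoring spaces."""
    ref_chars = reference.replace(" ", "")
    return 100.0 * edit_distance(ref_chars, hypothesis.replace(" ", "")) / len(ref_chars)

# 1 substitution + 1 deleted word over 6 reference words -> WER 33.33%
print(round(wer("the cat sat on the mat", "the cat sit on mat"), 2))  # 33.33
print(round(cer("the cat sat on the mat", "the cat sit on mat"), 2))  # 23.53
```

The same functions work on Bangla strings, since the distance is computed over Unicode tokens and characters.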