---
license: apache-2.0
language:
- bn
metrics:
- wer
- cer
base_model:
- ai4bharat/indicwav2vec_v1_bengali
pipeline_tag: automatic-speech-recognition
---

<div align="center">
<h1>BRDialect

BanglaTalk: Towards Real-Time Speech Assistance for Bengali Regional Dialects </h1>
<a href="https://arxiv.org/abs/2510.06188"><b>Paper</b></a>, <a href="https://github.com/Jak57/BanglaTalk"><b>GitHub</b></a>
</div>

**BRDialect** is an automatic speech recognition (ASR) system trained on ten regional dialects of Bangladesh using the <a href="https://www.kaggle.com/competitions/ben10">Ben10</a> dataset from Bengali.AI.

## Load the BRDialect ASR System

**Prerequisite**<br>
```
# Notebook cell syntax; drop the leading "!" when running in a regular shell.
!pip install -U transformers
!pip install https://github.com/kpu/kenlm/archive/master.zip
!pip install pyctcdecode
```

**Log in to Hugging Face**<br>
```python
from huggingface_hub import login
login("TOKEN")
```
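
Hard-coding the access token in a notebook makes it easy to leak. One alternative, shown here only as a sketch, is to read the token from an environment variable. The helper name `read_hf_token` is hypothetical and not part of `huggingface_hub`; `HF_TOKEN` is assumed to have been exported beforehand.

```python
import os


def read_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the access token stored in `env_var`, or raise if it is missing.

    `read_hf_token` is an illustrative helper, not a huggingface_hub function.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return token
```

With this helper in place, `login(token=read_hf_token())` can replace the `login("TOKEN")` call, so the token never appears in the notebook itself.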

**Load base model and BRDialect**<br>
```python
# Download the BRDialect language model and fine-tuned weights
from huggingface_hub import hf_hub_download

kenlm_model_path = hf_hub_download(repo_id="Jakir057/BRDialect", filename="BRDialect/5gram_kenlm.arpa")
state_dict_path = hf_hub_download(repo_id="Jakir057/BRDialect", filename="BRDialect/wav2vec2_bangla_regional_dialect.pth")
```

```python
from transformers import AutoProcessor, AutoModelForCTC, Wav2Vec2ProcessorWithLM
import torch
import numpy as np
import pyctcdecode
import librosa

# Load the base model, then overwrite its weights with the BRDialect checkpoint
base_model_id = "ai4bharat/indicwav2vec_v1_bengali"
processor = AutoProcessor.from_pretrained(base_model_id)
model = AutoModelForCTC.from_pretrained(base_model_id)
# map_location="cpu" lets the checkpoint load on machines without a GPU
model.load_state_dict(torch.load(state_dict_path, map_location="cpu")["model"])

# Build a beam-search decoder backed by the 5-gram KenLM model.
# Labels are sorted by token id so they match the order of the logit columns.
vocab_dict = processor.tokenizer.get_vocab()
sorted_vocab_dict = {k: v for k, v in sorted(vocab_dict.items(), key=lambda item: item[1])}
decoder = pyctcdecode.build_ctcdecoder(
    list(sorted_vocab_dict.keys()),
    str(kenlm_model_path)
)
processor_with_lm = Wav2Vec2ProcessorWithLM(
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    decoder=decoder
)
model.freeze_feature_encoder()
model.eval()
```
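
The decoder's label list has to line up with the columns of the model's logits, which is why the loading code sorts the vocabulary by token id before passing it to `build_ctcdecoder`. A minimal sketch of that ordering step, using a made-up toy vocabulary (the tokens and ids are illustrative assumptions, not the real tokenizer contents):

```python
# Toy stand-in for processor.tokenizer.get_vocab(); tokens and ids are made up.
toy_vocab = {"ক": 3, "<pad>": 0, "|": 2, "<unk>": 1, "খ": 4}


def labels_in_index_order(vocab: dict[str, int]) -> list[str]:
    """Return tokens sorted by integer id, so that position i holds token i."""
    return [token for token, _ in sorted(vocab.items(), key=lambda item: item[1])]


labels = labels_in_index_order(toy_vocab)
print(labels)  # ['<pad>', '<unk>', '|', 'ক', 'খ']
```

If the labels were passed in dictionary order instead, column `i` of the logits could be decoded as the wrong character.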

## Transcription Generation
```python
# Load the audio as 16 kHz mono, the rate passed to the processor below
sampling_rate = 16000
path = "AUDIO_PATH"
frame, sr = librosa.load(path, sr=sampling_rate, mono=True)

inputs = processor(
    frame,
    sampling_rate=sampling_rate,
    return_tensors="pt",
    padding=False
)

# Run the acoustic model without tracking gradients
with torch.no_grad():
    logits = model(inputs.input_values.to("cpu")).logits

# Decode the frame-level logits with LM-backed beam search
np_logits = logits.squeeze(0).cpu().numpy()
result = processor_with_lm.decode(np_logits, beam_width=256)
text = result.text
print(f"Transcription={text}")
```
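
The model card lists WER and CER as its metrics. To score a transcription against a reference, both can be computed from the Levenshtein edit distance. The functions below are a self-contained sketch, not the evaluation code used by the authors; `edit_distance`, `wer`, and `cer` are illustrative names, and CER is counted over Unicode code points, which can differ from a grapheme-based count for Bengali text.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate over Unicode code points (not grapheme clusters)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Example with one substituted word out of three
print(wer("আমি ভাত খাই", "আমি ভাত খাব"))
print(cer("আমি ভাত খাই", "আমি ভাত খাব"))
```

Here `text` from the transcription step would be passed as the hypothesis, with a human transcript as the reference.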

## Citation

```
@article{hasan2025banglatalk,
  title={BanglaTalk: Towards Real-Time Speech Assistance for Bengali Regional Dialects},
  author={Hasan, Jakir and Dipta, Shubhashis Roy},
  journal={arXiv preprint arXiv:2510.06188},
  year={2025}
}

@inproceedings{javed2022towards,
  title={Towards Building ASR Systems for the Next Billion Users},
  author={Javed, Tahir and Doddapaneni, Sumanth and Raman, Abhigyan and Bhogale, Kaushal Santosh and Ramesh, Gowtham and Kunchukuttan, Anoop and Kumar, Pratyush and Khapra, Mitesh M},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={36},
  number={10},
  pages={10813--10821},
  year={2022}
}
```