language:
- bm
library_name: nemo
datasets:
- RobotsMali/kunkado
- RobotsMali/bam-asr-early
thumbnail: null
tags:
- automatic-speech-recognition
- speech
- audio
- Transducer
- TDT
- FastConformer
- Conformer
- pytorch
- Bambara
- NeMo
license: cc-by-4.0
base_model: nvidia/parakeet-ctc-0.6b
model-index:
- name: soloba-ctc-0.6b-v0
results:
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: bam-asr-early
type: RobotsMali/bam-asr-early
split: test
args:
language: bm
metrics:
- name: Test WER
type: wer
value: 35.15760898590088
metrics:
- wer
pipeline_tag: automatic-speech-recognition
Soloni TDT-CTC 114M Bambara
soloba-ctc-0.6b-v0 is a fine-tuned version of nvidia/parakeet-ctc-0.6b on RobotsMali/kunkado and RobotsMali/bam-asr-early. The model produces capitalization but not punctuation. It was fine-tuned using NVIDIA NeMo.
The model does not tag code-switched expressions in its transcriptions: for training, we treated them as a modern variant of the Bambara language and removed all tags and markup.
Important Note
This model, along with its associated resources, is part of an ongoing research effort. Improvements and refinements are expected in future versions, and a human evaluation report of the model is coming soon. Users should be aware that:
- The model may not generalize well across all speaking conditions and dialects.
- Community feedback is welcome, and contributions are encouraged to refine the model further.
NVIDIA NeMo: Training
To fine-tune or experiment with the model, you need to install NVIDIA NeMo. We recommend installing it after you have installed the latest version of PyTorch.
pip install "nemo_toolkit[asr]"
How to Use This Model
Note that this model has been released primarily for research purposes.
Load Model with NeMo
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="RobotsMali/soloba-ctc-0.6b-v0")
Transcribe Audio
asr_model.eval()
# Assuming you have a test audio file named sample_audio.wav
asr_model.transcribe(['sample_audio.wav'])
Input
This model accepts mono-channel audio (WAV files) as input and resamples it to a 16 kHz sample rate before performing the forward pass.
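Because the input is expected to be mono, it can help to inspect a file before transcribing it. The following is a minimal sketch using only Python's standard `wave` module, which reads uncompressed PCM WAV files only. The helper names `describe_wav`, `is_mono`, and `needs_resampling` are hypothetical and are not part of NeMo.

```python
import wave


def describe_wav(path):
    """Return (channels, sample_rate_hz, duration_s) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        duration = wav.getnframes() / sample_rate
    return channels, sample_rate, duration


def is_mono(path):
    """True if the file has exactly one audio channel."""
    return describe_wav(path)[0] == 1


def needs_resampling(path, target_rate=16000):
    """True if the file's sample rate differs from the 16 kHz target."""
    return describe_wav(path)[1] != target_rate
```

A file for which `is_mono` returns `False` would need to be down-mixed to one channel first. A file for which `needs_resampling` returns `True` is still acceptable, since the card states that resampling happens before the forward pass.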
Output
This model provides the transcribed speech as a string for a given speech sample. Under nemo>=2.3, the result is returned as a Hypothesis object.
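Since the return type depends on the NeMo version, downstream code may want to normalize it. Below is a small sketch, assuming that each result is either a plain string or an object that exposes the transcript through a `.text` attribute. The helper name `hypothesis_text` is hypothetical.

```python
def hypothesis_text(result):
    """Return the transcript from a single transcription result.

    Assumption: `result` is either a plain string or an object whose
    `.text` attribute holds the transcript.
    """
    if isinstance(result, str):
        return result
    return result.text


def all_texts(results):
    """Normalize a list of transcription results to a list of strings."""
    return [hypothesis_text(r) for r in results]
```

With the loading code from this card, usage would look like `all_texts(asr_model.transcribe(['sample_audio.wav']))`.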
Model Architecture
This model uses a FastConformer encoder and a CTC decoder. FastConformer is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling. You may find more information on the details of FastConformer here: Fast-Conformer Model. The decoder is a convolutional neural network trained with the Connectionist Temporal Classification (CTC) loss.
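To illustrate what a CTC decoder's output means, here is a sketch of standard greedy CTC decoding: take the most likely token per encoder frame, merge consecutive repeats, then drop the blank symbol. This is a generic illustration of the technique, not the decoding code NeMo uses, and the token ids and `blank_id` value in the usage example are made up.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, blank_id):
    """Collapse per-frame token ids into an output token sequence.

    frame_ids: the highest-scoring token id for each encoder frame.
    blank_id:  the id reserved for the CTC blank symbol.
    """
    # Step 1: merge runs of identical consecutive ids into one id.
    collapsed = [token for token, _ in groupby(frame_ids)]
    # Step 2: remove blanks, which only separate repeated tokens.
    return [token for token in collapsed if token != blank_id]
```

For example, `ctc_greedy_decode([5, 5, 0, 5, 3, 3, 0, 0], blank_id=0)` yields `[5, 5, 3]`: the blank between the two runs of `5` is what allows the same token to be emitted twice in a row.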
Training
The NeMo toolkit (version 2.3.0) was used to fine-tune this model for 183,086 steps, starting from the nvidia/parakeet-ctc-0.6b model. This version is trained with this base config. The full training configurations, scripts, and experimental logs are available here:
The tokenizers for these models were built using the text transcripts of the train set with this script.
Dataset
This model was fine-tuned on the semi-labelled subset of the kunkado dataset, which consists of ~120 hours of automatically annotated Bambara speech data, and on the bam-asr-early dataset.
Performance
We report the Word Error Rate (WER), in percent, on the test set of bam-asr-early.
Decoder (Version) | Tokenizer | Vocabulary Size | WER (%) on bam-asr-early test |
---|---|---|---|
v0 | BPE | 512 | 35.16 |
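For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) between the hypothesis and the reference, divided by the number of reference words. The sketch below computes it for a single reference/hypothesis pair. It is an illustration only: the card does not state which tool, text normalization, or aggregation across utterances produced the reported figure.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

For instance, `word_error_rate("a b c d", "a x c")` returns `0.5`: one substitution and one deletion against four reference words. Multiplying by 100 gives the percentage form used in the table.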
License
This model is released under the CC-BY-4.0 license. By using this model, you agree to the terms of the license.
Feel free to open a discussion on Hugging Face or file an issue on GitHub if you would like to contribute.