QuartzNet 15x5 CTC Bambara


stt-bm-quartznet15x5-v1 is a fine-tuned version of RobotsMali/stt-bm-quartznet15x5-V0 on RobotsMali/kunkado. This model does not produce punctuation or capitalization; it uses a character encoding scheme and transcribes text in the standard character set provided in its training dataset.

This is the smallest of a series of models we are developing to transcribe modern Bamako Bambara. The model does not tag code-switched expressions in its transcriptions: for training, we decided to treat them as part of a modern variant of the Bambara language, removing all tags and markers. The model was fine-tuned using NVIDIA NeMo and is trained with CTC (Connectionist Temporal Classification) loss.

🚨 Important Note

This model, along with its associated resources, is part of an ongoing research effort; improvements and refinements are expected in future versions. A human evaluation report of the model is coming soon. Users should be aware that:

  • The model may not generalize well across all speaking conditions and dialects.
  • Community feedback is welcome, and contributions are encouraged to refine the model further.

NVIDIA NeMo: Training

To fine-tune or use the model, install NVIDIA NeMo. We recommend installing it after setting up the latest PyTorch version.

pip install nemo_toolkit['asr']

How to Use This Model

Load Model with NeMo

import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="RobotsMali/stt-bm-quartznet15x5-v1")

Transcribe Audio

asr_model.eval()
# Assuming you have a test audio file named sample_audio.wav
output = asr_model.transcribe(['sample_audio.wav'])
# transcribe() returns one result per input file; under nemo>=2.3 each is a Hypothesis
print(output[0].text)

Input

This model accepts mono-channel audio (WAV files) as input and resamples it to a 16 kHz sample rate before performing the forward pass.
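To illustrate the expected input format, here is a stdlib-only sketch that writes a compliant mono WAV file (the file name sample_audio.wav matches the transcription example above; the one-second 440 Hz tone is purely illustrative, not real speech):

```python
import math
import struct
import wave

# Write a one-second synthetic mono 16 kHz, 16-bit PCM WAV file.
# (Any mono WAV works as input; the audio is resampled to 16 kHz before the forward pass.)
SR = 16000
with wave.open("sample_audio.wav", "wb") as w:
    w.setnchannels(1)   # mono, as the model expects
    w.setsampwidth(2)   # 16-bit samples
    w.setframerate(SR)
    samples = (
        int(32767 * 0.1 * math.sin(2 * math.pi * 440 * t / SR))
        for t in range(SR)  # one second of a 440 Hz tone
    )
    w.writeframes(b"".join(struct.pack("<h", s) for s in samples))

# Sanity-check the file before handing it to asr_model.transcribe(...)
with wave.open("sample_audio.wav", "rb") as w:
    print(w.getnchannels(), w.getframerate())  # prints: 1 16000
```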

Output

This model provides transcribed speech as a string for a given speech sample, returned as a Hypothesis object (under nemo>=2.3).

Model Architecture

QuartzNet is a convolutional architecture, which consists of 1D time-channel separable convolutions optimized for speech recognition. More information on QuartzNet can be found here: QuartzNet Model.

Training

The NeMo toolkit (version 2.3.0) was used to fine-tune this model for 64,300 steps from the RobotsMali/stt-bm-quartznet15x5-V0 checkpoint. This model was trained with this base config. The full training configurations, scripts, and experimental logs are available here:

🔗 Bambara-ASR Experiments

Dataset

This model was fine-tuned on the human-reviewed subset of the kunkado dataset, which consists of ~40 hours of transcribed Bambara speech. The text was normalized with the bambara-normalizer prior to training: normalizing numbers, removing punctuation, removing tags, and converting to lower case.
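The actual preprocessing is done by the bambara-normalizer package; as an illustration only (not that package's API), a rough sketch of the same idea, covering tag removal, punctuation stripping, and lower-casing (number normalization omitted), could look like:

```python
import re

def normalize(text: str) -> str:
    """Illustrative text normalization: strip tags and punctuation, lower-case."""
    text = re.sub(r"<[^>]+>", " ", text)   # drop tags such as code-switch markers
    text = re.sub(r"[^\w\s']", " ", text)  # strip punctuation, keep letters/apostrophes
    text = re.sub(r"\s+", " ", text)       # collapse repeated whitespace
    return text.strip().lower()

# A made-up example mixing Bambara with a tagged French word:
print(normalize("A bɛ <fr>bonjour</fr> fɔ!"))  # prints: a bɛ bonjour fɔ
```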

Performance

The performance of Automatic Speech Recognition models is measured using Word Error Rate (WER%).

| Version | Tokenizer      | Vocabulary Size | bam-asr-early (WER%) | Kunkado (WER%) |
|---------|----------------|-----------------|----------------------|----------------|
| v0      | Character-wise | 45              | 46.5                 | -              |
| v1      | Character-wise | 46              | -                    | 55.5           |

These are greedy WER numbers without an external LM and without beam search decoding.
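For reference, WER is the word-level edit distance (substitutions + insertions + deletions) between hypothesis and reference, divided by the number of reference words. A minimal sketch, with made-up Bambara strings for illustration:

```python
def wer(ref: str, hyp: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))  # DP row for the empty reference prefix
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            # deletion, insertion, or substitution/match
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (rw != hw))
    return d[len(h)] / len(r)

# One substitution (taa -> ta) and one deletion (la) over 5 reference words:
print(wer("ne bɛ taa sugu la", "ne bɛ ta sugu"))  # prints: 0.4
```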

License

This model is released under the CC-BY-4.0 license. By using this model, you agree to the terms of the license.


Feel free to open a discussion on Hugging Face or file an issue on GitHub if you have any contributions.

