
matchboxnet3x2x64-bambara-a-c

This model was trained from scratch on the Panga-Azazia/Bambara-Keyword-Spotting-Aug dataset and achieves the following results on the evaluation set:

  • Accuracy: 0.9362
  • Loss: 0.1657

Model description

MatchboxNet is an end-to-end neural network for speech command recognition.

It is a deep residual network composed of blocks of 1D time-channel separable convolution, batch normalization, ReLU, and dropout layers.
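
To make the block structure concrete, here is a minimal PyTorch sketch of one such sub-block. The class name and defaults are illustrative, not the package's actual implementation:

import torch
import torch.nn as nn

class TimeChannelSeparableBlock(nn.Module):
    """Illustrative sub-block: a depthwise (time-wise) conv followed by a
    pointwise (channel-mixing) conv, then batch norm, ReLU, and dropout."""
    def __init__(self, in_channels: int, out_channels: int,
                 kernel_size: int, dropout: float = 0.1):
        super().__init__()
        self.depthwise = nn.Conv1d(
            in_channels, in_channels, kernel_size,
            padding=kernel_size // 2,   # keep the time dimension (odd kernels)
            groups=in_channels,         # one filter per channel: time-wise conv
        )
        self.pointwise = nn.Conv1d(in_channels, out_channels, kernel_size=1)
        self.bn = nn.BatchNorm1d(out_channels)
        self.act = nn.ReLU()
        self.drop = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, time)
        return self.drop(self.act(self.bn(self.pointwise(self.depthwise(x)))))

In the full network, several of these sub-blocks are stacked inside each residual block.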

How to use this model

# Install the matchboxnet package (shell command)
pip install git+https://github.com/Panga-az/matchboxnet.git

import torch

from matchboxnet.feature_extraction import MatchboxNetFeatureExtractor
from matchboxnet.model import MatchboxNetForAudioClassification

# Load the pretrained checkpoint and its matching feature extractor
model = MatchboxNetForAudioClassification.from_pretrained("Panga-Azazia/matchboxnet3x2x64-bambara-a-c")
feature_extractor = MatchboxNetFeatureExtractor.from_pretrained("Panga-Azazia/matchboxnet3x2x64-bambara-a-c")

# Convert a 16 kHz audio file into model inputs
audio = "audio.wav"
batch = feature_extractor(audio, return_tensors="pt")

# Run inference and pick the highest-scoring class
with torch.no_grad():
    outputs = model(**batch)
    preds = outputs.logits.argmax(-1)

# id2label keys may deserialize as strings; normalize them to int before lookup
id2label = {int(k): v for k, v in model.config.id2label.items()}
label_name = id2label[preds.item()]

print(label_name)

Intended uses & limitations

This model is intended for audio classification, particularly speech command recognition and keyword spotting in short audio clips. Limitations:

  • Performance depends on the dataset used for training.
  • The model is optimized for audio sampled at 16 kHz; resample other rates first, as in the sketch below.
  • It works best with audio durations similar to those used during training (roughly 1.28 seconds, i.e. 128 frames at a 10 ms stride).
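
If your audio is at a different sampling rate, resample it before feature extraction. A minimal sketch using torchaudio (one option among many; the file name is a placeholder):

import torchaudio

# Load an arbitrary clip and bring it to the 16 kHz rate the model expects
waveform, sr = torchaudio.load("clip.wav")
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16000)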

Training and evaluation data

The model was trained on the Panga-Azazia/Bambara-Keyword-Spotting-Aug dataset, which contains keyword-labeled speech samples in the Bambara language.
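
The dataset can be loaded from the Hugging Face Hub with the datasets library. The split names and column layout below are assumptions; inspect the dataset object to confirm the actual schema:

from datasets import load_dataset

ds = load_dataset("Panga-Azazia/Bambara-Keyword-Spotting-Aug")
print(ds)              # available splits and columns
print(ds["train"][0])  # one keyword-labeled audio sample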

Evaluation on the validation set yields:

  • Accuracy: 0.9362
  • Loss: 0.1657

Training procedure

This model was trained using the matchboxnet Python package, a custom implementation of the MatchboxNet architecture using PyTorch and Hugging Face Transformers. The package is available on GitHub and provides all necessary components for feature extraction, configuration, model architecture, and training.

The training procedure closely follows the description in the original MatchboxNet paper:

  • Audio preprocessing (sketched after this list):
    Raw audio is converted into a sequence of 64 MFCCs, using 25 ms windows with a 10 ms stride.
    Features are zero-padded symmetrically to a fixed length of 128 time frames.

  • Data augmentation techniques used during training:

    • Time shift perturbation in the range of [−5, +5] milliseconds
    • Additive white noise with magnitudes in the range of [−90, −46] dB
    • SpecAugment with:
      • 2 time masks (size ∈ [0, 25] frames)
      • 2 frequency masks (size ∈ [0, 15] bands)
    • SpecCutout with 5 rectangular masks applied on the spectrogram.
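
For illustration, here is an approximation of the preprocessing and masking steps using torchaudio. The package's own MatchboxNetFeatureExtractor handles this internally, so this sketch is for understanding, not a drop-in replacement; the file name is a placeholder:

import torch
import torchaudio

SAMPLE_RATE = 16000

# 64 MFCCs from 25 ms windows (400 samples at 16 kHz) with a 10 ms stride
# (160 samples)
mfcc = torchaudio.transforms.MFCC(
    sample_rate=SAMPLE_RATE,
    n_mfcc=64,
    melkwargs={"n_fft": 400, "win_length": 400, "hop_length": 160, "n_mels": 64},
)

waveform, _ = torchaudio.load("audio.wav")  # placeholder file name
feats = mfcc(waveform)                      # shape: (1, 64, time_frames)

# Symmetric zero-padding to a fixed 128 time frames (longer clips would
# need truncation, not shown)
pad = 128 - feats.shape[-1]
if pad > 0:
    feats = torch.nn.functional.pad(feats, (pad // 2, pad - pad // 2))

# SpecAugment-style masking at training time, with the mask sizes listed
# above; the MFCC axis stands in for the frequency bands here
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=25)
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=15)
feats = time_mask(time_mask(feats))  # 2 time masks
feats = freq_mask(freq_mask(feats))  # 2 frequency masks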

Training was performed using the 🤗 Trainer; the full hyperparameter set is listed under Training hyperparameters below.

📘 Documentation & Examples
For full usage instructions, see the official documentation or explore example notebooks on GitHub demonstrating training, inference, and deployment.

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-05
  • train_batch_size: 4096
  • eval_batch_size: 4096
  • seed: 0
  • gradient_accumulation_steps: 16
  • total_train_batch_size: 65536
  • optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: linear
  • num_epochs: 1000
  • mixed_precision_training: Native AMP
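
As a hedged sketch, here is how these values map onto Hugging Face TrainingArguments; the exact arguments the author used may differ, and the output path, model, and dataset variables are placeholders:

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="matchboxnet3x2x64-bambara-a-c",  # illustrative output path
    learning_rate=5e-5,
    per_device_train_batch_size=4096,
    per_device_eval_batch_size=4096,
    gradient_accumulation_steps=16,   # 4096 x 16 = 65536 effective batch
    num_train_epochs=1000,
    lr_scheduler_type="linear",
    seed=0,
    fp16=True,                        # Native AMP
)

# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=eval_ds)  # placeholders
# trainer.train()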

Training results

| Training Loss | Epoch  | Step | Accuracy | Validation Loss |
|:-------------:|:------:|:----:|:--------:|:---------------:|
| 2.753         | 100.0  | 100  | 0.5319   | 0.9275          |
| 0.6525        | 200.0  | 200  | 0.8894   | 0.3022          |
| 0.4197        | 300.0  | 300  | 0.9149   | 0.2035          |
| 0.3514        | 400.0  | 400  | 0.9234   | 0.1827          |
| 0.3104        | 500.0  | 500  | 0.9234   | 0.1741          |
| 0.2847        | 600.0  | 600  | 0.9319   | 0.1737          |
| 0.2682        | 700.0  | 700  | 0.9404   | 0.1682          |
| 0.2571        | 800.0  | 800  | 0.9362   | 0.1673          |
| 0.2521        | 900.0  | 900  | 0.9362   | 0.1666          |
| 0.2489        | 1000.0 | 1000 | 0.9362   | 0.1657          |

Framework versions

  • Transformers 4.53.0
  • PyTorch 2.6.0+cu124
  • Datasets 3.3.2
  • Tokenizers 0.21.2