# matchboxnet3x2x64-bambara-a-c
This model was trained from scratch on the Panga-Azazia/Bambara-Keyword-Spotting-Aug dataset and achieves the following results on the evaluation set:
- Accuracy: 0.9362
- Loss: 0.1657
## Model description
MatchboxNet is an end-to-end neural network for speech command recognition: a deep residual network composed of blocks of 1D time-channel separable convolutions, batch normalization, ReLU, and dropout layers.
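To make the block structure concrete, here is a minimal PyTorch sketch of one time-channel separable convolution block. This is an illustration only, not the package's exact implementation; the kernel size and dropout rate are assumptions, and the 64-channel width follows the "64" in the model name.

```python
# Illustrative sketch of a 1D time-channel separable conv block
# (kernel size and dropout are assumed, not confirmed values).
import torch
import torch.nn as nn

class TCSConvBlock(nn.Module):
    """1D time-channel separable convolution + BatchNorm + ReLU + Dropout."""
    def __init__(self, channels: int = 64, kernel_size: int = 13, dropout: float = 0.1):
        super().__init__()
        # Depthwise conv: each channel is convolved independently over time
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        # Pointwise 1x1 conv: mixes information across channels
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)
        self.bn = nn.BatchNorm1d(channels)
        self.act = nn.ReLU()
        self.drop = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.drop(self.act(self.bn(self.pointwise(self.depthwise(x)))))

x = torch.randn(8, 64, 128)      # (batch, channels, time frames)
print(TCSConvBlock()(x).shape)   # torch.Size([8, 64, 128])
```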
## How to use this model
```bash
# Install the matchboxnet package
pip install git+https://github.com/Panga-az/matchboxnet.git
```

```python
import torch

from matchboxnet.model import MatchboxNetForAudioClassification
from matchboxnet.feature_extraction import MatchboxNetFeatureExtractor

# Load the pretrained model and its matching feature extractor
model = MatchboxNetForAudioClassification.from_pretrained("Panga-Azazia/matchboxnet3x2x64-bambara-a-c")
feature_extractor = MatchboxNetFeatureExtractor.from_pretrained("Panga-Azazia/matchboxnet3x2x64-bambara-a-c")

# Extract features from an audio file (16 kHz expected)
audio = "audio.wav"
batch = feature_extractor(audio, return_tensors="pt")

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**batch)

preds = outputs.logits.argmax(-1)

# id2label keys may be deserialized as strings; normalize them to ints
model.config.id2label = {int(k): v for k, v in model.config.id2label.items()}
id2label = model.config.id2label

label_name = id2label[preds.item()]
print(label_name)
```
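Continuing the snippet above, the logits can also be turned into class probabilities to inspect the model's confidence:

```python
# Optional follow-up to the snippet above: show the top-5 classes by probability.
import torch.nn.functional as F

probs = F.softmax(outputs.logits, dim=-1).squeeze(0)
top = sorted(enumerate(probs.tolist()), key=lambda kv: kv[1], reverse=True)[:5]
for idx, p in top:
    print(f"{id2label[idx]}: {p:.3f}")
```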
## Intended uses & limitations
This model is intended for audio classification, particularly speech command recognition and keyword spotting in short audio clips. Limitations:
- Performance depends on the dataset used for training.
- The model is optimized for audio sampled at 16 kHz (see the resampling sketch after this list).
- It works best with audio durations similar to those used during training (~1.28 seconds, i.e., the 128-frame input at a 10 ms stride).
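Audio at other sampling rates should be resampled to 16 kHz before feature extraction. A minimal sketch, assuming torchaudio (any resampler works):

```python
# Resample arbitrary audio to the 16 kHz rate the model expects
# (torchaudio is an assumption; it is not required by the model card).
import torchaudio

waveform, sr = torchaudio.load("audio.wav")
if sr != 16_000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16_000)
```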
## Training and evaluation data
The model was trained on the Panga-Azazia/Bambara-Keyword-Spotting-Aug dataset, which contains keyword-labeled speech samples in the Bambara language.
Evaluation on the validation set yields:
- Accuracy: 0.9362
- Loss: 0.1657
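The dataset itself can be loaded with the Hugging Face datasets library; the available split names are not documented here, so inspect the returned object:

```python
# Load the training data with the Hugging Face datasets library;
# inspect the returned DatasetDict for the available splits and features.
from datasets import load_dataset

ds = load_dataset("Panga-Azazia/Bambara-Keyword-Spotting-Aug")
print(ds)
```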
## Training procedure
This model was trained using the matchboxnet Python package, a custom implementation of the MatchboxNet architecture built on PyTorch and Hugging Face Transformers. The package is available on GitHub and provides all necessary components for feature extraction, configuration, model architecture, and training.
The training procedure closely follows the description in the original MatchboxNet paper:
Audio preprocessing:
Raw audio is converted into a sequence of 64 MFCCs, using 25 ms windows with a 10 ms stride.
Features are zero-padded symmetrically to ensure a fixed length of 128 time frames; a minimal sketch of this pipeline appears below.
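As an illustration, here is a minimal sketch of this preprocessing using torchaudio. torchaudio is an assumption here (the matchboxnet package ships its own MatchboxNetFeatureExtractor), but the coefficient count, window, stride, and 128-frame padding follow the description above.

```python
# Minimal sketch of the preprocessing described above (torchaudio assumed;
# n_fft and n_mels below are illustrative choices, not confirmed values).
import torch
import torchaudio

SAMPLE_RATE = 16_000
WIN_LENGTH = int(0.025 * SAMPLE_RATE)  # 25 ms window -> 400 samples
HOP_LENGTH = int(0.010 * SAMPLE_RATE)  # 10 ms stride -> 160 samples
TARGET_FRAMES = 128                    # fixed input length

mfcc = torchaudio.transforms.MFCC(
    sample_rate=SAMPLE_RATE,
    n_mfcc=64,  # 64 MFCC coefficients
    melkwargs={"n_fft": 512, "win_length": WIN_LENGTH,
               "hop_length": HOP_LENGTH, "n_mels": 64},
)

waveform, sr = torchaudio.load("audio.wav")  # expects 16 kHz input
features = mfcc(waveform)                    # (channels, n_mfcc, time)

# Zero-pad symmetrically (or crop) to exactly 128 time frames
n_time = features.shape[-1]
if n_time < TARGET_FRAMES:
    pad = TARGET_FRAMES - n_time
    features = torch.nn.functional.pad(features, (pad // 2, pad - pad // 2))
else:
    features = features[..., :TARGET_FRAMES]
```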
Data augmentation techniques used during training (see the sketch after this list):
- Time shift perturbation in the range [−5, +5] milliseconds
- Additive white noise with magnitudes between [−90, −46] dB
- SpecAugment with:
  - 2 time masks (size ∈ [0, 25] frames)
  - 2 frequency masks (size ∈ [0, 15] bands)
- SpecCutout with 5 rectangular masks applied to the spectrogram
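Below is a hedged sketch of these augmentations. The torchaudio masking transforms implement the SpecAugment step; the time shift, noise, and SpecCutout steps are hand-rolled here, and their internal details (e.g., SpecCutout mask sizes) are assumptions.

```python
# Hedged sketch of the augmentations listed above; exact implementation
# details (e.g., SpecCutout rectangle sizes) are assumptions.
import torch
import torchaudio

SAMPLE_RATE = 16_000

def augment_waveform(waveform: torch.Tensor) -> torch.Tensor:
    # Time shift perturbation in [-5, +5] ms (80 samples at 16 kHz)
    max_shift = int(0.005 * SAMPLE_RATE)
    shift = int(torch.randint(-max_shift, max_shift + 1, (1,)))
    waveform = torch.roll(waveform, shifts=shift, dims=-1)
    # Additive white noise with magnitude drawn from [-90, -46] dB
    level_db = float(torch.empty(1).uniform_(-90.0, -46.0))
    return waveform + torch.randn_like(waveform) * 10.0 ** (level_db / 20.0)

def augment_features(features: torch.Tensor) -> torch.Tensor:
    # SpecAugment: 2 time masks (<= 25 frames) and 2 frequency masks (<= 15 bands)
    time_mask = torchaudio.transforms.TimeMasking(time_mask_param=25)
    freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=15)
    for _ in range(2):
        features = time_mask(features)
        features = freq_mask(features)
    # SpecCutout: 5 rectangular zero masks at random positions (sizes assumed)
    _, n_freq, n_time = features.shape
    for _ in range(5):
        fh = int(torch.randint(1, 16, (1,)))
        tw = int(torch.randint(1, 26, (1,)))
        f0 = int(torch.randint(0, n_freq - fh + 1, (1,)))
        t0 = int(torch.randint(0, n_time - tw + 1, (1,)))
        features[:, f0:f0 + fh, t0:t0 + tw] = 0.0
    return features
```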
Training was performed using the Hugging Face Trainer with the hyperparameters listed in the Training hyperparameters section below.
## Documentation & Examples
For full usage instructions, see the official documentation or explore example notebooks on GitHub demonstrating training, inference, and deployment.
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 4096
- eval_batch_size: 4096
- seed: 0
- gradient_accumulation_steps: 16
- total_train_batch_size: 65536
- optimizer: AdamW (adamw_torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- num_epochs: 1000
- mixed_precision_training: Native AMP
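For reference, these settings map roughly onto the following Hugging Face TrainingArguments; this is a sketch, not the exact training script, and output_dir is a placeholder.

```python
# Rough equivalent of the hyperparameters above as TrainingArguments
# (output_dir is a placeholder; other script details are not reproduced here).
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="matchboxnet3x2x64-bambara-a-c",
    learning_rate=5e-5,
    per_device_train_batch_size=4096,
    per_device_eval_batch_size=4096,
    gradient_accumulation_steps=16,   # effective batch: 4096 * 16 = 65536
    num_train_epochs=1000,
    lr_scheduler_type="linear",
    seed=0,
    fp16=True,                        # native AMP mixed precision
)
```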
### Training results
| Training Loss | Epoch  | Step | Accuracy | Validation Loss |
|:-------------:|:------:|:----:|:--------:|:---------------:|
| 2.753         | 100.0  | 100  | 0.5319   | 0.9275          |
| 0.6525        | 200.0  | 200  | 0.8894   | 0.3022          |
| 0.4197        | 300.0  | 300  | 0.9149   | 0.2035          |
| 0.3514        | 400.0  | 400  | 0.9234   | 0.1827          |
| 0.3104        | 500.0  | 500  | 0.9234   | 0.1741          |
| 0.2847        | 600.0  | 600  | 0.9319   | 0.1737          |
| 0.2682        | 700.0  | 700  | 0.9404   | 0.1682          |
| 0.2571        | 800.0  | 800  | 0.9362   | 0.1673          |
| 0.2521        | 900.0  | 900  | 0.9362   | 0.1666          |
| 0.2489        | 1000.0 | 1000 | 0.9362   | 0.1657          |
### Framework versions
- Transformers 4.53.0
- PyTorch 2.6.0+cu124
- Datasets 3.3.2
- Tokenizers 0.21.2