---
library_name: transformers
tags:
- generated_from_trainer
metrics:
- accuracy
model-index:
- name: matchboxnet3x2x64-bambara-a-c
  results: []
license: apache-2.0
datasets:
- Panga-Azazia/Bambara-Keyword-Spotting-Aug
language:
- bm
pipeline_tag: audio-classification
---


# matchboxnet3x2x64-bambara-a-c

This model was trained from scratch on the [Panga-Azazia/Bambara-Keyword-Spotting-Aug](https://huggingface.co/datasets/Panga-Azazia/Bambara-Keyword-Spotting-Aug) dataset. It achieves the following results on the evaluation set:
- Accuracy: 0.9362
- Loss: 0.1657

## Model description 
***MatchboxNet - an end-to-end neural network for speech command recognition.***

MatchboxNet is a deep residual network composed of blocks of 1D time-channel separable convolution, batch normalization, ReLU, and dropout layers.
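
To illustrate why time-channel separable convolutions keep the network small, the sketch below compares weight counts for a standard 1D convolution and a separable one (a depthwise convolution over time followed by a 1x1 pointwise convolution over channels). The 64 channels follow the "64" in the model name, assuming it denotes the channel count as in the MatchboxNet paper's naming; the kernel width of 13 is a hypothetical value chosen for illustration, and the function names are not part of the `matchboxnet` package.

```python
def conv1d_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Weights in a standard 1D convolution (bias omitted)."""
    return in_channels * out_channels * kernel_size


def separable_conv1d_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Weights in a time-channel separable 1D convolution (bias omitted).

    Depthwise step: one filter of length `kernel_size` per input channel.
    Pointwise step: a 1x1 convolution that mixes channels.
    """
    depthwise = in_channels * kernel_size
    pointwise = in_channels * out_channels
    return depthwise + pointwise


# Illustrative sizes: 64 channels and a hypothetical kernel width of 13.
channels, kernel = 64, 13
standard = conv1d_params(channels, channels, kernel)
separable = separable_conv1d_params(channels, channels, kernel)
print(standard, separable, round(standard / separable, 1))  # 53248 4928 10.8
```

Under these assumed sizes, the separable layer needs roughly a tenth of the weights of the standard one.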

## How to use this model

```bash
# Install matchboxnet
pip install git+https://github.com/Panga-az/matchboxnet.git
```
```python
from matchboxnet.model import MatchboxNetForAudioClassification
from matchboxnet.feature_extraction import MatchboxNetFeatureExtractor
import torch

model = MatchboxNetForAudioClassification.from_pretrained("Panga-Azazia/matchboxnet3x2x64-bambara-a-c")
feature_extractor = MatchboxNetFeatureExtractor.from_pretrained("Panga-Azazia/matchboxnet3x2x64-bambara-a-c")

# Path to the audio clip to classify (16 kHz audio works best)
audio = "audio.wav"
batch = feature_extractor(audio, return_tensors="pt")

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**batch)
    preds = outputs.logits.argmax(-1)

# Make sure label ids are integers before looking up the prediction
id2label = {int(k): v for k, v in model.config.id2label.items()}
label_name = id2label[preds.item()]

print(label_name)
```
## Intended uses & limitations

This model is intended for audio classification, particularly speech command recognition and keyword spotting in short audio clips.

**Limitations:**
- Performance depends on the training data, so accuracy may drop on keywords, speakers, or recording conditions that the dataset does not cover.
- The model is optimized for audio sampled at 16 kHz.
- It works best with audio durations similar to those used during training (typically around 1.2 seconds).
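
Before running inference, it can help to check a clip against the sample-rate and duration limitations above. The following is a minimal sketch using only Python's standard `wave` module, which reads uncompressed PCM WAV files; the helper names `describe_wav` and `is_compatible` are hypothetical and not part of the `matchboxnet` package.

```python
import wave

EXPECTED_SAMPLE_RATE = 16_000  # Hz, as stated in the limitations above


def describe_wav(path: str) -> tuple[int, float]:
    """Return (sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        duration = wav_file.getnframes() / sample_rate
    return sample_rate, duration


def is_compatible(path: str) -> bool:
    """True when the clip is sampled at the rate the model expects."""
    sample_rate, _ = describe_wav(path)
    return sample_rate == EXPECTED_SAMPLE_RATE
```

For example, `describe_wav("audio.wav")` returns the sample rate and duration of the clip used in the usage snippet. A clip that fails `is_compatible` would need to be resampled to 16 kHz first.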

## Training and evaluation data

The model was trained on the [Panga-Azazia/Bambara-Keyword-Spotting-Aug](https://huggingface.co/datasets/Panga-Azazia/Bambara-Keyword-Spotting-Aug) dataset,
which contains keyword-labeled speech samples in the Bambara language.

Evaluation on the validation set yields:
- **Accuracy: 0.9362**
- **Loss: 0.1657**

## Training procedure
This model was trained with the ***matchboxnet*** Python package, a custom implementation of the MatchboxNet architecture built on **PyTorch** and **Hugging Face Transformers**. The package is available on [GitHub](https://github.com/Panga-az/matchboxnet.git) and provides the components needed for feature extraction, configuration, model architecture, and training.

The training procedure closely follows the description in the original [MatchboxNet paper](https://arxiv.org/pdf/2004.08531):

- **Audio preprocessing**:  
  Raw audio is converted into a sequence of **64 MFCCs**, using 25 ms windows with a 10 ms stride.  
  Features are **zero-padded symmetrically** to ensure a fixed length of 128 time frames.

- **Data augmentation techniques** used during training:
  - Time shift perturbation in the range of **[−5, +5] milliseconds**
  - Additive white noise with magnitudes between **[−90, −46] dB**
  - **SpecAugment** with:
    - 2 time masks (size ∈ [0, 25] frames)
    - 2 frequency masks (size ∈ [0, 15] bands)
  - **SpecCutout** with 5 rectangular masks applied on the spectrogram.
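
The framing, padding, and time-masking steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the package's implementation: frames are counted without centre padding, an odd padding remainder goes on the right, longer inputs are centre-cropped, and masking is applied after padding. All function names are hypothetical.

```python
import random

SAMPLE_RATE = 16_000           # Hz, from the limitations section
WINDOW_MS, STRIDE_MS = 25, 10  # window and stride from the list above
TARGET_FRAMES = 128            # fixed length after padding


def num_frames(num_samples: int) -> int:
    """Frames produced without centre padding (an assumption)."""
    window = SAMPLE_RATE * WINDOW_MS // 1000  # 400 samples
    stride = SAMPLE_RATE * STRIDE_MS // 1000  # 160 samples
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // stride


def pad_symmetric(frames: list[list[float]], target: int = TARGET_FRAMES) -> list[list[float]]:
    """Zero-pad a [time][feature] matrix equally on both sides.

    An odd remainder goes on the right, and longer inputs are
    centre-cropped (both are assumptions).
    """
    n_features = len(frames[0]) if frames else 0
    missing = target - len(frames)
    if missing <= 0:
        start = (-missing) // 2
        return frames[start:start + target]
    left = missing // 2
    right = missing - left
    zero = [0.0] * n_features
    return [zero[:] for _ in range(left)] + frames + [zero[:] for _ in range(right)]


def mask_time(frames, num_masks=2, max_width=25, rng=None):
    """SpecAugment-style time masking: zero out `num_masks` random spans."""
    rng = rng or random.Random()
    masked = [row[:] for row in frames]
    for _ in range(num_masks):
        width = min(rng.randint(0, max_width), len(masked))
        start = rng.randint(0, len(masked) - width)
        for t in range(start, start + width):
            masked[t] = [0.0] * len(masked[t])
    return masked


one_second = num_frames(SAMPLE_RATE)                 # frames in a 1 s clip
features = [[1.0] * 64 for _ in range(one_second)]   # dummy 64-dim features
padded = pad_symmetric(features)
augmented = mask_time(padded, rng=random.Random(0))
print(one_second, len(padded), len(augmented))       # 98 128 128
```

Frequency masking would follow the same pattern along the feature axis, with 2 masks of width up to 15.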

Training was performed using the 🤗 `Trainer` with the following hyperparameters:

- **learning_rate**: 5e-5  
- **train_batch_size**: 4096  
- **eval_batch_size**: 4096  
- **gradient_accumulation_steps**: 16  
- **total_train_batch_size**: 65536  
- **optimizer**: AdamW with betas=(0.9, 0.999), epsilon=1e-08  
- **lr_scheduler_type**: linear  
- **num_epochs**: 1000  
- **mixed_precision_training**: Native AMP
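
The relationship between these hyperparameters can be checked with a short calculation. The sketch below assumes a single device, no warmup, and 1000 optimizer steps in total (the final step count in the training results table); none of these three values is stated explicitly in the hyperparameter list. The `linear_lr` function mirrors the usual linear decay-to-zero schedule and is a hypothetical helper, not code from the training run.

```python
PER_DEVICE_BATCH = 4096
GRAD_ACCUM_STEPS = 16
NUM_DEVICES = 1      # assumption: single device
PEAK_LR = 5e-5
TOTAL_STEPS = 1000   # assumption: taken from the training results table
WARMUP_STEPS = 0     # assumption: warmup is not stated


def effective_batch_size(per_device: int = PER_DEVICE_BATCH,
                         accum: int = GRAD_ACCUM_STEPS,
                         devices: int = NUM_DEVICES) -> int:
    """Samples contributing to each optimizer step."""
    return per_device * accum * devices


def linear_lr(step: int, peak: float = PEAK_LR,
              total: int = TOTAL_STEPS, warmup: int = WARMUP_STEPS) -> float:
    """Linear warmup to `peak`, then linear decay to zero at `total`."""
    if step < warmup:
        return peak * step / max(1, warmup)
    return peak * max(0.0, (total - step) / max(1, total - warmup))


print(effective_batch_size())  # 65536
for step in (0, 500, 1000):
    print(step, linear_lr(step))
```

With these assumptions, 4096 x 16 gives the reported total train batch size of 65536, and the learning rate falls from 5e-5 at step 0 to half that at step 500 and zero at step 1000.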

---

📘 **Documentation & Examples**  
For full usage instructions, see the [official documentation](https://panga-az.github.io/matchboxnet/) or explore example notebooks on [GitHub](https://github.com/Panga-Az/matchboxnet) demonstrating training, inference, and deployment.


### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 4096
- eval_batch_size: 4096
- seed: 0
- gradient_accumulation_steps: 16
- total_train_batch_size: 65536
- optimizer: adamw_torch with betas=(0.9,0.999) and epsilon=1e-08 (no additional optimizer arguments)
- lr_scheduler_type: linear
- num_epochs: 1000
- mixed_precision_training: Native AMP

### Training results

| Training Loss | Epoch  | Step | Accuracy | Validation Loss |
|:-------------:|:------:|:----:|:--------:|:---------------:|
| 2.753         | 100.0  | 100  | 0.5319   | 0.9275          |
| 0.6525        | 200.0  | 200  | 0.8894   | 0.3022          |
| 0.4197        | 300.0  | 300  | 0.9149   | 0.2035          |
| 0.3514        | 400.0  | 400  | 0.9234   | 0.1827          |
| 0.3104        | 500.0  | 500  | 0.9234   | 0.1741          |
| 0.2847        | 600.0  | 600  | 0.9319   | 0.1737          |
| 0.2682        | 700.0  | 700  | 0.9404   | 0.1682          |
| 0.2571        | 800.0  | 800  | 0.9362   | 0.1673          |
| 0.2521        | 900.0  | 900  | 0.9362   | 0.1666          |
| 0.2489        | 1000.0 | 1000 | 0.9362   | 0.1657          |


### Framework versions

- Transformers 4.53.0
- PyTorch 2.6.0+cu124
- Datasets 3.3.2
- Tokenizers 0.21.2