
matchboxnet3x2x64-bambara-a-c

This model was trained from scratch on the Panga-Azazia/Bambara-Keyword-Spotting-Aug dataset and achieves the following results on the evaluation set:

  • Accuracy: 0.9362
  • Loss: 0.1657

Model description

MatchboxNet is an end-to-end neural network for speech command recognition.

It is a deep residual network composed of blocks of 1D time-channel separable convolution, batch normalization, ReLU, and dropout layers.
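
To make the block structure concrete, here is a minimal PyTorch sketch of one such sub-block. The class name and defaults are illustrative, not the package's actual implementation:

import torch
import torch.nn as nn

class TimeChannelSeparableBlock(nn.Module):
    """Illustrative sub-block: a depthwise (time-wise) conv followed by a
    pointwise (channel-mixing) conv, then batch norm, ReLU, and dropout."""
    def __init__(self, in_channels: int, out_channels: int,
                 kernel_size: int, dropout: float = 0.1):
        super().__init__()
        self.depthwise = nn.Conv1d(
            in_channels, in_channels, kernel_size,
            padding=kernel_size // 2,   # keep the time dimension (odd kernels)
            groups=in_channels,         # one filter per channel: time-wise conv
        )
        self.pointwise = nn.Conv1d(in_channels, out_channels, kernel_size=1)
        self.bn = nn.BatchNorm1d(out_channels)
        self.act = nn.ReLU()
        self.drop = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, time)
        return self.drop(self.act(self.bn(self.pointwise(self.depthwise(x)))))

In the full network, several of these sub-blocks are stacked inside each residual block.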

How to use this model

# Install the matchboxnet package (shell command)
pip install git+https://github.com/Panga-az/matchboxnet.git

import torch

from matchboxnet.feature_extraction import MatchboxNetFeatureExtractor
from matchboxnet.model import MatchboxNetForAudioClassification

# Load the pretrained checkpoint and its matching feature extractor
model = MatchboxNetForAudioClassification.from_pretrained("Panga-Azazia/matchboxnet3x2x64-bambara-a-c")
feature_extractor = MatchboxNetFeatureExtractor.from_pretrained("Panga-Azazia/matchboxnet3x2x64-bambara-a-c")

# Convert a 16 kHz audio file into model inputs
audio = "audio.wav"
batch = feature_extractor(audio, return_tensors="pt")

# Run inference and pick the highest-scoring class
with torch.no_grad():
    outputs = model(**batch)
    preds = outputs.logits.argmax(-1)

# id2label keys may deserialize as strings; normalize them to int before lookup
id2label = {int(k): v for k, v in model.config.id2label.items()}
label_name = id2label[preds.item()]

print(label_name)

Intended uses & limitations

This model is intended for audio classification, particularly speech command recognition and keyword spotting in short audio clips. Limitations:

  • Performance depends on the dataset used for training.
  • The model is optimized for audio sampled at 16 kHz; resample other rates first, as in the sketch below.
  • It works best with audio durations similar to those used during training (roughly 1.28 seconds, i.e. 128 frames at a 10 ms stride).
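
If your audio is at a different sampling rate, resample it before feature extraction. A minimal sketch using torchaudio (one option among many; the file name is a placeholder):

import torchaudio

# Load an arbitrary clip and bring it to the 16 kHz rate the model expects
waveform, sr = torchaudio.load("clip.wav")
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16000)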

Training and evaluation data

The model was trained on the Panga-Azazia/Bambara-Keyword-Spotting-Aug dataset, which contains keyword-labeled speech samples in the Bambara language.
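
The dataset can be loaded from the Hugging Face Hub with the datasets library. The split names and column layout below are assumptions; inspect the dataset object to confirm the actual schema:

from datasets import load_dataset

ds = load_dataset("Panga-Azazia/Bambara-Keyword-Spotting-Aug")
print(ds)              # available splits and columns
print(ds["train"][0])  # one keyword-labeled audio sample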

Evaluation on the validation set yields:

  • Accuracy: 0.9362
  • Loss: 0.1657

Training procedure

This model was trained using the matchboxnet Python package, a custom implementation of the MatchboxNet architecture using PyTorch and Hugging Face Transformers. The package is available on GitHub and provides all necessary components for feature extraction, configuration, model architecture, and training.

The training procedure closely follows the description in the original MatchboxNet paper:

  • Audio preprocessing (sketched after this list):
    Raw audio is converted into a sequence of 64 MFCCs, using 25 ms windows with a 10 ms stride.
    Features are zero-padded symmetrically to a fixed length of 128 time frames.

  • Data augmentation techniques used during training:

    • Time shift perturbation in the range of [−5, +5] milliseconds
    • Additive white noise with magnitudes in the range of [−90, −46] dB
    • SpecAugment with:
      • 2 time masks (size ∈ [0, 25] frames)
      • 2 frequency masks (size ∈ [0, 15] bands)
    • SpecCutout with 5 rectangular masks applied on the spectrogram.
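
For illustration, here is an approximation of the preprocessing and masking steps using torchaudio. The package's own MatchboxNetFeatureExtractor handles this internally, so this sketch is for understanding, not a drop-in replacement; the file name is a placeholder:

import torch
import torchaudio

SAMPLE_RATE = 16000

# 64 MFCCs from 25 ms windows (400 samples at 16 kHz) with a 10 ms stride
# (160 samples)
mfcc = torchaudio.transforms.MFCC(
    sample_rate=SAMPLE_RATE,
    n_mfcc=64,
    melkwargs={"n_fft": 400, "win_length": 400, "hop_length": 160, "n_mels": 64},
)

waveform, _ = torchaudio.load("audio.wav")  # placeholder file name
feats = mfcc(waveform)                      # shape: (1, 64, time_frames)

# Symmetric zero-padding to a fixed 128 time frames (longer clips would
# need truncation, not shown)
pad = 128 - feats.shape[-1]
if pad > 0:
    feats = torch.nn.functional.pad(feats, (pad // 2, pad - pad // 2))

# SpecAugment-style masking at training time, with the mask sizes listed
# above; the MFCC axis stands in for the frequency bands here
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=25)
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=15)
feats = time_mask(time_mask(feats))  # 2 time masks
feats = freq_mask(freq_mask(feats))  # 2 frequency masks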

Training was performed using the 🤗 Trainer; the full hyperparameter set is listed under Training hyperparameters below.

📘 Documentation & Examples
For full usage instructions, see the official documentation or explore example notebooks on GitHub demonstrating training, inference, and deployment.

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-05
  • train_batch_size: 4096
  • eval_batch_size: 4096
  • seed: 0
  • gradient_accumulation_steps: 16
  • total_train_batch_size: 65536
  • optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: linear
  • num_epochs: 1000
  • mixed_precision_training: Native AMP
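
As a hedged sketch, here is how these values map onto Hugging Face TrainingArguments; the exact arguments the author used may differ, and the output path, model, and dataset variables are placeholders:

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="matchboxnet3x2x64-bambara-a-c",  # illustrative output path
    learning_rate=5e-5,
    per_device_train_batch_size=4096,
    per_device_eval_batch_size=4096,
    gradient_accumulation_steps=16,   # 4096 x 16 = 65536 effective batch
    num_train_epochs=1000,
    lr_scheduler_type="linear",
    seed=0,
    fp16=True,                        # Native AMP
)

# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=eval_ds)  # placeholders
# trainer.train()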

Training results

| Training Loss | Epoch  | Step | Accuracy | Validation Loss |
|:-------------:|:------:|:----:|:--------:|:---------------:|
| 2.753         | 100.0  | 100  | 0.5319   | 0.9275          |
| 0.6525        | 200.0  | 200  | 0.8894   | 0.3022          |
| 0.4197        | 300.0  | 300  | 0.9149   | 0.2035          |
| 0.3514        | 400.0  | 400  | 0.9234   | 0.1827          |
| 0.3104        | 500.0  | 500  | 0.9234   | 0.1741          |
| 0.2847        | 600.0  | 600  | 0.9319   | 0.1737          |
| 0.2682        | 700.0  | 700  | 0.9404   | 0.1682          |
| 0.2571        | 800.0  | 800  | 0.9362   | 0.1673          |
| 0.2521        | 900.0  | 900  | 0.9362   | 0.1666          |
| 0.2489        | 1000.0 | 1000 | 0.9362   | 0.1657          |

Framework versions

  • Transformers 4.53.0
  • PyTorch 2.6.0+cu124
  • Datasets 3.3.2
  • Tokenizers 0.21.2