---
library_name: transformers
tags:
- generated_from_trainer
metrics:
- accuracy
model-index:
- name: matchboxnet3x2x64-bambara-a-c
  results: []
license: apache-2.0
datasets:
- Panga-Azazia/Bambara-Keyword-Spotting-Aug
language:
- bm
pipeline_tag: audio-classification
---

# matchboxnet3x2x64-bambara-a-c

This model was trained from scratch on the [Panga-Azazia/Bambara-Keyword-Spotting-Aug](https://huggingface.co/datasets/Panga-Azazia/Bambara-Keyword-Spotting-Aug) dataset and achieves the following results on the evaluation set:

- Accuracy: 0.9362
- Loss: 0.1657

## Model description

***MatchboxNet - an end-to-end neural network for speech command recognition.***

MatchboxNet is a deep residual network composed of blocks of 1D time-channel separable convolution, batch-normalization, ReLU, and dropout layers.

## How to use this model

```bash
# Install matchboxnet
pip install git+https://github.com/Panga-az/matchboxnet.git
```

```python
import torch

from matchboxnet.feature_extraction import MatchboxNetFeatureExtractor
from matchboxnet.model import MatchboxNetForAudioClassification

# Load the pretrained model and its matching feature extractor from the Hub.
model = MatchboxNetForAudioClassification.from_pretrained("Panga-Azazia/matchboxnet3x2x64-bambara-a-c")
feature_extractor = MatchboxNetFeatureExtractor.from_pretrained("Panga-Azazia/matchboxnet3x2x64-bambara-a-c")

# Path to the audio clip to classify.
audio = "audio.wav"
batch = feature_extractor(audio, return_tensors="pt")

# Run inference without tracking gradients and take the highest-scoring class.
with torch.no_grad():
    outputs = model(**batch)
    preds = outputs.logits.argmax(-1)

# Make sure the label-map keys are integers before looking up the prediction.
model.config.id2label = {int(k): v for k, v in model.config.id2label.items()}
id2label = model.config.id2label
label_name = id2label[preds.item()]
print(label_name)
```

## Intended uses & limitations

This model is intended for audio classification, particularly speech command recognition and keyword spotting in short audio clips.

**Limitations:**

- Performance depends on the dataset used for training.
- The model is optimized for audio sampled at 16 kHz.
- It works best with audio durations similar to those used during training (typically ~1.2.. seconds).

## Training and evaluation data

The model was trained on the [Panga-Azazia/Bambara-Keyword-Spotting-Aug](https://huggingface.co/datasets/Panga-Azazia/Bambara-Keyword-Spotting-Aug) dataset, which contains keyword-labeled speech samples in the Bambara language.

Evaluation on the validation set yields:

- **Accuracy: 0.9362**
- **Loss: 0.1657**

## Training procedure

This model was trained using the ***matchboxnet*** Python package, a custom implementation of the MatchboxNet architecture built on **PyTorch** and **Hugging Face Transformers**. The package is available on [GitHub](https://github.com/Panga-az/matchboxnet.git) and provides all necessary components for feature extraction, configuration, model architecture, and training.

The training procedure closely follows the description in the original [MatchboxNet paper](https://arxiv.org/pdf/2004.08531):

- **Audio preprocessing**: Raw audio is converted into a sequence of **64 MFCCs**, using 25 ms windows with a 10 ms stride. Features are **zero-padded symmetrically** to ensure a fixed length of 128 time frames.
- **Data augmentation techniques** used during training:
  - Time shift perturbation in the range of **[−5, +5] milliseconds**
  - Additive white noise with magnitudes between **[−90, −46] dB**
  - **SpecAugment** with:
    - 2 time masks (size ∈ [0, 25] frames)
    - 2 frequency masks (size ∈ [0, 15] bands)
  - **SpecCutout** with 5 rectangular masks applied to the spectrogram.
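
The symmetric padding and the two waveform-level augmentations described above can be sketched as follows. This is a minimal NumPy illustration, not the code of the `matchboxnet` package: the function names `pad_symmetric`, `time_shift`, and `add_white_noise` are hypothetical, and three behaviors are assumptions made for the sketch (center-cropping inputs longer than 128 frames, zero-filling the samples vacated by a shift, and reading the dB range as amplitude relative to full scale).

```python
import numpy as np


def pad_symmetric(features: np.ndarray, target_frames: int = 128) -> np.ndarray:
    """Zero-pad a (n_mfcc, n_frames) array equally on both sides of the time axis.

    Assumption: inputs longer than target_frames are center-cropped.
    """
    n_frames = features.shape[1]
    if n_frames >= target_frames:
        start = (n_frames - target_frames) // 2
        return features[:, start:start + target_frames]
    total = target_frames - n_frames
    left = total // 2
    right = total - left  # the right side gets the extra frame when total is odd
    return np.pad(features, ((0, 0), (left, right)), mode="constant")


def time_shift(waveform: np.ndarray, rng: np.random.Generator,
               sample_rate: int = 16000, max_shift_ms: float = 5.0) -> np.ndarray:
    """Shift a 1-D waveform by a random offset in [-max_shift_ms, +max_shift_ms].

    Assumption: vacated samples are filled with zeros (no wrap-around).
    """
    max_shift = int(sample_rate * max_shift_ms / 1000)
    shift = int(rng.integers(-max_shift, max_shift + 1))
    shifted = np.zeros_like(waveform)
    if shift > 0:
        shifted[shift:] = waveform[:-shift]
    elif shift < 0:
        shifted[:shift] = waveform[-shift:]
    else:
        shifted[:] = waveform
    return shifted


def add_white_noise(waveform: np.ndarray, rng: np.random.Generator,
                    min_db: float = -90.0, max_db: float = -46.0) -> np.ndarray:
    """Add Gaussian white noise at a random level drawn from [min_db, max_db].

    Assumption: the level is an amplitude in dB relative to full scale (1.0).
    """
    level_db = rng.uniform(min_db, max_db)
    amplitude = 10.0 ** (level_db / 20.0)
    return waveform + amplitude * rng.standard_normal(waveform.shape)
```

For example, a clip that yields 100 MFCC frames would receive 14 zero frames on each side under this scheme, giving the fixed 64 × 128 input.
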

Training was performed using the 🤗 `Trainer`; the hyperparameters are listed under "Training hyperparameters".

---

📘 **Documentation & Examples**

For full usage instructions, see the [official documentation](https://panga-az.github.io/matchboxnet/) or explore example notebooks on [GitHub](https://github.com/Panga-Az/matchboxnet) demonstrating training, inference, and deployment.

### Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 5e-05
- train_batch_size: 4096
- eval_batch_size: 4096
- seed: 0
- gradient_accumulation_steps: 16
- total_train_batch_size: 65536
- optimizer: adamw_torch (AdamW) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- num_epochs: 1000
- mixed_precision_training: Native AMP

### Training results

| Training Loss | Epoch  | Step | Accuracy | Validation Loss |
|:-------------:|:------:|:----:|:--------:|:---------------:|
| 2.753         | 100.0  | 100  | 0.5319   | 0.9275          |
| 0.6525        | 200.0  | 200  | 0.8894   | 0.3022          |
| 0.4197        | 300.0  | 300  | 0.9149   | 0.2035          |
| 0.3514        | 400.0  | 400  | 0.9234   | 0.1827          |
| 0.3104        | 500.0  | 500  | 0.9234   | 0.1741          |
| 0.2847        | 600.0  | 600  | 0.9319   | 0.1737          |
| 0.2682        | 700.0  | 700  | 0.9404   | 0.1682          |
| 0.2571        | 800.0  | 800  | 0.9362   | 0.1673          |
| 0.2521        | 900.0  | 900  | 0.9362   | 0.1666          |
| 0.2489        | 1000.0 | 1000 | 0.9362   | 0.1657          |

### Framework versions

- Transformers 4.53.0
- Pytorch 2.6.0+cu124
- Datasets 3.3.2
- Tokenizers 0.21.2