Orange
/

SSA-HuBERT-base-60k

+---
+license: cc-by-nc-4.0
+metrics:
+- cer
+- wer
+library_name: speechbrain
+pipeline_tag: automatic-speech-recognition
+tags:
+- speech processing
+- self-supervision
+- african languages
+- fine-tuning
+---
+## Model description
+This self-supervised speech model (a.k.a. SSA-HuBERT-base-60k) is based on a HuBERT Base architecture (~95M params) [1].
+It was trained on nearly 60 000 hours of speech segments and covers 21 languages and variants spoken in Sub-Saharan Africa.
+### Pretraining data
+- Dataset: The training dataset was composed of both studio recordings (controlled environment, prepared talks) and street interviews (noisy environment, spontaneous speech).
+- Languages: Bambara (bam), Dyula (dyu), French (fra), Fula (ful), Fulfulde (ffm), Fulfulde (fuh), Gulmancema (gux), Hausa (hau), Kinyarwanda (kin), Kituba (ktu), Lingala (lin), Luba-Lulua (lua), Mossi (mos), Maninkakan (mwk), Sango (sag), Songhai (son), Swahili (swc), Swahili (swh), Tamasheq (taq), Wolof (wol), Zarma (dje).
+## ASR fine-tuning
+The SpeechBrain toolkit (Ravanelli et al., 2021) is used to fine-tune the model.
+Fine-tuning is done for each language using the FLEURS dataset [2].
+The pretrained model (SSA-HuBERT-base-60k) is used as the speech encoder and is fully fine-tuned together with two 1024-unit linear layers and a softmax output layer on top.
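The fine-tuning head described above can be sketched roughly as follows in PyTorch. This is a minimal illustration, not the authors' SpeechBrain recipe: the class name `AsrHead`, the `LeakyReLU` activations, and the vocabulary size of 64 are assumptions, and a random tensor stands in for the encoder output. The only figure taken from the architecture itself is the 768-dimensional hidden size of HuBERT Base.

```python
import torch
import torch.nn as nn


class AsrHead(nn.Module):
    """Two 1024-unit linear layers followed by a softmax output layer.

    Hypothetical sketch: the activation and vocab_size are assumptions,
    not values stated in the model card.
    """

    def __init__(self, encoder_dim=768, hidden_dim=1024, vocab_size=64):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(encoder_dim, hidden_dim),
            nn.LeakyReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.LeakyReLU(),
        )
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, encoder_states):
        # encoder_states: (batch, frames, encoder_dim)
        logits = self.output(self.layers(encoder_states))
        # Log-softmax over the vocabulary for each frame.
        return torch.log_softmax(logits, dim=-1)


head = AsrHead()
frames = torch.randn(2, 50, 768)  # stand-in for HuBERT Base encoder output
log_probs = head(frames)
print(log_probs.shape)
```

In a full fine-tuning setup the encoder parameters would be optimized together with this head rather than kept frozen, which is what "fully fine-tuned" refers to.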
+## License
+This model is released under the CC BY-NC 4.0 license.
+## Publication
+This model was presented at AfricaNLP 2024.
+The associated paper is available here: [Africa-Centric Self-Supervised Pre-Training for Multilingual Speech Representation in a Sub-Saharan Context](https://openreview.net/forum?id=zLOhcft2E7)
+### Citation
+Please cite our paper when using the SSA-HuBERT-base-60k model:
+	Caubrière, A., & Gauthier, E. (2024). Africa-Centric Self-Supervised Pre-Training for Multilingual Speech Representation in a Sub-Saharan Context. In 5th Workshop on African Natural Language Processing (AfricaNLP 2024).
+**Bibtex citation:**
+@inproceedings{caubri{\`e}re2024ssaspeechssl,
+	title={Africa-Centric Self-Supervised Pretraining for Multilingual Speech Representation in a Sub-Saharan Context},
+	author={Antoine Caubri{\`e}re and Elodie Gauthier},
+	booktitle={5th Workshop on African Natural Language Processing},
+	year={2024},
+	url={https://openreview.net/forum?id=zLOhcft2E7}}
+## Results
+The following results were obtained with greedy decoding (no language model rescoring).
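As an illustration of what greedy decoding means here, the sketch below takes the most probable label in each frame, merges consecutive repeats, and drops blanks. It assumes a CTC-style output, which this card does not state explicitly; the function name, toy vocabulary, and posteriors are hypothetical.

```python
from itertools import groupby


def ctc_greedy_decode(frame_posteriors, id_to_char, blank_id=0):
    """Best label per frame, merge consecutive repeats, then remove blanks."""
    best_ids = [
        max(range(len(frame)), key=frame.__getitem__)
        for frame in frame_posteriors
    ]
    collapsed = [label for label, _ in groupby(best_ids)]
    return "".join(id_to_char[i] for i in collapsed if i != blank_id)


# Hypothetical toy vocabulary (index 0 is the blank) and 6 frames of posteriors.
vocab = {1: "h", 2: "i", 3: " "}
posteriors = [
    [0.10, 0.80, 0.05, 0.05],  # h
    [0.10, 0.70, 0.10, 0.10],  # h again, merged with the previous frame
    [0.90, 0.03, 0.03, 0.04],  # blank
    [0.10, 0.10, 0.70, 0.10],  # i
    [0.20, 0.10, 0.60, 0.10],  # i again, merged
    [0.80, 0.10, 0.05, 0.05],  # blank
]
print(ctc_greedy_decode(posteriors, vocab))
```

No language model or beam search is involved: each frame is decided independently, which is why the reported scores can be read as a lower bound on what rescoring might achieve.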
+Character error rates (CERs) and word error rates (WERs) are given in the table below for the 20 languages of the SSA subset of the FLEURS dataset.
+| **Language**      | **CER**   | **WER**   |
+| :----------------- | :--------- | :--------- |
+| **Afrikaans**     | 23.3      | 68.4      |
+| **Amharic**       | 15.9      | 52.7      |
+| **Fula**          | 21.2      | 61.9      |
+| **Ganda**         | 11.5      | 52.8      |
+| **Hausa**         | 10.5      | 32.5      |
+| **Igbo**          | 19.7      | 57.5      |
+| **Kamba**         | 16.1      | 53.9      |
+| **Lingala**       | 8.7       | 24.7      |
+| **Luo**           | 9.9       | 38.9      |
+| **Northern Sotho** | 13.5     | 43.2      |
+| **Nyanja**        | 13.3      | 54.2      |
+| **Oromo**         | 22.8      | 78.1      |
+| **Shona**         | 11.6      | 50.2      |
+| **Somali**        | 21.6      | 64.9      |
+| **Swahili**       | 7.1       | 23.8      |
+| **Umbundu**       | 21.7      | 61.7      |
+| **Wolof**         | 19.4      | 55.0      |
+| **Xhosa**         | 11.9      | 51.6      |
+| **Yoruba**        | 24.3      | 67.5      |
+| **Zulu**          | 12.2      | 53.4      |
+| *Overall average* | *15.8*    | *52.3*    |
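For readers unfamiliar with the metrics, the sketch below shows one common way to compute CER and WER (edit distance divided by reference length, in percent) and then checks that the "Overall average" row is consistent with an unweighted mean over the 20 languages. The scoring tool actually used is not specified in this card, so the helper names and the toy sentence pair are assumptions; the per-language scores are copied from the table.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="word"):
    """Corpus-level error rate in percent; unit is 'word' or 'char'."""
    split = str.split if unit == "word" else list
    edits = sum(edit_distance(split(r), split(h)) for r, h in zip(refs, hyps))
    total = sum(len(split(r)) for r in refs)
    return 100.0 * edits / total


# Hypothetical reference/hypothesis pair, for illustration only.
refs = ["the cat sat"]
hyps = ["the cat sit"]
wer_example = error_rate(refs, hyps, unit="word")  # 1 wrong word out of 3
cer_example = error_rate(refs, hyps, unit="char")  # 1 wrong character out of 11

# Per-language scores copied from the table above, in row order.
table_cer = [23.3, 15.9, 21.2, 11.5, 10.5, 19.7, 16.1, 8.7, 9.9, 13.5,
             13.3, 22.8, 11.6, 21.6, 7.1, 21.7, 19.4, 11.9, 24.3, 12.2]
table_wer = [68.4, 52.7, 61.9, 52.8, 32.5, 57.5, 53.9, 24.7, 38.9, 43.2,
             54.2, 78.1, 50.2, 64.9, 23.8, 61.7, 55.0, 51.6, 67.5, 53.4]
mean_cer = round(sum(table_cer) / len(table_cer), 1)
mean_wer = round(sum(table_wer) / len(table_wer), 1)
print(mean_cer, mean_wer)
```

The two means come out at 15.8 and 52.3, matching the last row of the table, which suggests that each language contributes equally to the average regardless of its test set size.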
+## Reproducibility
+We provide a notebook to reproduce the ASR experiments described in our paper: see `SB_ASR_FLEURS_finetuning.ipynb`.
+Use the `ASR_FLEURS-swahili_hf.yaml` config file to run the recipe on Swahili.
+## References
+[1] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021. doi: 10.1109/TASLP.2021.3122291.
+[2] Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, and Ankur Bapna. FLEURS: Few-Shot Learning Evaluation of Universal Representations of Speech. In 2022 IEEE Spoken Language Technology Workshop (SLT), pp. 798–805, 2022. doi: 10.1109/SLT54892.2023.10023141.