Commit 74e9683 by Beijuka · verified · 1 Parent(s): f3b2a82

Upload benchmark.md

Files changed (1)
  1. benchmark.md +25 -0
benchmark.md ADDED
@@ -0,0 +1,25 @@
+ # ASR Africa Benchmark Dataset
+ The main objective of this study is to build an evidence base for how much speech data is needed to train a good automatic speech recognition (ASR) model across priority “low-resource” African languages and key domains. This was done by developing ASR models for these languages, evaluating their performance, and building benchmark speech corpora. The languages under study are Luganda, Kinyarwanda, Lingala, Swahili, Amharic, Oromo, Yoruba, Hausa, Igbo, Wolof, Fula, Ewe, Zulu, Xhosa, Afrikaans, Bemba, and Shona.
+
+ # Benchmark Dataset Characteristics
+
+ | Dataset | Domain | Speech Type | Languages | License | URL | Version | Date of Publication |
+ | ------- | ------ | ----------- | --------- | ------- | --- | ------- | ------------------- |
+ | Common Voice | Generic | Read | Swahili, Luganda, Kinyarwanda | MPL-2.0 | [link](https://commonvoice.mozilla.org/en/datasets) | V19 | 09/18/2024 |
+ | FLEURS | Generic | Read | Wolof, Swahili, Luganda, Lingala | CC-BY-4.0 | [link](https://huggingface.co/datasets/google/fleurs) | V0 | 05/25/2022 |
+ | Naija Voices | Generic | Read | Igbo, Yoruba, Hausa | CC-BY-NC-SA-4.0 | [link](https://huggingface.co/datasets/naijavoices/naijavoices-dataset) | V0 | 05/06/2024 |
+ | BIG-C | Generic | Conversational | Bemba | CC-BY-NC-ND-4.0 | [link](https://github.com/csikasote/bigc) | V0 | 05/26/2023 |
+ | NCHLT | Generic | Read | Zulu, Xhosa, Afrikaans | CC-BY-3.0 | [link](https://repo.sadilar.org/handle/20.500.12185/280) | V0 | 02/06/2018 |
+ | ALFFA | Generic | Read | Swahili, Wolof | MIT | [link](https://www.openslr.org/25/) | V0 | 04/14/2015 |
+ | GRIOTS | Generic | Conversational | Bambara | CC-BY-4.0 | [link](https://zenodo.org/records/6997806) | V2.0 | 07/11/2023 |
+ | AfriVoice | Generic | Spontaneous | Lingala, Shona | CC-BY-4.0 | [link](https://huggingface.co/datasets/DigitalUmuganda/AfriVoice) | V1.1.0 | 03/26/2024 |
+ | Kallaama | Agriculture | Spontaneous | Wolof | CC-BY-4.0 | [link](https://github.com/gauthelo/kallaama-speech-dataset) | V0 | 03/29/2024 |
+ | Yogera | Generic | Descriptive | Luganda | CC-BY-SA-4.0 | [link](https://github.com/AI-Lab-Makerere/Yogera-Dataset-Metadata) | V4.0.1 | 08/13/2024 |
+ | Ashesi Financial | Finance | Spontaneous | Akan | CC-BY-4.0 | [link](https://github.com/Ashesi-Org/Financial-Inclusion-Speech-Dataset) | V0 | 06/24/2024 |
+ | Lingala Read Speech Corpus | Generic | Read | Lingala | CC-BY-4.0 | [link](https://prod-dcd-datasets-cache-zipfiles.s3.eu-west-1.amazonaws.com/28x8tc9n9k-1.zip) | V1 | 09/22/2023 |
+ | Amharic ASR Dataset | Generic | Mixed | Amharic | CC-BY-4.0 | [link](https://figshare.com/articles/dataset/Yohannes_A_Ejigu_Amharic_ASR_Dataset_zip/24959727) | V0 | 08/01/2024 |
+ | BembaSpeech ASR Corpus | Generic | Read | Bemba | CC-BY-NC-4.0 | [link](https://github.com/csikasote/BembaSpeech) | V0 | 06/20/2022 |
+ | AMMI | Generic | Read | Swahili, Lingala, Bemba | MIT | [link](https://github.com/besacier/AMMIcourse/tree/master/STUDENTS-RETURN/Bemba) | V0 | 07/06/2020 |
+ | Waxal dataset | Generic | Spontaneous | Akan, Ewe | CC-BY-SA-4.0 | [link](https://github.com/Waxal-Multilingual/speech-data?tab=readme-ov-file) | V1.3 | 07/27/2020 |
+ | EthioSpeech | Generic | Read | Amharic, Oromo | ELRA End User | [link](https://catalog.elra.info/en-us/repository/browse/ELRA-S0494/) | V1.0 | 03/21/2025 |
+ | Sagalee | Generic | Read | Oromo | CC-BY-NC-4.0 | [link](https://github.com/turinaf/sagalee) | V0 | 11/28/2024 |
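
Because the table lists several languages per dataset, it can be handy to invert it and ask which datasets cover a given language. The following is only a minimal sketch of that lookup, not part of the commit: it assumes the rows are available as Markdown text (three rows are copied here for illustration), and the helper name `index_by_language` is hypothetical.

```python
from collections import defaultdict

# Three rows copied from the table above; extend with the remaining rows as needed.
TABLE_ROWS = """
| Common Voice | Generic | Read | Swahili, Luganda, Kinyarwanda | MPL-2.0 | [link](https://commonvoice.mozilla.org/en/datasets) | V19 | 09/18/2024 |
| FLEURS | Generic | Read | Wolof, Swahili, Luganda, Lingala | CC-BY-4.0 | [link](https://huggingface.co/datasets/google/fleurs) | V0 | 05/25/2022 |
| ALFFA | Generic | Read | Swahili, Wolof | MIT | [link](https://www.openslr.org/25/) | V0 | 04/14/2015 |
"""


def index_by_language(rows: str) -> dict[str, list[str]]:
    """Map each language to the datasets that list it (hypothetical helper)."""
    index = defaultdict(list)
    for line in rows.strip().splitlines():
        # Drop a leading diff marker ("+") if present, then the outer pipes.
        cells = [c.strip() for c in line.lstrip("+ ").strip().strip("|").split("|")]
        if len(cells) < 4:
            continue  # not a full table row
        if set(cells[0]) <= set("-: "):
            continue  # header separator row such as "| --- | --- |"
        dataset, languages = cells[0], cells[3]
        for language in languages.split(","):
            index[language.strip()].append(dataset)
    return dict(index)


index = index_by_language(TABLE_ROWS)
print(index["Swahili"])
```

The same function accepts rows pasted straight from the diff, since a leading `+` marker is stripped before the cells are split.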