Found Better Dataset for testing

by kingabzpro - opened 4 days ago

I have identified two issues with https://huggingface.co/datasets/urdu-asr/csalt-voice. First, there are multiple speakers, and they sometimes overlap. Second, the audio clips are quite long, which makes it difficult to evaluate the small model accurately. For these reasons, I am switching to a better dataset: https://huggingface.co/datasets/HowMannyMore/urdu-audiodataset. What do you think?

I am running the testing script now; let's see if it improves the results.

sharjeel103

Owner 4 days ago

It appears that the dataset is simply a duplicate of the Urdu subset of the Mozilla Common Voice dataset.

It has been sourced from Mozilla's Common Voice, a publicly available voice dataset that relies on the contributions of volunteers from various parts of the world.
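
One rough way to check the duplicate claim is to measure how many transcripts in the new dataset also appear in the Common Voice Urdu subset. The sketch below is only an illustration: `normalize` and `overlap_ratio` are hypothetical helper names, the sample strings are placeholders rather than rows from either dataset, and it assumes the transcripts have already been loaded into plain Python lists.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply NFC normalization and collapse whitespace so trivial
    formatting differences do not hide an otherwise identical transcript."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def overlap_ratio(candidate, reference):
    """Return the fraction of candidate transcripts that also occur
    in the reference collection (after normalization)."""
    reference_set = {normalize(t) for t in reference}
    normalized = [normalize(t) for t in candidate]
    if not normalized:
        return 0.0
    matches = sum(t in reference_set for t in normalized)
    return matches / len(normalized)


# Placeholder transcripts, NOT real data from either dataset.
candidate = ["sentence one", "sentence  two ", "sentence three"]
reference = ["sentence two", "sentence one", "another sentence"]

print(overlap_ratio(candidate, reference))
```

A ratio close to 1.0 on the real transcripts would support the duplicate claim. A low ratio would not rule it out, since the texts could have been edited or only partially copied.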