You have overfit your model on CV 17

#1
by kingabzpro - opened

I have been testing your model, and on the Common Voice 17 test subset, I obtained the following results:

  • WER: 0.209%
  • CER: 0.068%
  • BLEU: 99.483%
  • ChrF: 99.817

You can check the notebook here: https://www.kaggle.com/code/kingabzpro/testing-urdu-whisper-tiny.
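
For anyone who wants to reproduce these numbers, here is a minimal sketch of the metric computation using the Hugging Face `evaluate` library; the `predictions` and `references` lists are placeholders, and the full pipeline is in the notebook above:

```python
# Minimal metric-computation sketch; `predictions` and `references`
# are placeholder lists of transcript strings.
# Requires: pip install evaluate jiwer sacrebleu
import evaluate

predictions = ["..."]  # model transcripts (placeholder)
references = ["..."]   # ground-truth transcripts (placeholder)

wer = evaluate.load("wer").compute(predictions=predictions, references=references)
cer = evaluate.load("cer").compute(predictions=predictions, references=references)
# sacrebleu and chrf expect a list of reference variants per prediction
bleu = evaluate.load("sacrebleu").compute(
    predictions=predictions, references=[[r] for r in references]
)
chrf = evaluate.load("chrf").compute(
    predictions=predictions, references=[[r] for r in references]
)

print(f"WER:  {100 * wer:.3f}%")
print(f"CER:  {100 * cer:.3f}%")
print(f"BLEU: {bleu['score']:.3f}%")
print(f"ChrF: {chrf['score']:.3f}")
```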

With a WER of 0.2%, this would be the best ASR model in the world; the best Urdu ASR models sit at a WER of around 25%. I was quite surprised, so I decided to test it with my own Urdu audio, but it failed miserably.

This is your model generation:

یہ رائیڈ سمیرا سی یا، کانشن سی یا، زمیر ہے یونیوں سی طوریوں کی انترک کے لنکے جواقیاتوں گے دوازار پندرہا میں گیاروں سوارات مارا دنیاا بعد ملک پاکستان تھا اب وہ پانچ شہصوں کے ایرامل دوگئی ایشائد مرتھ گئے۔

This is the original transcription:

یہ ویمنز موومنٹس سڑکوں پہ جو باہر آتی ہیں، یہ فاحشہ ہیں اور یہ مغرب کا کلچر پروموٹ کرنے آئیں ہوئی ہیں۔
تو پھر آپ کے اندر ایک فیصد بھی ڈیسنسی سینسِٹیوٹی، حساسیت یا ضمیر نہیں ہے۔
تو اس طرح کے آنر کلنگ کے واقعات ہوں گے۔
دو ہزار پندرہ میں قتل کرنے والا دنیا کا پہلا ملک پاکستان تھا۔
اب وہ تقریباً پانچ سو کے لگ بھگ ہو گئی ہیں—شاید مر گئے۔

made changes to the formatting

I used the complete Common Voice dataset, including the test and validation sets, for training. As a result, evaluation on these sets does not provide an accurate reflection of the model's true performance.
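
For a future run, this is a minimal sketch of how the splits could be kept separate so that test-set metrics stay meaningful (assuming the `ur` config of the gated `mozilla-foundation/common_voice_17_0` dataset on the Hub):

```python
# Sketch: load Common Voice 17 Urdu with the test split held out.
# The dataset is gated, so a Hub login is needed before loading.
from datasets import load_dataset, concatenate_datasets

train = load_dataset("mozilla-foundation/common_voice_17_0", "ur", split="train")
valid = load_dataset("mozilla-foundation/common_voice_17_0", "ur", split="validation")
test = load_dataset("mozilla-foundation/common_voice_17_0", "ur", split="test")

# Merging train and validation for fine-tuning is fine;
# `test` must stay untouched until the final evaluation.
train_data = concatenate_datasets([train, valid])
```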

That said, this is still the smallest available Whisper ASR model, and it delivers reasonably good results, especially when the audio is clean and clear. Given its size-to-performance ratio, it performs remarkably well.

Interestingly, more than half of its parameters are in the embedding layer. This means that, effectively, fewer than 18 million parameters are responsible for transcribing a text-rich language like Urdu.
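
This breakdown is easy to verify with a quick sketch, using the base `openai/whisper-tiny` checkpoint as a stand-in for this fine-tune (the decoder token embedding is weight-tied with the output projection, so it is counted once):

```python
# Sketch: split Whisper-tiny's parameter count into embedding
# and non-embedding parts.
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

total = sum(p.numel() for p in model.parameters())
# Decoder token embedding (tied with the output projection).
embed = model.model.decoder.embed_tokens.weight.numel()

print(f"total:         {total / 1e6:.1f}M")
print(f"embedding:     {embed / 1e6:.1f}M")
print(f"non-embedding: {(total - embed) / 1e6:.1f}M")
```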

It is not the best model overall, but for its intended purpose, it is good.

@sharjeel103 I wanted to clarify that I'm not pointing out issues for the sake of criticism. However, the metrics shown in your README are incorrect: people might assume you have achieved the best results, when in reality the model falls short. I know the model is small, but could you please correct the reported Word Error Rate (WER) and Character Error Rate (CER)? It would also be helpful to run your model on another dataset to give a better picture of its real performance.

How can you tell whether it is performing well when it is trained and tested on the same dataset? You need to evaluate it on an unseen dataset to report meaningful performance metrics.

I’m currently busy with another project. If you have a different dataset to test it on, feel free to contribute your evaluation and open a pull request to this model.
In the meantime, I’m removing the current metrics for now.

Yes, I will. Thank you for understanding.

kingabzpro changed discussion status to closed

These are the results on the https://huggingface.co/datasets/urdu-asr/csalt-voice dataset:

  • WER: 64.961%
  • CER: 42.488%
  • BLEU: 16.710%
  • ChrF: 43.545
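
For completeness, here is a minimal sketch of how such an out-of-domain evaluation can be run; the model id is a placeholder, and the `test` split plus the `audio`/`sentence` column names are assumptions about the csalt-voice dataset, so adjust them to the actual repo:

```python
# Sketch of the out-of-domain evaluation. MODEL_ID is hypothetical,
# and the split/column names are assumptions about csalt-voice.
import evaluate
from datasets import load_dataset
from transformers import pipeline

MODEL_ID = "your-username/whisper-tiny-urdu"  # placeholder; use the real repo id

asr = pipeline("automatic-speech-recognition", model=MODEL_ID)
ds = load_dataset("urdu-asr/csalt-voice", split="test")

preds, refs = [], []
for sample in ds:
    audio = sample["audio"]
    out = asr({"raw": audio["array"], "sampling_rate": audio["sampling_rate"]})
    preds.append(out["text"])
    refs.append(sample["sentence"])

wer = evaluate.load("wer").compute(predictions=preds, references=refs)
cer = evaluate.load("cer").compute(predictions=preds, references=refs)
print(f"WER: {100 * wer:.3f}%  CER: {100 * cer:.3f}%")
```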
