---
library_name: transformers
license: mit
metrics:
- name: wer
  type: wer
  value: 17.26
base_model:
- mesolitica/wav2vec2-xls-r-300m-mixed
pipeline_tag: automatic-speech-recognition
tags:
- wav2vec2
- asr
- automatic-speech-recognition
- malay
- english
- speech
---
|
|
|
# Model Card for Malay-English Fine-Tuned ASR Model
|
|
|
This model was fine-tuned for 10 epochs on approximately 50 hours of manually curated Malay-English code-switched audio. It achieves a Word Error Rate (WER) of 17.26% on a held-out evaluation set, down from 34.29% for the base model.
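To measure WER on your own held-out data, the `evaluate` library's `wer` metric can be used. A minimal sketch follows; the reference and prediction strings are invented for illustration only:

```python
import evaluate  # also requires: pip install jiwer

wer_metric = evaluate.load("wer")

# Hypothetical reference transcripts and model outputs, for illustration only
references = ["saya nak pergi ke pejabat hari ini", "can you tolong saya sekejap"]
predictions = ["saya nak pergi pejabat hari ini", "can you tolong saya sekejap"]

# WER = (substitutions + insertions + deletions) / number of reference words
print(wer_metric.compute(references=references, predictions=predictions))
```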
|
|
|
## Model Details
|
|
|
### Model Description
|
|
|
This model was fine-tuned from `mesolitica/wav2vec2-xls-r-300m-mixed` on a custom Malay-English dataset. It is designed to transcribe speech that mixes Malay and English, especially in informal or conversational contexts where code-switching is common.
|
|
|
- **Developed by:** mysterio
- **Model type:** CTC-based automatic speech recognition
- **Languages:** Malay, English
- **License:** MIT
- **Fine-tuned from:** [mesolitica/wav2vec2-xls-r-300m-mixed](https://huggingface.co/mesolitica/wav2vec2-xls-r-300m-mixed)
|
|
|
### Model Sources
|
|
|
- **Base Model:** [mesolitica/wav2vec2-xls-r-300m-mixed](https://huggingface.co/mesolitica/wav2vec2-xls-r-300m-mixed)
|
|
|
## Uses
|
|
|
### Direct Use
|
|
|
This model can be used to transcribe conversational Malay-English audio recordings, especially in domains such as the following (a long-form transcription sketch appears after this list):

- Broadcast interviews
- YouTube vlogs
- Podcasts
- Community recordings
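Recordings in these domains are often long-form. For CTC models, the `transformers` ASR pipeline supports chunked inference with overlapping strides; a minimal sketch follows, where the chunk and stride lengths are illustrative rather than tuned values:

```python
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="langminer/wav2vec2-custom-asr")

# Chunked inference for long recordings: 30 s windows with 5 s of overlapping
# context on each side, stitched back together by the pipeline.
result = asr("long_interview.wav", chunk_length_s=30, stride_length_s=5)
print(result["text"])
```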
|
|
|
### Downstream Use
|
|
|
The model can be fine-tuned further or used as part of downstream applications such as:

- Real-time transcription services
- Voice assistants tailored for Malaysian users
- Speech-driven translation systems
|
|
|
### Out-of-Scope Use
|
|
|
- High-stakes transcription scenarios (e.g., legal or medical contexts) where exact word accuracy is critical
- Languages other than Malay and English
- Noisy or far-field audio environments (unless fine-tuned further)
|
|
|
## Bias, Risks, and Limitations
|
|
|
### Known Limitations
|
|
|
- May underperform on accents or dialects that are not well represented in the training data
- Produces no punctuation, and casing may be inconsistent (the model is CTC-based and predicts characters only)
- Limited robustness to background noise or overlapping speakers
|
|
|
### Recommendations
|
|
|
- Always verify outputs for critical tasks
- Pair with punctuation restoration or speaker diarization for production-grade use
- Retrain with domain-specific data for higher accuracy (see the fine-tuning sketch after this list)
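A minimal sketch of continued fine-tuning with the `transformers` `Trainer` follows. It assumes a dataset with `audio` and `text` columns; the dataset name, output path, and hyperparameters are illustrative placeholders, not the recipe used for this checkpoint:

```python
from dataclasses import dataclass

import torch
from datasets import Audio, load_dataset
from transformers import Trainer, TrainingArguments, Wav2Vec2ForCTC, Wav2Vec2Processor

model_id = "langminer/wav2vec2-custom-asr"
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# Hypothetical dataset; replace with your own domain-specific data
ds = load_dataset("your-username/your-malay-english-dataset", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

def prepare(batch):
    audio = batch["audio"]
    batch["input_values"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_values[0]
    batch["labels"] = processor.tokenizer(batch["text"]).input_ids
    return batch

ds = ds.map(prepare, remove_columns=ds.column_names)

@dataclass
class CTCCollator:
    processor: Wav2Vec2Processor

    def __call__(self, features):
        inputs = [{"input_values": f["input_values"]} for f in features]
        labels = [{"input_ids": f["labels"]} for f in features]
        batch = self.processor.feature_extractor.pad(inputs, padding=True, return_tensors="pt")
        labels_batch = self.processor.tokenizer.pad(labels, padding=True, return_tensors="pt")
        # Replace label padding with -100 so it is ignored by the CTC loss
        batch["labels"] = labels_batch["input_ids"].masked_fill(
            labels_batch["attention_mask"].ne(1), -100
        )
        return batch

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="wav2vec2-domain-finetuned",
        per_device_train_batch_size=8,
        num_train_epochs=5,
        learning_rate=1e-5,
        fp16=torch.cuda.is_available(),
    ),
    train_dataset=ds,
    data_collator=CTCCollator(processor),
)
trainer.train()
```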
|
|
|
## How to Get Started with the Model
|
|
|
```python
from transformers import pipeline

# Load the fine-tuned checkpoint from the Hugging Face Hub
asr = pipeline("automatic-speech-recognition", model="langminer/wav2vec2-custom-asr")

# The pipeline decodes the file and resamples it to 16 kHz before inference
transcription = asr("your_audio_file.wav")
print(transcription["text"])
```
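
If you prefer to work below the pipeline abstraction, the model can also be called directly through the processor with greedy CTC decoding. A minimal sketch follows; it uses `torchaudio` for loading and resampling, which is a convenience choice rather than a requirement:

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_id = "langminer/wav2vec2-custom-asr"
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)
model.eval()

# Load audio and resample to the 16 kHz rate wav2vec 2.0 expects
waveform, sr = torchaudio.load("your_audio_file.wav")
if sr != 16_000:
    waveform = torchaudio.functional.resample(waveform, sr, 16_000)

inputs = processor(waveform.squeeze(0).numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding: take the argmax per frame, then collapse repeats/blanks
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])
```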