---
library_name: transformers
license: mit
metrics:
- name: wer
  type: wer
  value: 17.26
base_model:
- mesolitica/wav2vec2-xls-r-300m-mixed
pipeline_tag: automatic-speech-recognition
tags:
- wav2vec2
- asr
- automatic-speech-recognition
- malay
- english
- speech
---
# Model Card for Malay-English Fine-Tuned ASR Model
This model was fine-tuned for 10 epochs on approximately 50 hours of manually curated Malay-English code-switched audio data. It achieves a Word Error Rate (WER) of 17.26% on a held-out evaluation set, compared to 34.29% for the base model.
## Model Details
### Model Description
This is a version of `mesolitica/wav2vec2-xls-r-300m-mixed` fine-tuned on a custom Malay-English dataset. It is designed to transcribe speech that mixes Malay and English, especially in informal or conversational contexts where code-switching is common.
- **Developed by:** mysterio
- **Model type:** CTC-based automatic speech recognition
- **Languages:** Malay, English
- **License:** MIT
- **Fine-tuned from:** [mesolitica/wav2vec2-xls-r-300m-mixed](https://huggingface.co/mesolitica/wav2vec2-xls-r-300m-mixed)
### Model Sources
- **Base Model:** [mesolitica/wav2vec2-xls-r-300m-mixed](https://huggingface.co/mesolitica/wav2vec2-xls-r-300m-mixed)
## Uses
### Direct Use
This model can be used to transcribe conversational Malay-English audio recordings, especially in domains such as:
- Broadcast interviews
- YouTube vlogs
- Podcasts
- Community recordings
### Downstream Use
The model can be fine-tuned further (a minimal loading sketch follows this list) or used as part of downstream applications such as:
- Real-time transcription services
- Voice assistants tailored for Malaysian users
- Speech-driven translation systems
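
If you want to continue fine-tuning from this checkpoint, a minimal loading sketch is shown below. Dataset preparation and the training loop are omitted; freezing the feature encoder is a common convention for small domain-specific datasets, not something this repository prescribes.

```python
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load this checkpoint and its processor as the starting point.
processor = Wav2Vec2Processor.from_pretrained("langminer/wav2vec2-custom-asr")
model = Wav2Vec2ForCTC.from_pretrained("langminer/wav2vec2-custom-asr")

# Freezing the convolutional feature encoder is a common choice when
# continuing CTC fine-tuning on a small domain-specific dataset.
model.freeze_feature_encoder()
```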
### Out-of-Scope Use
- High-stakes transcription scenarios (e.g., legal or medical contexts) where exact word accuracy is critical
- Non-Malay, non-English languages
- Noisy or far-field audio environments (unless fine-tuned further)
## Bias, Risks, and Limitations
### Known Limitations
- May underperform on accents or dialects not well-represented in training data
- Inconsistent casing or punctuation handling (model is CTC-based)
- Limited robustness to background noise or overlapping speakers
### Recommendations
- Always verify outputs for critical tasks (a quick WER check is sketched after this list)
- Pair with punctuation restoration or diarization for production-grade use
- Retrain with domain-specific data for higher accuracy
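
To quantify accuracy on your own domain before relying on the transcripts, you can compute WER over a small labelled sample. Below is a minimal sketch using the `jiwer` package; the file names and reference transcripts are placeholders.

```python
from transformers import pipeline
import jiwer

# Placeholder held-out clips and their human reference transcripts.
audio_files = ["clip_01.wav", "clip_02.wav"]
references = ["contoh transkrip pertama", "second example transcript"]

asr = pipeline("automatic-speech-recognition", model="langminer/wav2vec2-custom-asr")
hypotheses = [asr(f)["text"] for f in audio_files]

# Word Error Rate over the sample; compare against the 17.26% reported above.
print(f"WER: {jiwer.wer(references, hypotheses):.2%}")
```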
## How to Get Started with the Model
```python
from transformers import pipeline

# The ASR pipeline handles loading, resampling, and decoding in one call.
asr = pipeline("automatic-speech-recognition", model="langminer/wav2vec2-custom-asr")

# The pipeline returns a dict; the transcript is under the "text" key.
transcription = asr("your_audio_file.wav")
print(transcription["text"])
```
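
The pipeline resamples file input automatically, but wav2vec2 models expect 16 kHz mono audio, so if you load waveforms yourself, resample first. Below is a sketch of lower-level inference with greedy CTC decoding; the file name is a placeholder.

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("langminer/wav2vec2-custom-asr")
model = Wav2Vec2ForCTC.from_pretrained("langminer/wav2vec2-custom-asr")

# Load the audio, resample to 16 kHz, and downmix to mono.
waveform, sample_rate = torchaudio.load("your_audio_file.wav")
if sample_rate != 16_000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)
speech = waveform.mean(dim=0).numpy()

inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding: pick the most likely token per frame, then let the
# tokenizer collapse repeats and remove blanks.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```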