---
library_name: transformers
license: mit
metrics:
- name: wer
  type: wer
  value: 17.26
base_model:
- mesolitica/wav2vec2-xls-r-300m-mixed
pipeline_tag: automatic-speech-recognition
tags:
- wav2vec2
- asr
- automatic-speech-recognition
- malay
- english
- speech
---

# Model Card for Malay-English Fine-Tuned ASR Model

This model was fine-tuned for 10 epochs on approximately 50 hours of manually curated Malay-English code-switched audio. It achieves a Word Error Rate (WER) of 17.26% on a held-out evaluation set, compared with 34.29% for the base model.

## Model Details

### Model Description

This is a fine-tuned version of the `mesolitica/wav2vec2-xls-r-300m-mixed` model on a custom Malay-English dataset. It is designed to transcribe speech that mixes Malay and English, especially in informal or conversational contexts where code-switching is common.

- **Developed by:** mysterio
- **Model type:** CTC-based automatic speech recognition
- **Languages:** Malay, English
- **License:** MIT
- **Fine-tuned from:** [mesolitica/wav2vec2-xls-r-300m-mixed](https://huggingface.co/mesolitica/wav2vec2-xls-r-300m-mixed)

### Model Sources

- **Base Model:** [mesolitica/wav2vec2-xls-r-300m-mixed](https://huggingface.co/mesolitica/wav2vec2-xls-r-300m-mixed)

## Uses

### Direct Use

This model can be used to transcribe conversational Malay-English audio recordings, especially in domains such as:

- Broadcast interviews
- YouTube vlogs
- Podcasts
- Community recordings

### Downstream Use

The model can be fine-tuned further or used as part of downstream applications such as:

- Real-time transcription services
- Voice assistants tailored for Malaysian users
- Speech-driven translation systems

### Out-of-Scope Use

- High-stakes transcription scenarios (e.g., legal or medical contexts) where exact word accuracy is critical
- Non-Malay, non-English languages
- Noisy or far-field audio environments (unless fine-tuned further)

## Bias, Risks, and Limitations

### Known Limitations

- May underperform on accents or dialects not well represented in the training data
- Inconsistent casing and no punctuation (the model is CTC-based)
- Limited robustness to background noise or overlapping speakers

### Recommendations

- Always verify outputs for critical tasks
- Pair with punctuation restoration or speaker diarization for production-grade use
- Retrain with domain-specific data for higher accuracy

## How to Get Started with the Model

```python
from transformers import pipeline

# Load the fine-tuned checkpoint through the ASR pipeline.
asr = pipeline("automatic-speech-recognition", model="langminer/wav2vec2-custom-asr")

# Transcribe an audio file; the pipeline decodes and resamples the input as needed.
transcription = asr("your_audio_file.wav")
print(transcription["text"])
```

For recordings longer than a few tens of seconds, passing `chunk_length_s` to the pipeline (e.g. `pipeline(..., chunk_length_s=30)`) lets it transcribe in overlapping windows.
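If you want more control over preprocessing and decoding than the pipeline offers, you can load the model directly. The sketch below shows minimal greedy CTC decoding; the checkpoint id is taken from the snippet above, while the audio path is a placeholder, and the resampling step assumes your audio may not already be at the 16 kHz rate wav2vec2 models expect.

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load the fine-tuned model and its processor (feature extractor + tokenizer).
model_id = "langminer/wav2vec2-custom-asr"
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)
model.eval()

# Load the audio and resample to 16 kHz if needed ("your_audio_file.wav" is a placeholder).
waveform, sample_rate = torchaudio.load("your_audio_file.wav")
if sample_rate != 16_000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

# Mix down to mono and run a forward pass.
inputs = processor(waveform.mean(dim=0).numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding: take the most likely token at each frame,
# then let the tokenizer collapse repeats and remove blanks.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```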
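As recommended above, verify the model on your own domain before relying on it. One way is to measure WER on a small labelled set; the sketch below uses the `jiwer` package, and the clip names and reference transcripts are hypothetical placeholders.

```python
from jiwer import wer
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="langminer/wav2vec2-custom-asr")

# Placeholder evaluation clips mapped to their reference transcripts.
eval_set = {
    "clip1.wav": "saya nak pergi ke kedai to buy some groceries",
    "clip2.wav": "meeting tu dah postpone ke esok pagi",
}

references = list(eval_set.values())
hypotheses = [asr(path)["text"].lower() for path in eval_set]

# jiwer reports WER as a fraction; 0.1726 would match the 17.26% quoted above.
print(f"WER: {wer(references, hypotheses):.2%}")
```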