---
license: other
datasets:
- MERaLiON/Multitask-National-Speech-Corpus-v1
language:
- en
- zh
- ms
- ta
- id
- th
- vi
metrics:
- wer
- bleu
base_model:
- openai/whisper-large-v3
- google/gemma-2-9b-it
library_name: transformers
tags:
- meralion
- meralion-2
---
# 🔥 MERaLiON-2 🔥

🚀 [MERaLiON-2-10B](https://huggingface.co/MERaLiON/MERaLiON-2-10B) |
🚀 [MERaLiON-2-10B-ASR](https://huggingface.co/MERaLiON/MERaLiON-2-10B-ASR) |
🚀 [MERaLiON-2-3B](https://huggingface.co/MERaLiON/MERaLiON-2-3B)

💻 [Web Demo](https://meralion.org/demo/) |
⚙️ vLLM
## Introduction
We are pleased to announce the release of **MERaLiON-2**, the latest addition to the MERaLiON family of speech-text large language models. Our flagship model, [**MERaLiON-2-10B**](https://huggingface.co/MERaLiON/MERaLiON-2-10B), demonstrates competitive performance across benchmark evaluations in tasks such as multilingual automatic speech recognition (ASR), speech translation (ST), audio scene understanding, emotion recognition, and general speech comprehension. These results are comparable to those achieved by other state-of-the-art open-source AudioLLMs, including Qwen2.5-Omni-7B and Phi-4-multimodal-instruct.
MERaLiON-2-10B is specifically designed to follow complex instructions with a nuanced understanding of **Singapore’s multilingual and multicultural context**. It integrates a localized Whisper-large-v3 speech encoder and Gemma-2-9b text decoder. The following graph presents task-specific evaluation scores, assessed using the **LLM-as-a-Judge** framework across multiple datasets. For the speech translation task, performance is measured using the BLEU metric, where higher scores indicate better translation quality.
In addition, we introduce an ASR-optimized variant, [**MERaLiON-2-10B-ASR**](https://huggingface.co/MERaLiON/MERaLiON-2-10B-ASR), which delivers a **5–30%** performance improvement over OpenAI’s `whisper-large-v3` on speech recognition tasks. This enhancement spans Singapore’s 4 official languages—**English**, **Mandarin**, **Malay**, and **Tamil**—as well as 3 South-East Asian languages: **Indonesian**, **Thai**, and **Vietnamese**. The model also demonstrates robust handling of **code-switching scenarios** and local colloquialisms, reflecting its adaptability to Singapore’s diverse linguistic landscape.
The following visualization illustrates the **1 - Word Error Rate (WER)** metric across these seven languages, comparing MERaLiON-2-10B-ASR with other leading models. A higher value indicates better transcription accuracy.
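As a concrete illustration of this metric, the snippet below computes WER and `1 - WER` with the open-source `jiwer` package; `jiwer` is not part of the MERaLiON stack, and the reference and hypothesis strings are made-up examples rather than benchmark data.

```python
# Illustrative only: compute 1 - WER with the open-source `jiwer` package (pip install jiwer).
# The reference and hypothesis strings are made-up examples, not benchmark data.
import jiwer

reference = "please transcribe the speech into text"
hypothesis = "please transcribe speech into text"

wer = jiwer.wer(reference, hypothesis)  # fraction of word-level edits needed
print(f"WER: {wer:.3f}, 1 - WER: {1 - wer:.3f}")
```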
We also provide [MERaLiON-2-3B](https://huggingface.co/MERaLiON/MERaLiON-2-3B), which balances performance with reduced computational requirements, enabling broader accessibility and lightweight deployment. Key enhancements in this release include:
- **Extended Audio Length**: Supports audio inputs of up to 300 seconds (5 minutes) for audio and speech question answering tasks; **for speech transcription (ASR) and speech translation (ST), keep inputs to 30 seconds for satisfactory performance**.
- **Expanded Language Coverage**: In addition to English, Chinese, and Singlish, MERaLiON-2 introduces support for Malay, Tamil, and other Southeast Asian languages, including Indonesian, Thai, and Vietnamese.
- **Improved Performance**: Achieves higher performance across a wide range of tasks. See the [Evaluation](#performance) section for detailed benchmarks.
- **Higher Quality Training Data**: Trained on 120,000 hours of curated speech and audio data, filtered for quality and diversity, with an emphasis on local and multilingual audio sources.
- **Three Model Variants**: Available in general-purpose ([MERaLiON-2-10B](https://huggingface.co/MERaLiON/MERaLiON-2-10B)), ASR-optimized ([MERaLiON-2-10B-ASR](https://huggingface.co/MERaLiON/MERaLiON-2-10B-ASR)) and light-weight ([MERaLiON-2-3B](https://huggingface.co/MERaLiON/MERaLiON-2-3B)) configurations to balance latency, compute efficiency, and task performance across different deployment needs.
## Model Description:
MERaLiON stands for **M**ultimodal **E**mpathetic **R**easoning **a**nd **L**earning **i**n **O**ne **N**etwork.
MERaLiON-2 is a family of Speech-Text Large Language Models tailored for **Singapore’s multilingual and multicultural landscape**, as well as the wider **Southeast Asian region**.
The 10B model integrates a localized [Whisper-Large-V3](https://huggingface.co/openai/whisper-large-v3) speech encoder with the [Gemma2-9b-IT](https://huggingface.co/google/gemma-2-9b-it) text decoder.
The 3B model integrates a localized [Whisper-Large-V3](https://huggingface.co/openai/whisper-large-v3) speech encoder with the [Gemma2-2b-IT](https://huggingface.co/google/gemma-2-2b-it) text decoder.
MERaLiON-2-10B is finetuned on **120,000 hours of speech and audio data** across **6 diverse tasks**: Automatic Speech Recognition (ASR), Spoken Question Answering (SQA), Spoken Dialogue Summarization (SDS), Audio Captioning (AC), Audio-Scene Question Answering (ASQA) and Paralinguistic Question Answering (PQA).
The model supports long-form audio inputs of up to 300 seconds (5 minutes) and is specifically adapted to handle the linguistic nuances, accents, and dialects commonly found across Singapore and neighboring countries.
- **Developed by:** I2R, A\*STAR, Singapore
- **Model type:** Multimodal LLM
- **Language(s):** Primarily English (global and Singapore) and Chinese, with audio support for regional languages including Malay, Tamil, Indonesian, Thai, and Vietnamese.
- **Audio:** **Mono**-channel audio, **16,000 Hz** sampling rate, up to **300** seconds.
- **License:** [MERaLiON Public License](MERaLiON-Public-Licence-v2.pdf)
- **Demo:** [MERaLiON-AudioLLM Web Demo](https://meralion.org/demo/)
**MERaLiON-2** is an upgraded version of [MERaLiON-AudioLLM](https://huggingface.co/MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION).
## Performance:
We benchmark the MERaLiON-2 series with the extended [AudioBench benchmark](https://huggingface.co/spaces/MERaLiON/AudioBench-Leaderboard) against several recently released open-source multimodal models (SALMONN-7B, the Qwen2.5-Omni series, and Phi-4-Multimodal), as well as two cascade models.
**Better Automatic Speech Recognition (ASR) Accuracy**
MERaLiON-2-10B-ASR and MERaLiON-2-10B demonstrate leading performance in Singlish, Mandarin, Malay, Tamil, and other Southeast Asian languages, while maintaining competitive results in English compared to `Whisper-large-v3`. The following table shows the average transcription `Word Error Rate` (lower is better) by language for the MERaLiON family and other leading AudioLLMs. The `Private Dataset` consists of locally accented Singaporean speech with code-switching.
Please visit [AudioBench benchmark](https://huggingface.co/spaces/MERaLiON/AudioBench-Leaderboard) for dataset-level evaluation results.
| Language (WER ↓) | MERaLiON-2-10B-ASR | MERaLiON-2-10B | MERaLiON-2-3B | whisper_large_v3 | cascade-whisper_large_v3-llama_3_8b_instruct | cascade-whisper_large_v2-gemma2_9b_cpt-sea_lionv3_instruct | MERaLiON-AudioLLM-Whisper-SEA-LION | Qwen2.5-Omni-7B | SeaLLMs-Audio-7B | Qwen2.5-Omni-3B | SALMONN_7B | phi_4_multimodal_instruct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Thai | 0.096526 | 0.109365 | 0.107279 | 0.121073 | 0.120257 | 0.172105 | 0.919330 | 0.126497 | 0.117152 | 0.163150 | 1.191099 | 1.510068 |
| Tamil | 0.271279 | 0.327081 | 0.344081 | 0.441483 | 0.475225 | 0.492336 | 0.561315 | 1.024916 | 2.325402 | 1.315143 | 1.306694 | 1.876722 |
| Singlish | 0.129830 | 0.168813 | 0.180395 | 0.248945 | 0.251608 | 0.255717 | 0.143800 | 0.439071 | 0.795990 | 0.389393 | 0.441490 | 0.448863 |
| Malay | 0.194638 | 0.209074 | 0.279891 | 0.219692 | 0.311921 | 0.314378 | 0.289895 | 1.460664 | 0.765565 | 2.943750 | 1.085867 | 3.762933 |
| English | 0.078544 | 0.088259 | 0.122295 | 0.080841 | 0.081568 | 0.104830 | 0.110567 | 0.134216 | 0.197824 | 0.110353 | 0.191492 | 0.098225 |
| Indonesian | 0.121020 | 0.142813 | 0.131950 | 0.137102 | 0.135390 | 0.159476 | 0.298365 | 0.168659 | 0.220227 | 0.205216 | 1.653502 | 3.565510 |
| Mandarin | 0.103694 | 0.132025 | 0.145878 | 0.170980 | 0.196867 | 0.291733 | 0.291183 | 0.102419 | 0.309782 | 0.130429 | 0.939545 | 0.238879 |
| Vietnamese | 0.118693 | 0.134808 | 0.155110 | 0.148474 | 0.136075 | 0.164078 | 0.952040 | 0.205491 | 0.222001 | 0.186786 | 1.521174 | 1.805643 |
| Private Dataset | 0.106150 | 0.112360 | 0.147258 | 0.116630 | 0.118434 | 0.143812 | 0.130667 | 0.222770 | 0.496540 | 0.164556 | 0.273304 | 0.229450 |
**Better Instruction Following and Audio Understanding**
**MERaLiON-2-10B** exhibits substantial advancements in speech and audio understanding, as well as paralinguistic tasks. Notably, it adeptly handles complex instructions and responds with enhanced flexibility, effectively preserving the pre-trained knowledge from Gemma during the audio fine-tuning process. This capability enables MERaLiON-2-10B to provide detailed explanations regarding speech content and the speaker's emotional state. Furthermore, with appropriate prompt adjustments, the model can assume various roles, such as a voice assistant, virtual caregiver, or an integral component of sophisticated multi-agent AI systems and software solutions.
Please visit [AudioBench benchmark](https://huggingface.co/spaces/MERaLiON/AudioBench-Leaderboard) for dataset-level evaluation results.
| Task (higher is better) | MERaLiON-2-10B | MERaLiON-AudioLLM-Whisper-SEA-LION | MERaLiON-2-10B-ASR | MERaLiON-2-3B | SeaLLMs-Audio-7B | Qwen2-Audio-7B-Instruct | Qwen2.5-Omni-3B | phi_4_multimodal_instruct | cascade-whisper_large_v3-llama_3_8b_instruct | Qwen2.5-Omni-7B | cascade-whisper_large_v2-gemma2_9b_cpt-sea_lionv3_instruct | Qwen-Audio-Chat | SALMONN_7B | WavLLM_fairseq |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Speech Instruction | 70.200000 | 70.800000 | 13.400000 | 19.100000 | 66.900000 | 48.700000 | 65.000000 | 36.200000 | 66.100000 | 58.300000 | 72.900000 | 10.200000 | 12.900000 | 20.400000 |
| Emotion Recognition | 63.736268 | 48.577313 | 53.693298 | 54.040797 | 52.007576 | 49.846540 | 33.037836 | 40.677800 | 50.937578 | 31.469397 | 48.214969 | 41.671551 | 33.584869 | 50.801545 |
| Audio Scene Question Answering | 51.140374 | 52.207756 | 49.511886 | 46.141353 | 50.193739 | 47.048025 | 48.123228 | 42.217143 | 21.876943 | 45.669153 | 18.043681 | 51.618622 | 51.816958 | 33.034083 |
| Gender Recognition | 95.109423 | 97.177396 | 97.220335 | 93.810266 | 75.449392 | 95.963266 | 47.867210 | 70.718047 | 57.039409 | 48.724711 | 19.421130 | 60.349349 | 84.365092 | 60.773275 |
| Spoken QA (Singlish) | 66.550000 | 58.900000 | 61.850000 | 59.700000 | 51.350000 | 46.700000 | 60.500000 | 61.950000 | 59.350000 | 58.400000 | 53.750000 | 42.300000 | 43.200000 | 51.200000 |
| Audio Captioning | 35.604270 | 36.976419 | 34.466710 | 33.243839 | 45.089372 | 37.278810 | 39.200328 | 30.832409 | 2.915778 | 31.896243 | 3.140568 | 39.988663 | 28.880570 | 6.200867 |
| Spoken Dialogue Summarisation | 53.100000 | 53.600000 | 55.800000 | 48.550000 | 45.450000 | 36.300000 | 46.750000 | 50.750000 | 45.850000 | 43.150000 | 51.000000 | 25.250000 | 14.400000 | 39.450000 |
| Spoken QA (English) | 79.735049 | 63.711481 | 73.975834 | 68.715179 | 70.920519 | 68.888565 | 67.818546 | 75.513152 | 78.526569 | 68.415131 | 67.814538 | 66.069047 | 60.649071 | 70.595242 |
| Music Understanding | 63.942713 | 51.347936 | 60.657119 | 55.602359 | 63.689975 | 71.609099 | 59.309183 | 55.265375 | 56.697557 | 47.598989 | 50.463353 | 59.056445 | 49.705139 | 44.313395 |
| Accent Recognition | 41.815396 | 43.799799 | 47.788864 | 60.054981 | 10.143836 | 10.901397 | 0.478694 | 3.097615 | 21.398482 | 0.587293 | 25.929693 | 17.550294 | 11.577381 | 14.294613 |
| Speech Translation | 27.391115 | 27.086366 | 28.540359 | 22.130258 | 21.143215 | 10.826666 | 21.776628 | 13.827110 | 13.536272 | 20.688241 | 21.437997 | 4.973184 | 13.486003 | 9.046791 |
## How to Use
> [!WARNING]
> **Out of Scope use**: This model is not intended for use in tool calling, math, and coding tasks.
MERaLiON-2 requires `transformers` version `4.50.1`:
```
pip install transformers==4.50.1
pip install librosa
```
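Optionally, you can confirm that the pinned version is the one active in your environment:

```python
# Optional sanity check that the pinned transformers version is in use.
import transformers

print(transformers.__version__)  # expected: 4.50.1
```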
To run on GPU, MERaLiON-2 requires `flash-attn`:
```
pip install flash-attn --no-build-isolation
```
> [!TIP]
> Should you face any difficulties installing the above packages, you can use the Docker image
> `pytorch/pytorch:2.5.1-cuda12.1-cudnn9-devel` as an example of a pre-built torch & CUDA environment.
### Audio Input
- For ASR and ST tasks, we suggest keeping audio inputs to at most 30 seconds, sampled at 16,000 Hz.
- For general speech and audio understanding tasks, audio inputs of up to 300 seconds at a 16,000 Hz sampling rate are supported; see the loading sketch below.
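The sketch below shows one way to prepare such inputs with `librosa`: resample to 16 kHz, downmix to mono, and check the duration before passing the array to the processor. The file path and the trimming strategy are placeholders for illustration, not part of the model's API.

```python
# A minimal audio-preparation sketch (assumes librosa is installed; the path is a placeholder).
import librosa

audio_array, sample_rate = librosa.load(
    "/path/to/your/audio/file",  # placeholder path
    sr=16000,                    # resample to the expected 16,000 Hz
    mono=True,                   # downmix multi-channel audio to mono
)

duration_seconds = librosa.get_duration(y=audio_array, sr=sample_rate)
if duration_seconds > 30:
    # For ASR/ST, consider trimming or chunking to ~30 s segments;
    # up to 300 s is supported for general audio/speech understanding tasks.
    print(f"Audio is {duration_seconds:.1f}s long; consider splitting it for ASR/ST.")
```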
### Text Prompt
MERaLiON-2 is trained with this prompt template:
```
Instruction: <TextHere> \nFollow the text instruction based on the following audio: <SpeechHere>
```
It is generally recommended to follow this template, i.e., replace `<TextHere>` with your text instruction while leaving the `<SpeechHere>` placeholder untouched. We list a few useful example prompts here:
**Standard prompts for better accuracy**
```python
prompt_template = "Instruction: {query} \nFollow the text instruction based on the following audio: "
transcription_prompt = prompt_template.format(query="Please transcribe the speech")
translation_prompt = prompt_template.format(query="Please translate the speech into Malay")
summarization_prompt = prompt_template.format(query="Please summarize this speech")
audio_captioning_prompt_1 = prompt_template.format(query="Please describe the audio")
audio_captioning_prompt_2 = prompt_template.format(query="Please create a caption for the audio")
audio_scene_understanding_prompt = prompt_template.format(query="Is there people crying in the audio?")
speech_as_instruction_prompt = prompt_template.format(query="Please respond to the audio") # use when a speech instruction is provided in the audio clip.
emotion_recognition_prompt_1 = prompt_template.format(query="What is the emotion of the speaker")
emotion_recognition_prompt_2 = prompt_template.format(query="Describe the paralinguistics feature of the audio")
gender_recognition_prompt = prompt_template.format(query="What is the gender of the speaker")
```
**More flexible prompts for enriched responses**
```python
prompt_template = "Instruction: {query} \nFollow the text instruction based on the following audio: "
prompt_1 = prompt_template.format(query="describe the paralinguistics feature and return in json format.")
prompt_2 = prompt_template.format(query="Please summarise the content of the speech and analyse the paralinguistics features of this audio. Return in json format.")
prompt_3 = prompt_template.format(query="Please translate this speech to Singapore's 4 official languages.")
```
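Because the prompts above request JSON-formatted output, a typical follow-up step is to parse the decoded response. The helper below is a hypothetical convenience, and it assumes the model actually returned a valid JSON object, which is not guaranteed.

```python
# Hypothetical helper for the JSON-format prompts above; assumes the decoded
# response contains a valid JSON object (not guaranteed by the model).
import json

def parse_json_response(response_text: str):
    start = response_text.find("{")
    end = response_text.rfind("}")
    if start == -1 or end == -1:
        return None  # no JSON object found in the response
    try:
        return json.loads(response_text[start:end + 1])
    except json.JSONDecodeError:
        return None
```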
**AI agent prompts (beyond the default prompt template)**
```python
prompt_1 = \
"""
You are MERaLiON-AudioLLM, an empathetic AI assistant developed by A*STAR. MERaLiON stands for Multimodal Empathetic Reasoning and Learning in One Network.
You are a friendly and empathetic conversational partner, proficient in understanding human emotions, accents, and gender from paralinguistic features.
Maintain a tone that is warm, non-judgmental, and supportive while replying to the user.
User's voice: <SpeechHere>
"""
```
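One way to use such an agent-style prompt is to pass it directly as the user message content, mirroring the chat-template usage in the inference examples below; this is a sketch under that assumption and relies on the `processor` loaded in those examples.

```python
# Sketch: use the agent-style prompt as the user message, mirroring the
# inference examples below (assumes `processor` is already loaded as shown there).
conversation = [
    [{"role": "user", "content": prompt_1}],
]
chat_prompt = processor.tokenizer.apply_chat_template(
    conversation=conversation,
    tokenize=False,
    add_generation_prompt=True,
)
```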
### Huggingface CPU Inference
```python
import librosa
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
repo_id = "MERaLiON/MERaLiON-2-3B"
processor = AutoProcessor.from_pretrained(
repo_id,
trust_remote_code=True,
)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
repo_id,
use_safetensors=True,
trust_remote_code=True,
)
prompt_template = "Instruction: {query} \nFollow the text instruction based on the following audio: "
transcribe_prompt = "Please transcribe this speech."
translate_prompt = "Can you please translate this speech into written Chinese?"
# batch inference of 2 samples
conversation = [
[{"role": "user", "content": prompt_template.format(query=transcribe_prompt)}],
[{"role": "user", "content": prompt_template.format(query=translate_prompt)}],
]
chat_prompt = processor.tokenizer.apply_chat_template(
conversation=conversation,
tokenize=False,
add_generation_prompt=True
)
# Use audio at 16000hz.
audio_array, sample_rate = librosa.load("/path/to/your/audio/file", sr=16000)
audio_array = [audio_array]*2
inputs = processor(text=chat_prompt, audios=audio_array)
# adjust the `max_new_tokens` based on your use case.
outputs = model.generate(**inputs, max_new_tokens=256)
generated_ids = outputs[:, inputs['input_ids'].size(1):]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)
```
### Huggingface GPU Inference
```python
import torch
import librosa
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
repo_id = "MERaLiON/MERaLiON-2-3B"
device = "cuda"
processor = AutoProcessor.from_pretrained(
repo_id,
trust_remote_code=True,
)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
repo_id,
use_safetensors=True,
trust_remote_code=True,
attn_implementation="flash_attention_2",
torch_dtype=torch.bfloat16
).to(device)
prompt_template = "Instruction: {query} \nFollow the text instruction based on the following audio: "
transcribe_prompt = "Please transcribe this speech."
translate_prompt = "Can you please translate this speech into written Chinese?"
# batch inference of 2 samples
conversation = [
[{"role": "user", "content": prompt_template.format(query=transcribe_prompt)}],
[{"role": "user", "content": prompt_template.format(query=translate_prompt)}],
]
chat_prompt = processor.tokenizer.apply_chat_template(
conversation=conversation,
tokenize=False,
add_generation_prompt=True
)
# Use audio at 16000hz.
audio_array, sample_rate = librosa.load("/path/to/your/audio/file", sr=16000)
audio_array = [audio_array]*2
inputs = processor(text=chat_prompt, audios=audio_array)
# Move tensors to the GPU and cast float32 tensors to bfloat16 to match the model dtype.
for key, value in inputs.items():
    if isinstance(value, torch.Tensor):
        inputs[key] = value.to(device)
        if value.dtype == torch.float32:
            inputs[key] = inputs[key].to(torch.bfloat16)
# adjust the `max_new_tokens` based on your use case.
outputs = model.generate(**inputs, max_new_tokens=256)
generated_ids = outputs[:, inputs['input_ids'].size(1):]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)
```
## ⚠️ Disclaimer
The current MERaLiON-2 has not been specifically aligned for safety and may generate content that is inappropriate, offensive, or harmful. Developers and users are responsible for performing their own safety fine-tuning and implementing necessary security measures. The authors shall not be held liable for any claims, damages, or other liabilities arising from the use of the released models, weights, or code.
### Compute and Infrastructure
MERaLiON-2 was trained on the [**ASPIRE 2A+**](https://help.nscc.sg/aspire2aplus/about/) Supercomputer Cluster, provided by the [**National Supercomputing Centre (NSCC)**](https://www.nscc.sg/), Singapore. The ASPIRE 2A+ cluster provides multiple H100 nodes, each compute node equipped with 8 Nvidia H100 GPUs, 2 TB of RAM, and 30 TB of locally attached NVMe storage. These nodes are interconnected via a rail-optimised, full fat-tree topology using 400 Gb/s NDR InfiniBand cables. Additionally, the cluster incorporates a 2.5 PB SSD-based Lustre file system, linked to the H100 nodes through high-speed InfiniBand connections.
With a global batch size of 768, we trained the current release of MERaLiON-2 for around 200k steps, which took around 2 days using 16 nodes (128 H100 GPUs).
## 📚 Citation
If you find our work useful, please cite our papers:
[MERaLiON-AudioLLM: Bridging Audio and Language with Large Language Models](https://arxiv.org/abs/2412.09818)
[AudioBench: A Universal Benchmark for Audio Large Language Models](https://aclanthology.org/2025.naacl-long.218/)
[Advancing Singlish Understanding: Bridging the Gap with Datasets and Multimodal Models](https://arxiv.org/abs/2501.01034)
[MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders](https://arxiv.org/abs/2409.06635)
[MERaLiON-TextLLM: Cross-Lingual Understanding of Large Language Models in Chinese, Indonesian, Malay, and Singlish](https://arxiv.org/abs/2501.08335)
```
@misc{he2024meralionaudiollmtechnicalreport,
title={MERaLiON-AudioLLM: Bridging Audio and Language with Large Language Models},
author={{MERaLiON Team}},
year={2024},
eprint={2412.09818},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2412.09818},
}
```
```
@article{wang2024audiobench,
title={AudioBench: A Universal Benchmark for Audio Large Language Models},
author={Wang, Bin and Zou, Xunlong and Lin, Geyu and Sun, Shuo and Liu, Zhuohan and Zhang, Wenyu and Liu, Zhengyuan and Aw, AiTi and Chen, Nancy F},
journal={NAACL},
year={2025}
}
```
```
@article{wang2025advancing,
title={Advancing Singlish Understanding: Bridging the Gap with Datasets and Multimodal Models},
author={Wang, Bin and Zou, Xunlong and Sun, Shuo and Zhang, Wenyu and He, Yingxu and Liu, Zhuohan and Wei, Chengwei and Chen, Nancy F and Aw, AiTi},
journal={arXiv preprint arXiv:2501.01034},
year={2025}
}
```
```
@article{zhang2024mowe,
title={MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders},
author={Zhang, Wenyu and Sun, Shuo and Wang, Bin and Zou, Xunlong and Liu, Zhuohan and He, Yingxu and Lin, Geyu and Chen, Nancy F and Aw, Ai Ti},
journal={ICASSP},
year={2025}
}
```
```
@misc{huang2025meraliontextllmcrosslingualunderstandinglarge,
title={MERaLiON-TextLLM: Cross-Lingual Understanding of Large Language Models in Chinese, Indonesian, Malay, and Singlish},
author={Xin Huang and Tarun Kumar Vangani and Minh Duc Pham and Xunlong Zou and Bin Wang and Zhengyuan Liu and Ai Ti Aw},
year={2025},
eprint={2501.08335},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.08335},
}
```