Breeze ASR 25

Breeze ASR 25 is an advanced automatic speech recognition (ASR) model fine-tuned from Whisper-large-v2, with the following features:
- Optimized for Taiwanese Mandarin (Traditional Chinese)
- Optimized for Mandarin-English code-switching scenarios, including intra-sentential and inter-sentential switching
- Enhanced timestamp alignment, suitable for automatic captioning
Example of improved recognition in a Mandarin-English code-switching scenario: MediaTek's 24th Anniversary
Breeze ASR 25:
面對不知道的我們怎麼用 open mind open heart 的心情去 explore
那 explore 過程也就是持續學習 不斷創新
當然如果能帶領 MediaTek 說達到這樣的 position
對做這樣的事情那覺得是一個 commitment
那也是一個 passion 那可以一直很努力的投入在做
Whisper-large-v2:
面對不知道的我們怎麼用開放心情去探索
把它探索過程也就是 仔細學習 不斷創新
當然如果能帶領MediaTek說 達到這樣的層次 對做這樣的事情
那覺得是一個貢獻那也是一個熱誠
那可以一直來努力地投入在做
Performance
Word error rates (WER, %) on benchmark datasets. The relative WER reduction (WERR), shown in parentheses, is reported against the Whisper-large-v2 baseline with automatic language detection (WLV2-Auto). "Breeze ASR 25" is referred to in the paper as "Twister".
Short-form Audio Datasets
| Dataset \ Model | Language | WLV2-Auto ↓ | WLV3-Auto ↓ | COOL-Whisper ↓ | Breeze ASR 25 (Ours) ↓ |
|---|---|---|---|---|---|
| ASCEND-OVERALL* | Mixed | 21.14 | 23.22 | 19.71 | 17.74 (-16.08%) |
| - ASCEND-EN | English | 27.36 | 27.21 | 29.39 | 26.64 (-2.63%) |
| - ASCEND-ZH | Mandarin | 17.49 | 17.41 | 18.90 | 16.04 (-8.29%) |
| - ASCEND-MIX* | Mixed | 21.01 | 25.13 | 17.34 | 16.38 (-22.01%) |
| CommonVoice16-zh-TW | Mandarin | 9.84 | 8.95 | 11.86 | 7.97 (-19.00%) |
| CSZS-zh-en* | Mixed | 29.49 | 26.43 | 20.90 | 13.01 (-55.88%) |
Long-form Audio Datasets
| Dataset \ Model | Language | WLV2-Auto ↓ | WLV3-Auto ↓ | COOL-Whisper ↓ | Breeze ASR 25 (Ours) ↓ |
|---|---|---|---|---|---|
| ML-lecture-2021-long* | Mandarin | 6.13 | 6.41 | 6.37 | 4.98 (-18.76%) |
| Formosa-Go | Mandarin | 15.03 | 14.90 | 16.83 | 13.61 (-9.44%) |
| Formosa-Show | Mandarin | 29.18 | 27.80 | 29.78 | 27.58 (-5.48%) |
| Formosa-Course | Mandarin | 9.50 | 9.67 | 11.12 | 9.94 (+0.44%) |
| Formosa-General | Mandarin | 11.45 | 11.46 | 13.33 | 11.37 (-0.69%) |
| FormosaSpeech | Mandarin | 22.34 | 21.22 | 26.71 | 22.09 (-1.12%) |
\* Code-switching datasets
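For reference, the relative reductions shown in parentheses can be reproduced directly from the WER columns; a minimal sketch, using the ASCEND-OVERALL row above as the example:

```python
# Relative WER reduction (WERR) versus the WLV2-Auto baseline,
# computed from the WER values reported in the tables above.
def werr(baseline_wer: float, model_wer: float) -> float:
    """Return the relative WER change in percent (negative = improvement)."""
    return (model_wer - baseline_wer) / baseline_wer * 100

# Example: ASCEND-OVERALL row, WLV2-Auto = 21.14, Breeze ASR 25 = 17.74
print(f"{werr(21.14, 17.74):.2f}%")  # -> -16.08%
```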
Training Data
The training data of Breeze ASR 25 is sampled from the following publicly available sources with permissive open-source licenses; all Chinese data are synthetic speech:
| Dataset Name | Type | Language | Total Hours | License |
|---|---|---|---|---|
| ODC Synth | Synthetic | Mandarin | 10,000 | Open Data Commons License Attribution + Apache 2.0* |
| CommonVoice17-EN | Real | English | 1,738 | Creative Commons Zero |
| NTUML2021 | Real | Code-switching | 11 | MIT License |
\*ODC Synth is generated using text from FineWeb2 (ODC License) and the TTS model BreezyVoice (Apache 2.0 License).
Additional code-switching samples are generated through data augmentation with these three datasets; further details can be found in our paper.
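The paper describes the full pipeline; as a rough illustration only, the ODC Synth recipe pairs FineWeb2 text with TTS audio. The sketch below assumes the `HuggingFaceFW/fineweb-2` dataset id and config name, and uses a placeholder `synthesize()` function standing in for BreezyVoice, whose actual interface is documented in its own repository:

```python
# Illustrative sketch of text-to-speech data synthesis (not the exact recipe).
# Assumptions: the dataset id/config and synthesize() below are placeholders;
# BreezyVoice's real API may differ.
import numpy as np
import soundfile as sf
from datasets import load_dataset

def synthesize(text: str, sample_rate: int = 16_000) -> np.ndarray:
    """Placeholder for a TTS call (e.g. BreezyVoice); returns silence here."""
    return np.zeros(sample_rate, dtype=np.float32)

texts = load_dataset("HuggingFaceFW/fineweb-2", "cmn_Hani", split="train", streaming=True)
for i, row in enumerate(texts.take(3)):
    audio = synthesize(row["text"])
    sf.write(f"odc_synth_{i:06d}.wav", audio, 16_000)  # paired with row["text"] as the label
```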
🔧 Usage Example
For subtitle file generation, please refer to the GitHub repository.
For quick testing, the Whisper architecture is supported in Hugging Face 🤗 Transformers. First, install the relevant packages:
```bash
pip install --upgrade pip
pip install --upgrade transformers datasets[audio] accelerate
```
The model can be used with the pipeline class to transcribe audio of arbitrary length. Simply change `input_audio.wav` in the following example to the actual filename of your audio.
```python
import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration, AutomaticSpeechRecognitionPipeline

# 1. Load audio
audio_path = "./input_audio.wav"
waveform, sample_rate = torchaudio.load(audio_path)

# 2. Preprocess: downmix to mono and resample to 16 kHz
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0)
waveform = waveform.squeeze().numpy()
if sample_rate != 16_000:
    resampler = torchaudio.transforms.Resample(sample_rate, 16_000)
    waveform = resampler(torch.tensor(waveform)).numpy()
    sample_rate = 16_000

# 3. Load model (assumes a CUDA GPU; use "cpu" instead if none is available)
processor = WhisperProcessor.from_pretrained("MediaTek-Research/Breeze-ASR-25")
model = WhisperForConditionalGeneration.from_pretrained("MediaTek-Research/Breeze-ASR-25").to("cuda").eval()

# 4. Build pipeline
asr_pipeline = AutomaticSpeechRecognitionPipeline(
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=0
)

# 5. Inference
output = asr_pipeline(waveform, return_timestamps=True)
print("Result:", output["text"])
```
You can obtain a wav file for testing by loading a sample from a benchmark dataset:
```python
import torch
import torchaudio
from datasets import load_dataset

# Load one test utterance from the ML2021 lecture benchmark
ds = load_dataset("ky552/ML2021_ASR_ST", split="test")
sample = ds[1279]["audio"]
audio_array = sample["array"]
sampling_rate = sample["sampling_rate"]

# Save it as a wav file (cast to float32 for torchaudio.save)
waveform = torch.tensor(audio_array, dtype=torch.float32).unsqueeze(0)
torchaudio.save("input_audio.wav", waveform, sampling_rate)

# Decoding results for this sample:
# Breeze ASR 25: "放進你的 training 裡面" (correct)
# Whisper: "放進你的權利裡面"
```
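The GitHub repository covers the full subtitle-generation workflow; as a minimal sketch, the segment timestamps returned by the pipeline above (`output["chunks"]` when `return_timestamps=True`) can be turned into a simple SRT file like this:

```python
# Minimal sketch: convert pipeline timestamps into an SRT subtitle file.
# Assumes `output` comes from asr_pipeline(waveform, return_timestamps=True) above.
def to_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

with open("output.srt", "w", encoding="utf-8") as f:
    for i, chunk in enumerate(output["chunks"], start=1):
        start, end = chunk["timestamp"]
        f.write(f"{i}\n{to_srt_time(start)} --> {to_srt_time(end or start)}\n{chunk['text'].strip()}\n\n")
```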
Acknowledgements
We thank NVIDIA for providing access to the Taipei-1 supercomputer.
We thank Professor Hung-yi Lee for his valuable guidance on this project.
📜 Citation
If you find this model useful, please cite our work:
Cheng-Kang Chou*, Chan-Jan Hsu*, Ho-Lam Chung, Liang-Hsuan Tseng, Hsi-Chun Cheng, Yu-Kuan Fu, Kuan-Po Huang, Hung-yi Lee
A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data
*Equal contribution
```bibtex
@article{chou2025selfrefiningframeworkenhancingasr,
  title={A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data},
  author={Cheng Kang Chou and Chan-Jan Hsu and Ho-Lam Chung and Liang-Hsuan Tseng and Hsi-Chun Cheng and Yu-Kuan Fu and Kuan Po Huang and Hung-Yi Lee},
  journal={arXiv preprint arXiv:2506.11130},
  year={2025}
}
```