---
license: apache-2.0
language:
- zh
- en
base_model:
- openai/whisper-large-v2
---

# Breeze ASR 25

[GitHub](https://github.com/mtkresearch/Breeze-ASR-25) | [Paper](https://arxiv.org/pdf/2506.11130)

**Breeze ASR 25** 是一款基於 Whisper-large-v2 開發的語音辨識模型,並具有以下特色:

- 強化繁體中文情境辨識能力
- 強化中英混用情境辨識能力,包含句內以及句外轉換
- 強化時間戳記對齊,適合自動字幕生成

**Breeze ASR 25** is an advanced ASR model fine-tuned from [Whisper-large-v2](https://github.com/openai/whisper), with the following features:

- Optimized for Taiwanese Mandarin
- Optimized for Mandarin-English code-switching scenarios, including intra-sentential and inter-sentential switching
- Enhanced time alignment, suitable for automatic captioning

---

## Example: 中英混用情境 (Mandarin-English Code-Switching)

Source: [MediaTek's 24th Anniversary](https://www.youtube.com/watch?v=YkUv5qyhVhw&t=261s)

Breeze ASR 25:

```
面對不知道的我們怎麼用 open mind open heart 的心情去 explore 那 explore 過程也就是持續學習 不斷創新 當然如果能帶領 MediaTek 說達到這樣的 position 對做這樣的事情那覺得是一個 commitment 那也是一個 passion 那可以一直很努力的投入在做
```

Whisper-large-v2:

```
面對不知道的我們怎麼用開放心情去探索 把它探索過程也就是 仔細學習 不斷創新 當然如果能帶領MediaTek說 達到這樣的層次 對做這樣的事情 那覺得是一個貢獻那也是一個熱誠 那可以一直來努力地投入在做
```

---

## Performance

Word error rates (WER) on benchmark datasets. The WERR (relative WER reduction) is reported in comparison with the Whisper-large-v2 automatic language detection (WLV2-Auto) baseline; a worked example follows the tables below. "Breeze ASR 25" is referred to as "Twister" in the [paper](https://arxiv.org/pdf/2506.11130).

### Short-form Audio Datasets

| Dataset\Model | Language | WLV2-Auto ↓ | WLV3-Auto ↓ | COOL-Whisper ↓ | **Breeze ASR 25 (Ours)** ↓ |
|---------------------------|---------------|-------------|-------------|----------------|------------------|
| ASCEND-OVERALL*           | Mixed         | 21.14       | 23.22       | 19.71          | **17.74** (-16.08%) |
| - ASCEND-EN               | English       | 27.36       | 27.21       | 29.39          | **26.64** (-2.63%)  |
| - ASCEND-ZH               | Mandarin      | 17.49       | 17.41       | 18.90          | **16.04** (-8.29%)  |
| - ASCEND-MIX*             | Mixed         | 21.01       | 25.13       | 17.34          | **16.38** (-22.01%) |
| CommonVoice16-zh-TW       | Mandarin      | 9.84        | 8.95        | 11.86          | **7.97** (-19.00%)  |
| CSZS-zh-en*               | Mixed         | 29.49       | 26.43       | 20.90          | **13.01** (-55.88%) |

### Long-form Audio Datasets

| Dataset\Model | Language | WLV2-Auto ↓ | WLV3-Auto ↓ | COOL-Whisper ↓ | **Breeze ASR 25 (Ours)** ↓ |
|---------------------------|---------------|-------------|-------------|----------------|------------------|
| ML-lecture-2021-long*     | Mandarin      | 6.13        | 6.41        | 6.37           | **4.98** (-18.76%)  |
| Formosa-Go                | Mandarin      | 15.03       | 14.90       | 16.83          | **13.61** (-9.44%)  |
| Formosa-Show              | Mandarin      | 29.18       | 27.80       | 29.78          | **27.58** (-5.48%)  |
| Formosa-Course            | Mandarin      | **9.50**    | 9.67        | 11.12          | 9.94 (+0.44%)       |
| Formosa-General           | Mandarin      | 11.45       | 11.46       | 13.33          | **11.37** (-0.69%)  |
| FormosaSpeech             | Mandarin      | 22.34       | 21.22       | 26.71          | **22.09** (-1.12%)  |

\* Code-switching datasets
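
To make the WERR column concrete, here is a minimal sketch that recomputes the relative reduction from the table values. The `jiwer` calls are an assumption of this sketch only (the package is not used by this card or confirmed by the paper), and the paper's exact text normalization for Mandarin and code-switched text may differ.

```python
# Sketch only: recompute WERR relative to the WLV2-Auto baseline.
import jiwer  # assumed helper package for illustration: pip install jiwer


def werr(model_wer: float, baseline_wer: float) -> float:
    """Relative WER reduction vs. the baseline; negative means improvement."""
    return (model_wer - baseline_wer) / baseline_wer * 100


# CSZS-zh-en row of the short-form table: Breeze ASR 25 = 13.01, WLV2-Auto = 29.49
print(f"WERR = {werr(13.01, 29.49):.2f}%")  # -55.88%, matching the table

# Illustrative per-utterance scoring; character-level metrics are a common
# choice for Mandarin and code-switched text, but this is not the paper's
# documented evaluation pipeline.
reference = "放進你的 training 裡面"
hypothesis = "放進你的權利裡面"
print("WER:", jiwer.wer(reference, hypothesis))
print("CER:", jiwer.cer(reference, hypothesis))
```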
---

## Training Data

所有 Breeze ASR 25 的訓練資料取樣自**寬鬆自由軟體授權條款**的數據集,中文部分完全採用合成語音資料:

The training data of Breeze ASR 25 is sampled from the following publicly available sources with **permissive open-source licenses**, where all Chinese data are synthetic:

| Dataset Name | Type | Language | Total Hours | License |
|------------------------------------------------------------------------------|--------|-----------------|-------------|---------|
| ODC Synth | Synthetic | Mandarin | 10,000 | Open Data Commons License Attribution + Apache 2.0* |
| [CommonVoice17-EN](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | Real | English | 1,738 | Creative Commons Zero |
| [NTUML2021](https://huggingface.co/datasets/ky552/ML2021_ASR_ST) | Real | Code-switching | 11 | MIT License |

\*ODC Synth is generated using text from [FineWeb2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) (ODC License) and the TTS model [BreezyVoice](https://huggingface.co/MediaTek-Research/BreezyVoice) (Apache 2.0 License).

Additional code-switching samples are generated through data augmentation with these three datasets; further details can be found in our [paper](https://arxiv.org/pdf/2506.11130).

---

## 🔧 Usage Example

字幕檔生成,請參考 [GitHub](https://github.com/mtkresearch/Breeze-ASR-25)。

Please refer to the [GitHub](https://github.com/mtkresearch/Breeze-ASR-25) repository for subtitle generation.

For quick testing, the Whisper architecture is supported in Hugging Face 🤗 Transformers. First, install the relevant packages:

```
pip install --upgrade pip
pip install --upgrade transformers datasets[audio] accelerate
```

The model can be used with the pipeline class to transcribe audio of arbitrary length. Simply change `input_audio.wav` in the following example to the actual filename of your audio:

```python
import torchaudio
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration, AutomaticSpeechRecognitionPipeline

# 1. Load audio
audio_path = "./input_audio.wav"
waveform, sample_rate = torchaudio.load(audio_path)

# 2. Preprocess: downmix to mono and resample to 16 kHz
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0)
waveform = waveform.squeeze().numpy()
if sample_rate != 16_000:
    resampler = torchaudio.transforms.Resample(sample_rate, 16_000)
    waveform = resampler(torch.tensor(waveform)).numpy()
    sample_rate = 16_000

# 3. Load model
processor = WhisperProcessor.from_pretrained("MediaTek-Research/Breeze-ASR-25")
model = WhisperForConditionalGeneration.from_pretrained("MediaTek-Research/Breeze-ASR-25").to("cuda").eval()

# 4. Build pipeline
asr_pipeline = AutomaticSpeechRecognitionPipeline(
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=0
)

# 5. Inference
output = asr_pipeline(waveform, return_timestamps=True)
print("Result:", output["text"])
```

You can obtain a wav file for testing by loading a sample from a benchmark:

```python
from datasets import load_dataset
import torch
import torchaudio

ds = load_dataset("ky552/ML2021_ASR_ST", split="test")
sample = ds[1279]["audio"]

audio_array = sample["array"]
sampling_rate = sample["sampling_rate"]

waveform = torch.tensor(audio_array).unsqueeze(0)
torchaudio.save("input_audio.wav", waveform, sampling_rate)

# Decoding results:
# Breeze ASR 25: "放進你的 training 裡面" (correct)
# Whisper:       "放進你的權利裡面"
```

---

## Acknowledgements

We thank NVIDIA for providing access to the Taipei-1 supercomputer. We thank Professor Hung-yi Lee for his valuable guidance on this project.

---

## 📜 Citation

If you find this model useful, please cite our work:

**Cheng-Kang Chou\***, **Chan-Jan Hsu\***, Ho-Lam Chung, Liang-Hsuan Tseng, Hsi-Chun Cheng, Yu-Kuan Fu, Kuan-Po Huang, Hung-yi Lee

[*A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data*](https://arxiv.org/pdf/2506.11130)

\*Equal contribution

```bibtex
@article{chou2025selfrefiningframeworkenhancingasr,
  title={A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data},
  author={Cheng Kang Chou and Chan-Jan Hsu and Ho-Lam Chung and Liang-Hsuan Tseng and Hsi-Chun Cheng and Yu-Kuan Fu and Kuan Po Huang and Hung-Yi Lee},
  journal={arXiv preprint arXiv:2506.11130},
  year={2025}
}
```