Splend1dchan committed · verified
Commit ee2e04a · 1 Parent(s): 5542f59

Update README.md

Files changed (1):
  1. README.md +73 -36

README.md CHANGED
@@ -6,54 +6,99 @@ language:
 base_model:
 - openai/whisper-large-v2
 ---
- # Twister

- <img src="./twister.png" alt="Twister" width="700"/>

- **Twister** 是一個針對繁體中文以及中英交錯情境增強的語音辨識模型。**Twister** 基於 Whisper-large-v2 之上訓練而成,其中文部分完全採用合成語音資料進行訓練。

- **Twister** is an advanced ASR model fine-tuned from [Whisper-large-v2](https://github.com/openai/whisper) with TTS-synthesized data, specially optimized for Taiwanese Mandarin and Mandarin-English code-switching scenarios.

 ---
 ## Example:

- Twister:
- 面對不知道的我們怎麼用 open mind open heart 的心情去 explore 那 explore 過程也就是持續學習 不斷創新 當然如果能帶領 MediaTek 說達到這樣的 position 對做這樣的事情那覺得是一個 commitment 那也是一個 passion 那可以一直很努力的投入在做

- Whisper:
- 我們怎麼用開放心情去探索 把它探索過程也就是 仔細學習 不斷創新 當然如果能帶領MediaTek說 達到這樣的層次 對做這樣的事情 那覺得是一個貢獻那也是一個熱誠 那可以一直來努力地投入在做

 ---

 ## Performance
- The WERR is reported in comparison with the Whisper-large-v2 automatic language detection (WLV2-Auto) baseline.
 ### Short-form Audio Datasets

- | Dataset \ Model | WLV2-Oracle | WLV2-Auto ↓ | WLV3-Auto ↓ | COOL-Whisper ↓ | Twister (Ours) ↓ |
 |---------------------------|---------------|-------------|-------------|----------------|------------------|
- | ASCEND-OVERALL* | 21.14 (AUTO) | 21.14 | 23.22 | 19.71 | **17.74** (-16.08%) |
- | - ASCEND-EN | 27.20 (EN) | 27.36 | 27.21 | 29.39 | **26.64** (-2.63%) |
- | - ASCEND-ZH | **13.75** (ZH) | 17.49 | 17.41 | 18.90 | 16.04 (-8.29%) |
- | - ASCEND-MIX* | 21.01 (AUTO) | 21.01 | 25.13 | 17.34 | **16.38** (-22.01%) |
- | CommonVoice16-zh-TW | 9.02 (ZH) | 9.84 | 8.95 | 11.86 | **7.97** (-19.00%) |
- | CSZS-zh-en* | 29.49 (AUTO) | 29.49 | 26.43 | 20.90 | **13.01** (-55.88%) |

 ### Long-form Audio Datasets

- | Dataset \ Model | WLV2-Oracle | WLV2-Auto ↓ | WLV3-Auto ↓ | COOL-Whisper ↓ | Twister (Ours) ↓ |
 |---------------------------|---------------|-------------|-------------|----------------|------------------|
- | ML-lecture-2021-long* | 6.13 (ZH) | 6.13 | 6.41 | 6.37 | **4.98** (-18.76%) |
- | Formosa-Go | 15.03 (ZH) | 15.03 | 14.90 | 16.83 | **13.61** (-9.44%) |
- | Formosa-Show | 29.18 (ZH) | 29.18 | 27.80 | 29.78 | **27.58** (-5.48%) |
- | Formosa-Course | **9.50** (ZH) | 9.50 | 9.67 | 11.12 | 9.94 (+0.44%) |
- | Formosa-General | 11.45 (ZH) | 11.45 | 11.46 | 13.33 | **11.37** (-0.69%) |
- | FormosaSpeech | 22.34 (ZH) | 22.34 | 21.22 | 26.71 | **22.09** (-1.12%) |

 \* Code-switching datasets

 ---
 ## 🔧 Usage Example

 To run the model on `input_audio.wav`:
@@ -114,25 +159,17 @@ waveform = torch.tensor(audio_array).unsqueeze(0)
 torchaudio.save("input_audio.wav", waveform, sampling_rate)

 # Decoding Results:
 # Whisper: "放進你的權利裡面"
- # Twister: "放進你的 training 裡面" (correct)
 ```
 ---
- ## Training Data
-
- Twister 的訓練採樣自以下數據集:
-
- The training data of Twister is sampled from the following publicly available sources:

- | Dataset Name | Type | Language | Total Hours | License |
- |------------------------------------------------------------------------------|--------|-----------------|-------------|---------|
- | ODC Synth | Synth. | Mandarin | 10,000 | Open Data Commons License Attribution + Apache2.0* |
- | [CommonVoice17-EN](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | Real | English | 1,738 | Creative Commons Zero |
- | [NTUML2021](https://huggingface.co/datasets/ky552/ML2021_ASR_ST) | Real | Code-switching | 11 | MIT License |
-
- *ODC Synth is generated using text from [FineWeb2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) (ODC License) and the TTS model [BreezyVoice](https://huggingface.co/MediaTek-Research/BreezyVoice) (Apache2.0 License).

 ---
 
 base_model:
 - openai/whisper-large-v2
 ---
+ # Breeze ASR 25

+ <img src="./Breeze ASR 25.png" alt="Breeze ASR 25" width="700"/>

+ **Breeze ASR 25** 是一款基於 Whisper-large-v2 開發的語音辨識模型,並具有以下特色:

+ - 強化繁體中文語境辨識能力
+ - 採用單一混合語言向量解碼,強化中英交錯情境辨識能力,包含句內以及句外轉換
+ - 強化時間戳記對齊,適合自動字幕生成
+
+ **Breeze ASR 25** is an advanced ASR model fine-tuned from [Whisper-large-v2](https://github.com/openai/whisper), with the following features:
+
+ - Optimized for Taiwanese Mandarin
+ - Adopts a unified mixed-language embedding for decoding, optimized for Mandarin-English code-switching scenarios, including both intra-sentential and inter-sentential switching
+ - Enhanced time alignment, suitable for automatic captioning (see the sketch below)
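Since the model keeps the standard Whisper-large-v2 architecture, the timestamp and captioning behaviour can be exercised through the usual `transformers` ASR pipeline. A minimal captioning sketch, assuming the standard pipeline API; the model id and file name below are assumptions for illustration:

```python
# Minimal captioning sketch, assuming the standard transformers Whisper API.
# The model id below is an assumption for illustration.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="MediaTek-Research/Breeze-ASR-25",  # assumed model id
    chunk_length_s=30,                        # Whisper's native 30 s window
    return_timestamps=True,                   # segment timestamps for captions
)

result = asr("input_audio.wav")
for chunk in result["chunks"]:
    start, end = chunk["timestamp"]
    print(f"[{start:.2f}s --> {end:.2f}s] {chunk['text']}")
```

With `return_timestamps=True`, the pipeline emits one `(start, end)` pair per decoded segment, which maps directly onto subtitle cues.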

 ---
 ## Example:

+ 增強範例-中英混用情境 (enhanced example, Mandarin-English code-switching): [MediaTek's 24th Anniversary](https://www.youtube.com/watch?v=YkUv5qyhVhw&t=261s)

+ Breeze ASR 25:
+
+ ```
+ 面對不知道的我們怎麼用 open mind open heart 的心情去 explore
+ 那 explore 過程也就是持續學習 不斷創新
+ 當然如果能帶領 MediaTek 說達到這樣的 position
+ 對做這樣的事情那覺得是一個 commitment
+ 那也是一個 passion 那可以一直很努力的投入在做
+ ```
+
+ Whisper-large-v2:
+
+ ```
+ 面對不知道的我們怎麼用開放心情去探索
+ 把它探索過程也就是 仔細學習 不斷創新
+ 當然如果能帶領MediaTek說 達到這樣的層次 對做這樣的事情
+ 那覺得是一個貢獻那也是一個熱誠
+ 那可以一直來努力地投入在做
+ ```

 ---

 ## Performance
+ The WERR is reported in comparison with the Whisper-large-v2 automatic language detection (WLV2-Auto) baseline. "Breeze ASR 25" is referred to in the [paper](https://arxiv.org/pdf/2506.11130) as "Twister".
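The parenthesized values in the tables appear to follow the standard relative WER change against the baseline (an assumption consistent with most table entries; negative means improvement):

$$\mathrm{WERR} = \frac{\mathrm{WER}_{\text{model}} - \mathrm{WER}_{\text{WLV2-Auto}}}{\mathrm{WER}_{\text{WLV2-Auto}}}$$

For example, on ASCEND-OVERALL: (17.74 − 21.14) / 21.14 ≈ −16.08%, the value shown in parentheses below.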
 ### Short-form Audio Datasets

+ | Dataset \ Model | Language | WLV2-Auto ↓ | WLV3-Auto ↓ | COOL-Whisper ↓ | **Breeze ASR 25 (Ours)** ↓ |
 |---------------------------|---------------|-------------|-------------|----------------|------------------|
+ | ASCEND-OVERALL* | Mixed | 21.14 | 23.22 | 19.71 | **17.74** (-16.08%) |
+ | - ASCEND-EN | English | 27.36 | 27.21 | 29.39 | **26.64** (-2.63%) |
+ | - ASCEND-ZH | Mandarin | 17.49 | 17.41 | 18.90 | **16.04** (-8.29%) |
+ | - ASCEND-MIX* | Mixed | 21.01 | 25.13 | 17.34 | **16.38** (-22.01%) |
+ | CommonVoice16-zh-TW | Mandarin | 9.84 | 8.95 | 11.86 | **7.97** (-19.00%) |
+ | CSZS-zh-en* | Mixed | 29.49 | 26.43 | 20.90 | **13.01** (-55.88%) |

 ### Long-form Audio Datasets

+ | Dataset \ Model | Language | WLV2-Auto ↓ | WLV3-Auto ↓ | COOL-Whisper ↓ | **Breeze ASR 25 (Ours)** ↓ |
 |---------------------------|---------------|-------------|-------------|----------------|------------------|
+ | ML-lecture-2021-long* | Mandarin | 6.13 | 6.41 | 6.37 | **4.98** (-18.76%) |
+ | Formosa-Go | Mandarin | 15.03 | 14.90 | 16.83 | **13.61** (-9.44%) |
+ | Formosa-Show | Mandarin | 29.18 | 27.80 | 29.78 | **27.58** (-5.48%) |
+ | Formosa-Course | Mandarin | **9.50** | 9.67 | 11.12 | 9.94 (+0.44%) |
+ | Formosa-General | Mandarin | 11.45 | 11.46 | 13.33 | **11.37** (-0.69%) |
+ | FormosaSpeech | Mandarin | 22.34 | 21.22 | 26.71 | **22.09** (-1.12%) |

 \* Code-switching datasets

 ---

+ ## Training Data
+
+ 所有 Breeze ASR 25 的訓練取樣自**寬鬆自由軟體授權條款**的數據集,中文部分完全採用合成語音資料:
+
+ The training data of Breeze ASR 25 is sampled from the following publicly available sources with **permissive open-source licenses**, where all Chinese data are synthetic:
+
+ | Dataset Name | Type | Language | Total Hours | License |
+ |------------------------------------------------------------------------------|--------|-----------------|-------------|---------|
+ | ODC Synth | Synthetic | Mandarin | 10,000 | Open Data Commons License Attribution + Apache2.0* |
+ | [CommonVoice17-EN](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | Real | English | 1,738 | Creative Commons Zero |
+ | [NTUML2021](https://huggingface.co/datasets/ky552/ML2021_ASR_ST) | Real | Code-switching | 11 | MIT License |
+
+ *ODC Synth is generated using text from [FineWeb2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) (ODC License) and the TTS model [BreezyVoice](https://huggingface.co/MediaTek-Research/BreezyVoice) (Apache2.0 License).
+
+ Additional code-switching samples are generated through data augmentation with these three datasets; further details can be found in our [paper](https://arxiv.org/pdf/2506.11130).
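As a purely illustrative sketch of this kind of augmentation (the concatenation recipe below is a generic technique, not necessarily the one used in the paper; all names are hypothetical):

```python
# Illustrative only: a generic way to synthesize inter-sentential
# code-switching training pairs by concatenating a Mandarin and an
# English utterance. NOT necessarily the paper's augmentation recipe.
import random
import torch

def make_code_switched_pair(zh_wav, zh_text, en_wav, en_text,
                            pause_s=0.2, sr=16000):
    """zh_wav/en_wav: (1, n) tensors at 16 kHz; returns (audio, transcript)."""
    pause = torch.zeros(1, int(pause_s * sr))
    segments = [(zh_wav, zh_text), (en_wav, en_text)]
    random.shuffle(segments)  # randomize which language comes first
    audio = torch.cat([segments[0][0], pause, segments[1][0]], dim=1)
    transcript = f"{segments[0][1]} {segments[1][1]}"
    return audio, transcript
```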
+
+ ---
+

 ## 🔧 Usage Example

 To run the model on `input_audio.wav`:
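The full script is collapsed in this diff view; for orientation, a minimal sketch of the load-and-transcribe flow, assuming the standard `transformers` Whisper classes (the model id below is an assumption):

```python
# Minimal sketch of loading the model and transcribing input_audio.wav,
# assuming the standard transformers Whisper API; the model id is assumed.
import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model_id = "MediaTek-Research/Breeze-ASR-25"  # assumed model id
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id).eval()

waveform, sr = torchaudio.load("input_audio.wav")
if sr != 16000:  # Whisper expects 16 kHz input
    waveform = torchaudio.functional.resample(waveform, sr, 16000)

inputs = processor(waveform.squeeze(0).numpy(),
                   sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    generated_ids = model.generate(inputs.input_features)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```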
 
 torchaudio.save("input_audio.wav", waveform, sampling_rate)

 # Decoding Results:
+ # Breeze ASR 25: "放進你的 training 裡面" (correct)
 # Whisper: "放進你的權利裡面"
 ```
 ---

+ ## Acknowledgements
+
+ We thank NVIDIA for providing access to the Taipei-1 supercomputer.
+
+ We thank Professor Hung-yi Lee for his valuable guidance on this project.

 ---