Update README.md
base_model:
- openai/whisper-large-v2
---

# Breeze ASR 25

<img src="./Breeze ASR 25.png" alt="Breeze ASR 25" width="700"/>

**Breeze ASR 25** 是一款基於 Whisper-large-v2 開發的語音辨識模型,並具有以下特色:

- 強化繁體中文語境辨識能力
- 採用單一混合語言向量解碼,強化中英交錯情境辨識能力,包含句內以及句間轉換
- 強化時間戳記對齊,適合自動字幕生成

**Breeze ASR 25** is an advanced ASR model fine-tuned from [Whisper-large-v2](https://github.com/openai/whisper), with the following features:

- Optimized for Taiwanese Mandarin
- Adopts a unified mixed-language embedding for decoding, optimized for Mandarin-English code-switching scenarios, including intra-sentential and inter-sentential switching
- Enhanced time alignment, suitable for automatic captioning

---

## Example

增強範例-中英混用情境 (Enhanced example: Mandarin-English code-switching): [MediaTek's 24th Anniversary](https://www.youtube.com/watch?v=YkUv5qyhVhw&t=261s)

Breeze ASR 25:

```
面對不知道的我們怎麼用 open mind open heart 的心情去 explore
那 explore 過程也就是持續學習 不斷創新
當然如果能帶領 MediaTek 說達到這樣的 position
對做這樣的事情那覺得是一個 commitment
那也是一個 passion 那可以一直很努力的投入在做
```

Whisper-large-v2:

```
面對不知道的我們怎麼用開放心情去探索
把它探索過程也就是 仔細學習 不斷創新
當然如果能帶領MediaTek說 達到這樣的層次 對做這樣的事情
那覺得是一個貢獻那也是一個熱誠
那可以一直來努力地投入在做
```

---

## Performance

The WERR (word error rate reduction) is reported relative to the Whisper-large-v2 automatic language detection (WLV2-Auto) baseline. "Breeze ASR 25" is referred to in the [paper](https://arxiv.org/pdf/2506.11130) as "Twister".
### Short-form Audio Datasets

| Dataset\Model             | Language | WLV2-Auto ↓ | WLV3-Auto ↓ | COOL-Whisper ↓ | **Breeze ASR 25 (Ours)** ↓ |
|---------------------------|----------|-------------|-------------|----------------|------------------|
| ASCEND-OVERALL*           | Mixed    | 21.14       | 23.22       | 19.71          | **17.74** (-16.08%) |
| - ASCEND-EN               | English  | 27.36       | 27.21       | 29.39          | **26.64** (-2.63%)  |
| - ASCEND-ZH               | Mandarin | 17.49       | 17.41       | 18.90          | **16.04** (-8.29%)  |
| - ASCEND-MIX*             | Mixed    | 21.01       | 25.13       | 17.34          | **16.38** (-22.01%) |
| CommonVoice16-zh-TW       | Mandarin | 9.84        | 8.95        | 11.86          | **7.97** (-19.00%)  |
| CSZS-zh-en*               | Mixed    | 29.49       | 26.43       | 20.90          | **13.01** (-55.88%) |
### Long-form Audio Datasets

| Dataset\Model             | Language | WLV2-Auto ↓ | WLV3-Auto ↓ | COOL-Whisper ↓ | **Breeze ASR 25 (Ours)** ↓ |
|---------------------------|----------|-------------|-------------|----------------|------------------|
| ML-lecture-2021-long*     | Mandarin | 6.13        | 6.41        | 6.37           | **4.98** (-18.76%)  |
| Formosa-Go                | Mandarin | 15.03       | 14.90       | 16.83          | **13.61** (-9.44%)  |
| Formosa-Show              | Mandarin | 29.18       | 27.80       | 29.78          | **27.58** (-5.48%)  |
| Formosa-Course            | Mandarin | **9.50**    | 9.67        | 11.12          | 9.94 (+4.63%)       |
| Formosa-General           | Mandarin | 11.45       | 11.46       | 13.33          | **11.37** (-0.69%)  |
| FormosaSpeech             | Mandarin | 22.34       | 21.22       | 26.71          | **22.09** (-1.12%)  |

\* Code-switching datasets
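As a sanity check, the parenthesized WERR values above are the relative change of the Breeze ASR 25 WER against the WLV2-Auto column. A one-line helper (the function name is invented here for illustration) reproduces them:

```python
def werr(baseline_wer: float, model_wer: float) -> float:
    """Word error rate reduction relative to a baseline, in percent.

    Negative values mean the model improves on the baseline.
    """
    return (model_wer - baseline_wer) / baseline_wer * 100.0


# ASCEND-OVERALL: WLV2-Auto 21.14 vs. Breeze ASR 25 17.74
print(round(werr(21.14, 17.74), 2))  # -16.08, matching the table
```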

---

## Training Data

所有 Twister 的訓練資料取樣自**寬鬆自由軟體授權條款**的公開數據集,中文部分完全採用合成語音資料:

The training data of Twister is sampled from the following publicly available sources with **permissive open-source licenses**; all Chinese data are synthetic:

| Dataset Name | Type | Language | Total Hours | License |
|------------------------------------------------------------------------------|--------|-----------------|-------------|---------|
| ODC Synth | Synthetic | Mandarin | 10,000 | Open Data Commons License Attribution + Apache 2.0* |
| [CommonVoice17-EN](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | Real | English | 1,738 | Creative Commons Zero |
| [NTUML2021](https://huggingface.co/datasets/ky552/ML2021_ASR_ST) | Real | Code-switching | 11 | MIT License |

\*ODC Synth is generated using text from [FineWeb2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) (ODC License) and the TTS model [BreezyVoice](https://huggingface.co/MediaTek-Research/BreezyVoice) (Apache 2.0 License).

Additional code-switching samples are generated through data augmentation with these three datasets; further details can be found in our [paper](https://arxiv.org/pdf/2506.11130).

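The exact augmentation recipe is described in the paper. Purely as a generic illustration of inter-sentential splicing (the function name, silence gap, and sample rate below are assumptions, not the paper's values), a code-switching sample can be synthesized by joining monolingual utterances with a short pause:

```python
import numpy as np


def splice_code_switch(en_wave: np.ndarray, zh_wave: np.ndarray,
                       gap_s: float = 0.2, sr: int = 16_000) -> np.ndarray:
    """Concatenate an English and a Mandarin utterance with a silence gap,
    yielding a synthetic inter-sentential code-switching sample."""
    gap = np.zeros(int(gap_s * sr), dtype=en_wave.dtype)
    return np.concatenate([en_wave, gap, zh_wave])
```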
## 🔧 Usage Example

To run the model on `input_audio.wav`:

```python
# …
waveform = torch.tensor(audio_array).unsqueeze(0)
torchaudio.save("input_audio.wav", waveform, sampling_rate)

# Decoding Results:
# Breeze ASR 25: "放進你的 training 裡面" (correct)
# Whisper: "放進你的權利裡面"
```

---

## Acknowledgements

We thank NVIDIA for providing access to the Taipei-1 supercomputer.

We thank Professor Hung-yi Lee for his valuable guidance on this project.

---