Splend1dchan committed · verified
Commit ee2e04a · 1 Parent(s): 5542f59

Update README.md

Files changed (1):
  1. README.md +73 -36

README.md CHANGED
@@ -6,54 +6,99 @@ language:
 base_model:
 - openai/whisper-large-v2
 ---
- # Twister

- <img src="./twister.png" alt="Twister" width="700"/>

- **Twister** 是一個針對繁體中文以及中英交錯情境增強的語音辨識模型。**Twister** 基於 Whisper-large-v2 之上訓練而成,其中文部分完全採用合成語音資料進行訓練。

- **Twister** is an advanced ASR model fine-tuned from [Whisper-large-v2](https://github.com/openai/whisper) with TTS-synthesized data, specially optimized for Taiwanese Mandarin and Mandarin-English code-switching scenarios.

 ---
 ## Example:

- Twister:
- 面對不知道的我們怎麼用 open mind open heart 的心情去 explore 那 explore 過程也就是持續學習 不斷創新 當然如果能帶領 MediaTek 說達到這樣的 position 對做這樣的事情那覺得是一個 commitment 那也是一個 passion 那可以一直很努力的投入在做

- Whisper:
- 我們怎麼用開放心情去探索 把它探索過程也就是 仔細學習 不斷創新 當然如果能帶領MediaTek說 達到這樣的層次 對做這樣的事情 那覺得是一個貢獻那也是一個熱誠 那可以一直來努力地投入在做

 ---

 ## Performance
- The WERR is reported in comparison with the Whisper-large-v2 automatic language detection (WLV2-Auto) baseline.
 ### Short-form Audio Datasets

- | Dataset \ Model | WLV2-Oracle | WLV2-Auto ↓ | WLV3-Auto ↓ | COOL-Whisper ↓ | Twister (Ours) ↓ |
 |---------------------------|---------------|-------------|-------------|----------------|------------------|
- | ASCEND-OVERALL* | 21.14 (AUTO) | 21.14 | 23.22 | 19.71 | **17.74** (-16.08%) |
- | - ASCEND-EN | 27.20 (EN) | 27.36 | 27.21 | 29.39 | **26.64** (-2.63%) |
- | - ASCEND-ZH | **13.75** (ZH) | 17.49 | 17.41 | 18.90 | 16.04 (-8.29%) |
- | - ASCEND-MIX* | 21.01 (AUTO) | 21.01 | 25.13 | 17.34 | **16.38** (-22.01%) |
- | CommonVoice16-zh-TW | 9.02 (ZH) | 9.84 | 8.95 | 11.86 | **7.97** (-19.00%) |
- | CSZS-zh-en* | 29.49 (AUTO) | 29.49 | 26.43 | 20.90 | **13.01** (-55.88%) |

 ### Long-form Audio Datasets

- | Dataset \ Model | WLV2-Oracle | WLV2-Auto ↓ | WLV3-Auto ↓ | COOL-Whisper ↓ | Twister (Ours) ↓ |
 |---------------------------|---------------|-------------|-------------|----------------|------------------|
- | ML-lecture-2021-long* | 6.13 (ZH) | 6.13 | 6.41 | 6.37 | **4.98** (-18.76%) |
- | Formosa-Go | 15.03 (ZH) | 15.03 | 14.90 | 16.83 | **13.61** (-9.44%) |
- | Formosa-Show | 29.18 (ZH) | 29.18 | 27.80 | 29.78 | **27.58** (-5.48%) |
- | Formosa-Course | **9.50** (ZH) | 9.50 | 9.67 | 11.12 | 9.94 (+0.44%) |
- | Formosa-General | 11.45 (ZH) | 11.45 | 11.46 | 13.33 | **11.37** (-0.69%) |
- | FormosaSpeech | 22.34 (ZH) | 22.34 | 21.22 | 26.71 | **22.09** (-1.12%) |

 \* Code-switching datasets

 ---
 ## 🔧 Usage Example

 To run the model on `input_audio.wav`:
@@ -114,25 +159,17 @@ waveform = torch.tensor(audio_array).unsqueeze(0)
 torchaudio.save("input_audio.wav", waveform, sampling_rate)

 # Decoding Results:
 # Whisper: "放進你的權利裡面"
- # Twister: "放進你的 training 裡面" (correct)
 ```
 ---
- ## Training Data
-
- Twister 的訓練採樣自以下數據集:
-
- The training data of Twister is sampled from the following publicly available sources:

- | Dataset Name | Type | Language | Total Hours | License |
- |------------------------------------------------------------------------------|--------|-----------------|-------------|---------|
- | ODC Synth | Synth. | Mandarin | 10,000 | Open Data Commons License Attribution + Apache2.0* |
- | [CommonVoice17-EN](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | Real | English | 1,738 | Creative Commons Zero |
- | [NTUML2021](https://huggingface.co/datasets/ky552/ML2021_ASR_ST) | Real | Code-switching | 11 | MIT License |
-
- *ODC Synth is generated using text from [FineWeb2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) (ODC License) and the TTS model [BreezyVoice](https://huggingface.co/MediaTek-Research/BreezyVoice) (Apache2.0 License).

 ---
 
 base_model:
 - openai/whisper-large-v2
 ---
+ # Breeze ASR 25

+ <img src="./Breeze ASR 25.png" alt="Breeze ASR 25" width="700"/>

+ **Breeze ASR 25** 是一款基於 Whisper-large-v2 開發的語音辨識模型,並具有以下特色:

+ - 強化繁體中文語境辨識能力
+ - 採用單一混合語言向量解碼,強化中英交錯情境辨識能力,包含句內以及句外轉換
+ - 強化時間戳記對齊,適合自動字幕生成
+
+ **Breeze ASR 25** is an advanced ASR model fine-tuned from [Whisper-large-v2](https://github.com/openai/whisper), with the following features:
+
+ - Optimized for Taiwanese Mandarin
+ - Adopts a unified mixed-language embedding for decoding, optimized for Mandarin-English code-switching scenarios, including both intra-sentential and inter-sentential switching
+ - Enhanced time alignment, suitable for automatic captioning (see the sketch below)
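Since the model keeps the standard Whisper-large-v2 architecture, the timestamp and captioning behaviour can be exercised through the usual `transformers` ASR pipeline. A minimal captioning sketch, assuming the standard pipeline API; the model id and file name below are assumptions for illustration:

```python
# Minimal captioning sketch, assuming the standard transformers Whisper API.
# The model id below is an assumption for illustration.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="MediaTek-Research/Breeze-ASR-25",  # assumed model id
    chunk_length_s=30,                        # Whisper's native 30 s window
    return_timestamps=True,                   # segment timestamps for captions
)

result = asr("input_audio.wav")
for chunk in result["chunks"]:
    start, end = chunk["timestamp"]
    print(f"[{start:.2f}s --> {end:.2f}s] {chunk['text']}")
```

With `return_timestamps=True`, the pipeline emits one `(start, end)` pair per decoded segment, which maps directly onto subtitle cues.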

 ---
 ## Example:

+ 增強範例-中英混用情境 (enhanced example, Mandarin-English code-switching): [MediaTek's 24th Anniversary](https://www.youtube.com/watch?v=YkUv5qyhVhw&t=261s)

+ Breeze ASR 25:
+
+ ```
+ 面對不知道的我們怎麼用 open mind open heart 的心情去 explore
+ 那 explore 過程也就是持續學習 不斷創新
+ 當然如果能帶領 MediaTek 說達到這樣的 position
+ 對做這樣的事情那覺得是一個 commitment
+ 那也是一個 passion 那可以一直很努力的投入在做
+ ```
+
+ Whisper-large-v2:
+
+ ```
+ 面對不知道的我們怎麼用開放心情去探索
+ 把它探索過程也就是 仔細學習 不斷創新
+ 當然如果能帶領MediaTek說 達到這樣的層次 對做這樣的事情
+ 那覺得是一個貢獻那也是一個熱誠
+ 那可以一直來努力地投入在做
+ ```

 ---

 ## Performance
+ The WERR is reported in comparison with the Whisper-large-v2 automatic language detection (WLV2-Auto) baseline. "Breeze ASR 25" is referred to in the [paper](https://arxiv.org/pdf/2506.11130) as "Twister".
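The parenthesized values in the tables appear to follow the standard relative WER change against the baseline (an assumption consistent with most table entries; negative means improvement):

$$\mathrm{WERR} = \frac{\mathrm{WER}_{\text{model}} - \mathrm{WER}_{\text{WLV2-Auto}}}{\mathrm{WER}_{\text{WLV2-Auto}}}$$

For example, on ASCEND-OVERALL: (17.74 − 21.14) / 21.14 ≈ −16.08%, the value shown in parentheses below.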
 ### Short-form Audio Datasets

+ | Dataset \ Model | Language | WLV2-Auto ↓ | WLV3-Auto ↓ | COOL-Whisper ↓ | **Breeze ASR 25 (Ours)** ↓ |
 |---------------------------|---------------|-------------|-------------|----------------|------------------|
+ | ASCEND-OVERALL* | Mixed | 21.14 | 23.22 | 19.71 | **17.74** (-16.08%) |
+ | - ASCEND-EN | English | 27.36 | 27.21 | 29.39 | **26.64** (-2.63%) |
+ | - ASCEND-ZH | Mandarin | 17.49 | 17.41 | 18.90 | **16.04** (-8.29%) |
+ | - ASCEND-MIX* | Mixed | 21.01 | 25.13 | 17.34 | **16.38** (-22.01%) |
+ | CommonVoice16-zh-TW | Mandarin | 9.84 | 8.95 | 11.86 | **7.97** (-19.00%) |
+ | CSZS-zh-en* | Mixed | 29.49 | 26.43 | 20.90 | **13.01** (-55.88%) |

 ### Long-form Audio Datasets

+ | Dataset \ Model | Language | WLV2-Auto ↓ | WLV3-Auto ↓ | COOL-Whisper ↓ | **Breeze ASR 25 (Ours)** ↓ |
 |---------------------------|---------------|-------------|-------------|----------------|------------------|
+ | ML-lecture-2021-long* | Mandarin | 6.13 | 6.41 | 6.37 | **4.98** (-18.76%) |
+ | Formosa-Go | Mandarin | 15.03 | 14.90 | 16.83 | **13.61** (-9.44%) |
+ | Formosa-Show | Mandarin | 29.18 | 27.80 | 29.78 | **27.58** (-5.48%) |
+ | Formosa-Course | Mandarin | **9.50** | 9.67 | 11.12 | 9.94 (+0.44%) |
+ | Formosa-General | Mandarin | 11.45 | 11.46 | 13.33 | **11.37** (-0.69%) |
+ | FormosaSpeech | Mandarin | 22.34 | 21.22 | 26.71 | **22.09** (-1.12%) |

 \* Code-switching datasets

 ---

+ ## Training Data
+
+ 所有 Breeze ASR 25 的訓練取樣自**寬鬆自由軟體授權條款**的數據集,中文部分完全採用合成語音資料:
+
+ The training data of Breeze ASR 25 is sampled from the following publicly available sources with **permissive open-source licenses**, where all Chinese data are synthetic:
+
+ | Dataset Name | Type | Language | Total Hours | License |
+ |------------------------------------------------------------------------------|--------|-----------------|-------------|---------|
+ | ODC Synth | Synthetic | Mandarin | 10,000 | Open Data Commons License Attribution + Apache2.0* |
+ | [CommonVoice17-EN](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | Real | English | 1,738 | Creative Commons Zero |
+ | [NTUML2021](https://huggingface.co/datasets/ky552/ML2021_ASR_ST) | Real | Code-switching | 11 | MIT License |
+
+ *ODC Synth is generated using text from [FineWeb2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) (ODC License) and the TTS model [BreezyVoice](https://huggingface.co/MediaTek-Research/BreezyVoice) (Apache2.0 License).
+
+ Additional code-switching samples are generated through data augmentation with these three datasets; further details can be found in our [paper](https://arxiv.org/pdf/2506.11130).
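As a purely illustrative sketch of this kind of augmentation (the concatenation recipe below is a generic technique, not necessarily the one used in the paper; all names are hypothetical):

```python
# Illustrative only: a generic way to synthesize inter-sentential
# code-switching training pairs by concatenating a Mandarin and an
# English utterance. NOT necessarily the paper's augmentation recipe.
import random
import torch

def make_code_switched_pair(zh_wav, zh_text, en_wav, en_text,
                            pause_s=0.2, sr=16000):
    """zh_wav/en_wav: (1, n) tensors at 16 kHz; returns (audio, transcript)."""
    pause = torch.zeros(1, int(pause_s * sr))
    segments = [(zh_wav, zh_text), (en_wav, en_text)]
    random.shuffle(segments)  # randomize which language comes first
    audio = torch.cat([segments[0][0], pause, segments[1][0]], dim=1)
    transcript = f"{segments[0][1]} {segments[1][1]}"
    return audio, transcript
```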
+
+ ---
+

 ## 🔧 Usage Example

 To run the model on `input_audio.wav`:
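The full script is collapsed in this diff view; for orientation, a minimal sketch of the load-and-transcribe flow, assuming the standard `transformers` Whisper classes (the model id below is an assumption):

```python
# Minimal sketch of loading the model and transcribing input_audio.wav,
# assuming the standard transformers Whisper API; the model id is assumed.
import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model_id = "MediaTek-Research/Breeze-ASR-25"  # assumed model id
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id).eval()

waveform, sr = torchaudio.load("input_audio.wav")
if sr != 16000:  # Whisper expects 16 kHz input
    waveform = torchaudio.functional.resample(waveform, sr, 16000)

inputs = processor(waveform.squeeze(0).numpy(),
                   sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    generated_ids = model.generate(inputs.input_features)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```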
 
 torchaudio.save("input_audio.wav", waveform, sampling_rate)

 # Decoding Results:
+ # Breeze ASR 25: "放進你的 training 裡面" (correct)
 # Whisper: "放進你的權利裡面"
 ```
 ---

+ ## Acknowledgements
+
+ We thank NVIDIA for providing access to the Taipei-1 supercomputer.
+
+ We thank Professor Hung-yi Lee for his valuable guidance on this project.

 ---