songlindotiot committed on
Commit
6677fde
·
verified ·
1 Parent(s): 2617fb8

Update README.md

Files changed (1)
  1. README.md +111 -194
README.md CHANGED
@@ -1,195 +1,112 @@
1
- ---
2
- language: vie
3
- datasets:
4
- - legacy-datasets/common_voice
5
- - vlsp2020_vinai_100h
6
- - AILAB-VNUHCM/vivos
7
- - doof-ferb/vlsp2020_vinai_100h
8
- - doof-ferb/fpt_fosd
9
- - doof-ferb/infore1_25hours
10
- - linhtran92/viet_bud500
11
- - doof-ferb/LSVSC
12
- - doof-ferb/vais1000
13
- - doof-ferb/VietMed_labeled
14
- - NhutP/VSV-1100
15
- - doof-ferb/Speech-MASSIVE_vie
16
- - doof-ferb/BibleMMS_vie
17
- - capleaf/viVoice
18
- metrics:
19
- - wer
20
- pipeline_tag: automatic-speech-recognition
21
- tags:
22
- - transcription
23
- - audio
24
- - speech
25
- - chunkformer
26
- - asr
27
- - automatic-speech-recognition
28
- license: cc-by-nc-4.0
29
- model-index:
30
- - name: ChunkFormer Large Vietnamese
31
- results:
32
- - task:
33
- name: Speech Recognition
34
- type: automatic-speech-recognition
35
- dataset:
36
- name: common-voice-vietnamese
37
- type: common_voice
38
- args: vi
39
- metrics:
40
- - name: Test WER
41
- type: wer
42
- value: 6.66
43
- source:
44
- name: Common Voice Vi Leaderboard
45
- url: https://paperswithcode.com/sota/speech-recognition-on-common-voice-vi
46
- - task:
47
- name: Speech Recognition
48
- type: automatic-speech-recognition
49
- dataset:
50
- name: VIVOS
51
- type: vivos
52
- args: vi
53
- metrics:
54
- - name: Test WER
55
- type: wer
56
- value: 4.18
57
- source:
58
- name: Vivos Leaderboard
59
- url: https://paperswithcode.com/sota/speech-recognition-on-vivos
60
- - task:
61
- name: Speech Recognition
62
- type: automatic-speech-recognition
63
- dataset:
64
- name: VLSP - Task 1
65
- type: vlsp
66
- args: vi
67
- metrics:
68
- - name: Test WER
69
- type: wer
70
- value: 14.09
71
- ---
72
-
73
- # **ChunkFormer-Large-Vie: Large-Scale Pretrained ChunkFormer for Vietnamese Automatic Speech Recognition**
74
- <style>
75
- img {
76
- display: inline;
77
- }
78
- </style>
79
- [![Ranked #1: Speech Recognition on Common Voice Vi](https://img.shields.io/badge/Ranked%20%231%3A%20Speech%20Recognition%20on%20Common%20Voice%20Vi-%F0%9F%8F%86%20SOTA-blueviolet?style=for-the-badge&logo=paperswithcode&logoColor=white)](https://paperswithcode.com/sota/speech-recognition-on-common-voice-vi)
80
- [![Ranked #1: Speech Recognition on VIVOS](https://img.shields.io/badge/Ranked%20%231%3A%20Speech%20Recognition%20on%20VIVOS-%F0%9F%8F%86%20SOTA-blueviolet?style=for-the-badge&logo=paperswithcode&logoColor=white)](https://paperswithcode.com/sota/speech-recognition-on-vivos)
81
-
82
- [![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)
83
- [![GitHub](https://img.shields.io/badge/GitHub-ChunkFormer-blue)](https://github.com/khanld/chunkformer)
84
- [![Paper](https://img.shields.io/badge/Paper-ICASSP%202025-green)](https://arxiv.org/abs/2502.14673)
85
- [![Model size](https://img.shields.io/badge/Params-110M-lightgrey#model-badge)](#description)
86
-
87
- ---
88
- ## Table of contents
89
- 1. [Model Description](#description)
90
- 2. [Documentation and Implementation](#implementation)
91
- 3. [Benchmark Results](#benchmark)
92
- 4. [Usage](#usage)
93
- 6. [Citation](#citation)
94
- 7. [Contact](#contact)
95
-
96
- ---
97
- <a name = "description" ></a>
98
- ## Model Description
99
- **ChunkFormer-Large-Vie** is a large-scale Vietnamese Automatic Speech Recognition (ASR) model based on the **ChunkFormer** architecture, introduced at **ICASSP 2025**. The model has been fine-tuned on approximately **3000 hours** of public Vietnamese speech data sourced from diverse datasets. A list of datasets can be found [**HERE**](dataset.tsv).
100
-
101
- **!!! Please note that only the \[train-subset\] was used for tuning the model.**
102
-
103
- ---
104
- <a name = "implementation" ></a>
105
- ## Documentation and Implementation
106
- The [Documentation]() and [Implementation](https://github.com/khanld/chunkformer) of ChunkFormer are publicly available.
107
-
108
- ---
109
- <a name = "benchmark" ></a>
110
- ## Benchmark Results
111
- We evaluate the models using **Word Error Rate (WER)**. To ensure consistency and fairness in comparison, we manually apply **Text Normalization**, including the handling of numbers, uppercase letters, and punctuation.
112
-
113
- 1. **Public Models**:
114
- | STT | Model | #Params | Vivos | Common Voice | VLSP - Task 1 | Avg. |
115
- |-----|------------------------------------------------------------------------|---------|-------|--------------|---------------|------|
116
- | 1 | **ChunkFormer** | 110M | 4.18 | 6.66 | 14.09 | **8.31** |
117
- | 2 | [vinai/PhoWhisper-large](https://huggingface.co/vinai/PhoWhisper-large) | 1.55B | 4.67 | 8.14 | 13.75 | 8.85 |
118
- | 3 | [nguyenvulebinh/wav2vec2-base-vietnamese-250h](https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h) | 95M | 10.77 | 18.34 | 13.33 | 14.15 |
119
- | 4 | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 1.55B | 8.81 | 15.45 | 20.41 | 14.89 |
120
- | 5 | [khanhld/wav2vec2-base-vietnamese-160h](https://huggingface.co/khanhld/wav2vec2-base-vietnamese-160h) | 95M | 15.05 | 10.78 | 31.62 | 19.16 |
121
- | 6 | [homebrewltd/Ichigo-whisper-v0.1](https://huggingface.co/homebrewltd/Ichigo-whisper-v0.1) | 22M | 13.46 | 23.52 | 21.64 | 19.54 |
122
-
123
- 2. **Private Models (API)**:
124
- | STT | Model | VLSP - Task 1 |
125
- |-----|--------|---------------|
126
- | 1 | **ChunkFormer** | **14.1** |
127
- | 2 | Viettel | 14.5 |
128
- | 3 | Google | 19.5 |
129
- | 4 | FPT | 28.8 |
130
-
131
- ---
132
- <a name = "usage" ></a>
133
- ## Quick Usage
134
- To use the ChunkFormer model for Vietnamese Automatic Speech Recognition, follow these steps:
135
-
136
- 1. **Download the ChunkFormer Repository**
137
- ```bash
138
- git clone https://github.com/khanld/chunkformer.git
139
- cd chunkformer
140
- pip install -r requirements.txt
141
- ```
142
- 2. **Download the Model Checkpoint from Hugging Face**
143
- ```bash
144
- pip install huggingface_hub
145
- huggingface-cli download khanhld/chunkformer-large-vie --local-dir "./chunkformer-large-vie"
146
- ```
147
- or
148
- ```bash
149
- git lfs install
150
- git clone https://huggingface.co/khanhld/chunkformer-large-vie
151
- ```
152
- This will download the model checkpoint to the checkpoints folder inside your chunkformer directory.
153
-
154
- 3. **Run the model**
155
- ```bash
156
- python decode.py \
157
- --model_checkpoint path/to/local/chunkformer-large-vie \
158
- --long_form_audio path/to/audio.wav \
159
- --total_batch_duration 14400 \ #in second, default is 1800
160
- --chunk_size 64 \
161
- --left_context_size 128 \
162
- --right_context_size 128
163
- ```
164
- Example Output:
165
- ```
166
- [00:00:01.200] - [00:00:02.400]: this is a transcription example
167
- [00:00:02.500] - [00:00:03.700]: testing the long-form audio
168
- ```
169
- **Advanced Usage** can be found [HERE](https://github.com/khanld/chunkformer/tree/main?tab=readme-ov-file#usage)
170
-
171
- ---
172
- <a name = "citation" ></a>
173
- ## Citation
174
- If you use this work in your research, please cite:
175
-
176
- ```bibtex
177
- @INPROCEEDINGS{10888640,
178
- author={Le, Khanh and Ho, Tuan Vu and Tran, Dung and Chau, Duc Thanh},
179
- booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
180
- title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription},
181
- year={2025},
182
- volume={},
183
- number={},
184
- pages={1-5},
185
- keywords={Scalability;Memory management;Graphics processing units;Signal processing;Performance gain;Hardware;Resource management;Speech processing;Standards;Context modeling;chunkformer;masked batch;long-form transcription},
186
- doi={10.1109/ICASSP49660.2025.10888640}}
187
- }
188
- ```
189
-
190
- ---
191
- <a name = "contact"></a>
192
- ## Contact
193
194
- - [![GitHub](https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white)](https://github.com/khanld)
195
  - [![LinkedIn](https://img.shields.io/badge/linkedin-%230077B5.svg?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/khanhld257/)
 
1
+ ---
2
+ language: vie
3
+ datasets:
4
+ - legacy-datasets/common_voice
5
+ - vlsp2020_vinai_100h
6
+ - AILAB-VNUHCM/vivos
7
+ - doof-ferb/vlsp2020_vinai_100h
8
+ - doof-ferb/fpt_fosd
9
+ - doof-ferb/infore1_25hours
10
+ - linhtran92/viet_bud500
11
+ - doof-ferb/LSVSC
12
+ - doof-ferb/vais1000
13
+ - doof-ferb/VietMed_labeled
14
+ - NhutP/VSV-1100
15
+ - doof-ferb/Speech-MASSIVE_vie
16
+ - doof-ferb/BibleMMS_vie
17
+ - capleaf/viVoice
18
+ metrics:
19
+ - wer
20
+ pipeline_tag: automatic-speech-recognition
21
+ tags:
22
+ - transcription
23
+ - audio
24
+ - speech
25
+ - chunkformer
26
+ - asr
27
+ - automatic-speech-recognition
28
+ license: cc-by-nc-4.0
29
+ model-index:
30
+ - name: ChunkFormer Large Vietnamese
31
+ results:
32
+ - task:
33
+ name: Speech Recognition
34
+ type: automatic-speech-recognition
35
+ dataset:
36
+ name: common-voice-vietnamese
37
+ type: common_voice
38
+ args: vi
39
+ metrics:
40
+ - name: Test WER
41
+ type: wer
42
+ value: 6.66
43
+ source:
44
+ name: Common Voice Vi Leaderboard
45
+ url: https://paperswithcode.com/sota/speech-recognition-on-common-voice-vi
46
+ - task:
47
+ name: Speech Recognition
48
+ type: automatic-speech-recognition
49
+ dataset:
50
+ name: VIVOS
51
+ type: vivos
52
+ args: vi
53
+ metrics:
54
+ - name: Test WER
55
+ type: wer
56
+ value: 4.18
57
+ source:
58
+ name: Vivos Leaderboard
59
+ url: https://paperswithcode.com/sota/speech-recognition-on-vivos
60
+ - task:
61
+ name: Speech Recognition
62
+ type: automatic-speech-recognition
63
+ dataset:
64
+ name: VLSP - Task 1
65
+ type: vlsp
66
+ args: vi
67
+ metrics:
68
+ - name: Test WER
69
+ type: wer
70
+ value: 14.09
71
+ ---
72
+
73
+ # **ChunkFormer-Large-Vie: Large-Scale Pretrained ChunkFormer for Vietnamese Automatic Speech Recognition**
74
+ <style>
75
+ img {
76
+ display: inline;
77
+ }
78
+ </style>
79
+ [![Ranked #1: Speech Recognition on Common Voice Vi](https://img.shields.io/badge/Ranked%20%231%3A%20Speech%20Recognition%20on%20Common%20Voice%20Vi-%F0%9F%8F%86%20SOTA-blueviolet?style=for-the-badge&logo=paperswithcode&logoColor=white)](https://paperswithcode.com/sota/speech-recognition-on-common-voice-vi)
80
+ [![Ranked #1: Speech Recognition on VIVOS](https://img.shields.io/badge/Ranked%20%231%3A%20Speech%20Recognition%20on%20VIVOS-%F0%9F%8F%86%20SOTA-blueviolet?style=for-the-badge&logo=paperswithcode&logoColor=white)](https://paperswithcode.com/sota/speech-recognition-on-vivos)
81
+
82
+ [![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)
83
+ [![GitHub](https://img.shields.io/badge/GitHub-ChunkFormer-blue)](https://github.com/khanld/chunkformer)
84
+ [![Paper](https://img.shields.io/badge/Paper-ICASSP%202025-green)](https://arxiv.org/abs/2502.14673)
85
+ [![Model size](https://img.shields.io/badge/Params-110M-lightgrey#model-badge)](#description)
86
+
87
+ ---
88
+
89
+ <a name = "citation" ></a>
90
+ ## Citation
91
+ If you use this work in your research, please cite:
92
+
93
+ ```bibtex
94
+ @INPROCEEDINGS{10888640,
95
+ author={Le, Khanh and Ho, Tuan Vu and Tran, Dung and Chau, Duc Thanh},
96
+ booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
97
+ title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription},
98
+ year={2025},
99
+ volume={},
100
+ number={},
101
+ pages={1-5},
102
+ keywords={Scalability;Memory management;Graphics processing units;Signal processing;Performance gain;Hardware;Resource management;Speech processing;Standards;Context modeling;chunkformer;masked batch;long-form transcription},
103
+ doi={10.1109/ICASSP49660.2025.10888640}}
104
+ }
105
+ ```
106
+
107
+ ---
108
+ <a name = "contact"></a>
109
+ ## Contact
110
111
+ - [![GitHub](https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white)](https://github.com/khanld)
112
  - [![LinkedIn](https://img.shields.io/badge/linkedin-%230077B5.svg?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/khanhld257/)