---
language: vie
datasets:
- legacy-datasets/common_voice
- vlsp2020_vinai_100h
- AILAB-VNUHCM/vivos
- doof-ferb/vlsp2020_vinai_100h
- doof-ferb/fpt_fosd
- doof-ferb/infore1_25hours
- linhtran92/viet_bud500
- doof-ferb/LSVSC
- doof-ferb/vais1000
- doof-ferb/VietMed_labeled
- NhutP/VSV-1100
- doof-ferb/Speech-MASSIVE_vie
- doof-ferb/BibleMMS_vie
- capleaf/viVoice
metrics:
- wer
pipeline_tag: automatic-speech-recognition
tags:
- transcription
- audio
- speech
- chunkformer
- asr
- automatic-speech-recognition
license: cc-by-nc-4.0
model-index:
- name: ChunkFormer Large Vietnamese
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: common-voice-vietnamese
      type: common_voice
      args: vi
    metrics:
    - name: Test WER
      type: wer
      value: 6.66
    source:
      name: Common Voice Vi Leaderboard
      url: https://paperswithcode.com/sota/speech-recognition-on-common-voice-vi
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VIVOS
      type: vivos
      args: vi
    metrics:
    - name: Test WER
      type: wer
      value: 4.18
    source:
      name: Vivos Leaderboard
      url: https://paperswithcode.com/sota/speech-recognition-on-vivos
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VLSP - Task 1
      type: vlsp
      args: vi
    metrics:
    - name: Test WER
      type: wer
      value: 14.09
---

# **ChunkFormer-Large-Vie: Large-Scale Pretrained ChunkFormer for Vietnamese Automatic Speech Recognition**
<style>
img {
    display: inline;
}
</style>
[![Ranked #1: Speech Recognition on Common Voice Vi](https://img.shields.io/badge/Ranked%20%231%3A%20Speech%20Recognition%20on%20Common%20Voice%20Vi-%F0%9F%8F%86%20SOTA-blueviolet?style=for-the-badge&logo=paperswithcode&logoColor=white)](https://paperswithcode.com/sota/speech-recognition-on-common-voice-vi)
[![Ranked #1: Speech Recognition on VIVOS](https://img.shields.io/badge/Ranked%20%231%3A%20Speech%20Recognition%20on%20VIVOS-%F0%9F%8F%86%20SOTA-blueviolet?style=for-the-badge&logo=paperswithcode&logoColor=white)](https://paperswithcode.com/sota/speech-recognition-on-vivos)

[![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)
[![GitHub](https://img.shields.io/badge/GitHub-ChunkFormer-blue)](https://github.com/khanld/chunkformer)
[![Paper](https://img.shields.io/badge/Paper-ICASSP%202025-green)](https://arxiv.org/abs/2502.14673)
[![Model size](https://img.shields.io/badge/Params-110M-lightgrey#model-badge)](#description)

---
## Table of contents
1. [Model Description](#description)
2. [Documentation and Implementation](#implementation)
3. [Benchmark Results](#benchmark)
4. [Usage](#usage)
5. [Citation](#citation)
6. [Contact](#contact)

---
<a name="description"></a>
## Model Description
**ChunkFormer-Large-Vie** is a large-scale Vietnamese Automatic Speech Recognition (ASR) model based on the **ChunkFormer** architecture, introduced at **ICASSP 2025**. The model was fine-tuned on approximately **3000 hours** of public Vietnamese speech data sourced from diverse datasets; the full list can be found [**HERE**](dataset.tsv).

**Please note that only the train subsets of these datasets were used for fine-tuning.**

---
<a name="implementation"></a>
## Documentation and Implementation
The documentation and implementation of ChunkFormer are publicly available [on GitHub](https://github.com/khanld/chunkformer).

---
<a name="benchmark"></a>
## Benchmark Results
We evaluate the models using **Word Error Rate (WER)**; all figures below are WER in percent, so lower is better. To ensure a consistent and fair comparison, we manually apply **text normalization**, including the handling of numbers, uppercase letters, and punctuation. A minimal sketch of this normalize-then-score evaluation follows the tables below.

1. **Public Models**:

| No. | Model | #Params | Vivos | Common Voice | VLSP - Task 1 | Avg. |
|-----|-------|---------|-------|--------------|---------------|------|
| 1 | **ChunkFormer** | 110M | 4.18 | 6.66 | 14.09 | **8.31** |
| 2 | [vinai/PhoWhisper-large](https://huggingface.co/vinai/PhoWhisper-large) | 1.55B | 4.67 | 8.14 | 13.75 | 8.85 |
| 3 | [nguyenvulebinh/wav2vec2-base-vietnamese-250h](https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h) | 95M | 10.77 | 18.34 | 13.33 | 14.15 |
| 4 | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 1.55B | 8.81 | 15.45 | 20.41 | 14.89 |
| 5 | [khanhld/wav2vec2-base-vietnamese-160h](https://huggingface.co/khanhld/wav2vec2-base-vietnamese-160h) | 95M | 15.05 | 10.78 | 31.62 | 19.16 |
| 6 | [homebrewltd/Ichigo-whisper-v0.1](https://huggingface.co/homebrewltd/Ichigo-whisper-v0.1) | 22M | 13.46 | 23.52 | 21.64 | 19.54 |

2. **Private Models (API)**:

| No. | Model | VLSP - Task 1 |
|-----|-------|---------------|
| 1 | **ChunkFormer** | **14.1** |
| 2 | Viettel | 14.5 |
| 3 | Google | 19.5 |
| 4 | FPT | 28.8 |

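This kind of evaluation can be reproduced with any WER toolkit. Below is a minimal sketch assuming the `jiwer` package; the normalization shown (lowercasing, stripping punctuation, collapsing whitespace) is illustrative only and does not reproduce the exact rules behind the numbers above.

```python
# Illustrative normalize-then-score WER evaluation (pip install jiwer).
import re
import jiwer

def normalize(text: str) -> str:
    text = text.lower()                       # fold casing
    text = re.sub(r"[^\w\s]", " ", text)      # strip punctuation, keep diacritics
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

reference = "Xin chào, Việt Nam!"
hypothesis = "xin chào việt nam"
print(jiwer.wer(normalize(reference), normalize(hypothesis)))  # -> 0.0
```
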
---
<a name="usage"></a>
## Quick Usage
To use the ChunkFormer model for Vietnamese Automatic Speech Recognition, follow these steps:

1. **Download the ChunkFormer Repository**
```bash
git clone https://github.com/khanld/chunkformer.git
cd chunkformer
pip install -r requirements.txt
```
2. **Download the Model Checkpoint from Hugging Face**
```bash
pip install huggingface_hub
huggingface-cli download khanhld/chunkformer-large-vie --local-dir "./chunkformer-large-vie"
```
or
```bash
git lfs install
git clone https://huggingface.co/khanhld/chunkformer-large-vie
```
Either command downloads the model checkpoint into a `chunkformer-large-vie` folder inside your `chunkformer` directory.

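If you prefer to stay in Python, the same checkpoint can be fetched with `huggingface_hub`'s `snapshot_download`; the target directory below simply mirrors the CLI example and is otherwise arbitrary.

```python
# Python equivalent of the huggingface-cli download above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="khanhld/chunkformer-large-vie",
    local_dir="./chunkformer-large-vie",  # same target as the CLI example
)
```
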
3. **Run the model**
```bash
# --total_batch_duration is in seconds (14400 s = 4 h of audio per batch);
# the default is 1800.
python decode.py \
    --model_checkpoint path/to/local/chunkformer-large-vie \
    --long_form_audio path/to/audio.wav \
    --total_batch_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128
```
Example Output:
```
[00:00:01.200] - [00:00:02.400]: this is a transcription example
[00:00:02.500] - [00:00:03.700]: testing the long-form audio
```
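If you need the segments programmatically, the timestamped lines are straightforward to parse. A small sketch follows; the `transcript.txt` filename is hypothetical and assumes you saved the output to a file.

```python
# Parse "[HH:MM:SS.mmm] - [HH:MM:SS.mmm]: text" lines into tuples.
import re

PATTERN = re.compile(
    r"\[(\d{2}:\d{2}:\d{2}\.\d{3})\] - \[(\d{2}:\d{2}:\d{2}\.\d{3})\]: (.+)"
)

def parse_segments(lines):
    """Yield (start, end, text) for each line matching the format above."""
    for line in lines:
        match = PATTERN.match(line.strip())
        if match:
            yield match.group(1), match.group(2), match.group(3)

with open("transcript.txt", encoding="utf-8") as f:  # hypothetical file
    for start, end, text in parse_segments(f):
        print(start, end, text)
```
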
**Advanced Usage** can be found [HERE](https://github.com/khanld/chunkformer/tree/main?tab=readme-ov-file#usage).

---
<a name="citation"></a>
## Citation
If you use this work in your research, please cite:

```bibtex
@INPROCEEDINGS{10888640,
  author={Le, Khanh and Ho, Tuan Vu and Tran, Dung and Chau, Duc Thanh},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription},
  year={2025},
  volume={},
  number={},
  pages={1-5},
  keywords={Scalability;Memory management;Graphics processing units;Signal processing;Performance gain;Hardware;Resource management;Speech processing;Standards;Context modeling;chunkformer;masked batch;long-form transcription},
  doi={10.1109/ICASSP49660.2025.10888640}
}
```

---
<a name="contact"></a>
## Contact
- [![GitHub](https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white)](https://github.com/khanld)
- [![LinkedIn](https://img.shields.io/badge/linkedin-%230077B5.svg?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/khanhld257/)