songlindotiot/chunkformer-large-vie

@@ -1,195 +1,112 @@
----
-language: vie
-datasets:
-- legacy-datasets/common_voice
-- vlsp2020_vinai_100h
-- AILAB-VNUHCM/vivos
-- doof-ferb/vlsp2020_vinai_100h
-- doof-ferb/fpt_fosd
-- doof-ferb/infore1_25hours
-- linhtran92/viet_bud500
-- doof-ferb/LSVSC
-- doof-ferb/vais1000
-- doof-ferb/VietMed_labeled
-- NhutP/VSV-1100
-- doof-ferb/Speech-MASSIVE_vie
-- doof-ferb/BibleMMS_vie
-- capleaf/viVoice
-metrics:
-- wer
-pipeline_tag: automatic-speech-recognition
-tags:
-- transcription
-- audio
-- speech
-- chunkformer
-- asr
-- automatic-speech-recognition
-license: cc-by-nc-4.0
-model-index:
-- name: ChunkFormer Large Vietnamese
-  results:
-  - task:
-      name: Speech Recognition
-      type: automatic-speech-recognition
-    dataset:
-      name: common-voice-vietnamese
-      type: common_voice
-      args: vi
-    metrics:
-    - name: Test WER
-      type: wer
-      value: 6.66
-    source:
-      name: Common Voice Vi Leaderboard
-      url: https://paperswithcode.com/sota/speech-recognition-on-common-voice-vi
-  - task:
-      name: Speech Recognition
-      type: automatic-speech-recognition
-    dataset:
-      name: VIVOS
-      type: vivos
-      args: vi
-    metrics:
-    - name: Test WER
-      type: wer
-      value: 4.18
-    source:
-      name: Vivos Leaderboard
-      url: https://paperswithcode.com/sota/speech-recognition-on-vivos
-  - task:
-      name: Speech Recognition
-      type: automatic-speech-recognition
-    dataset:
-      name: VLSP - Task 1
-      type: vlsp
-      args: vi
-    metrics:
-    - name: Test WER
-      type: wer
-      value: 14.09
----
-# **ChunkFormer-Large-Vie: Large-Scale Pretrained ChunkFormer for Vietnamese Automatic Speech Recognition**
-<style>
-img {
- display: inline;
-}
-</style>
-[![Ranked #1: Speech Recognition on Common Voice Vi](https://img.shields.io/badge/Ranked%20%231%3A%20Speech%20Recognition%20on%20Common%20Voice%20Vi-%F0%9F%8F%86%20SOTA-blueviolet?style=for-the-badge&logo=paperswithcode&logoColor=white)](https://paperswithcode.com/sota/speech-recognition-on-common-voice-vi)
-[![Ranked #1: Speech Recognition on VIVOS](https://img.shields.io/badge/Ranked%20%231%3A%20Speech%20Recognition%20on%20VIVOS-%F0%9F%8F%86%20SOTA-blueviolet?style=for-the-badge&logo=paperswithcode&logoColor=white)](https://paperswithcode.com/sota/speech-recognition-on-vivos)
-[![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)
-[![GitHub](https://img.shields.io/badge/GitHub-ChunkFormer-blue)](https://github.com/khanld/chunkformer)
-[![Paper](https://img.shields.io/badge/Paper-ICASSP%202025-green)](https://arxiv.org/abs/2502.14673)
-[![Model size](https://img.shields.io/badge/Params-110M-lightgrey#model-badge)](#description)
----
-## Table of contents
-1. [Model Description](#description)
-2. [Documentation and Implementation](#implementation)
-3. [Benchmark Results](#benchmark)
-4. [Usage](#usage)
-6. [Citation](#citation)
-7. [Contact](#contact)
----
-<a name = "description" ></a>
-## Model Description
-**ChunkFormer-Large-Vie** is a large-scale Vietnamese Automatic Speech Recognition (ASR) model based on the **ChunkFormer** architecture, introduced at **ICASSP 2025**. The model has been fine-tuned on approximately **3000 hours** of public Vietnamese speech data sourced from diverse datasets. A list of datasets can be found [**HERE**](dataset.tsv).
-**!!! Please note that only the \[train-subset\] was used for tuning the model.**
----
-<a name = "implementation" ></a>
-## Documentation and Implementation
-The [Documentation]() and [Implementation](https://github.com/khanld/chunkformer) of ChunkFormer are publicly available.
----
-<a name = "benchmark" ></a>
-## Benchmark Results
-We evaluate the models using **Word Error Rate (WER)**. To ensure consistency and fairness in comparison, we manually apply **Text Normalization**, including the handling of numbers, uppercase letters, and punctuation.
-1. **Public Models**:
-| STT | Model                                                                  | #Params | Vivos | Common Voice | VLSP - Task 1 | Avg. |
-|-----|------------------------------------------------------------------------|---------|-------|--------------|---------------|------|
-| 1   | **ChunkFormer**                                                            | 110M    | 4.18   | 6.66           | 14.09             | **8.31**    |
-| 2   | [vinai/PhoWhisper-large](https://huggingface.co/vinai/PhoWhisper-large)  | 1.55B   | 4.67  | 8.14         | 13.75         | 8.85 |
-| 3   | [nguyenvulebinh/wav2vec2-base-vietnamese-250h](https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h) | 95M     | 10.77 | 18.34        | 13.33         | 14.15 |
-| 4   | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 1.55B   | 8.81     | 15.45            | 20.41          | 14.89    |
-| 5   | [khanhld/wav2vec2-base-vietnamese-160h](https://huggingface.co/khanhld/wav2vec2-base-vietnamese-160h) | 95M     | 15.05 | 10.78        | 31.62             | 19.16    |
-| 6   | [homebrewltd/Ichigo-whisper-v0.1](https://huggingface.co/homebrewltd/Ichigo-whisper-v0.1) | 22M   | 13.46     | 23.52            | 21.64          | 19.54    |
-2. **Private Models (API)**:
-| STT | Model  | VLSP - Task 1 |
-|-----|--------|---------------|
-| 1   | **ChunkFormer** | **14.1**             |
-| 2   | Viettel     | 14.5          |
-| 3   | Google  | 19.5          |
-| 4   | FPT   | 28.8          |
----
-<a name = "usage" ></a>
-## Quick Usage
-To use the ChunkFormer model for Vietnamese Automatic Speech Recognition, follow these steps:
-1. **Download the ChunkFormer Repository**
-```bash
-git clone https://github.com/khanld/chunkformer.git
-cd chunkformer
-pip install -r requirements.txt
-```
-2. **Download the Model Checkpoint from Hugging Face**
-```bash
-pip install huggingface_hub
-huggingface-cli download khanhld/chunkformer-large-vie --local-dir "./chunkformer-large-vie"
-```
-or
-```bash
-git lfs install
-git clone https://huggingface.co/khanhld/chunkformer-large-vie
-```
-This will download the model checkpoint to the checkpoints folder inside your chunkformer directory.
-3. **Run the model**
-```bash
-python decode.py \
-    --model_checkpoint path/to/local/chunkformer-large-vie \
-    --long_form_audio path/to/audio.wav \
-    --total_batch_duration 14400 \ #in second, default is 1800
-    --chunk_size 64 \
-    --left_context_size 128 \
-    --right_context_size 128
-```
-Example Output:
-```
-[00:00:01.200] - [00:00:02.400]: this is a transcription example
-[00:00:02.500] - [00:00:03.700]: testing the long-form audio
-```
-**Advanced Usage** can be found [HERE](https://github.com/khanld/chunkformer/tree/main?tab=readme-ov-file#usage)
----
-<a name = "citation" ></a>
-## Citation
-If you use this work in your research, please cite:
-```bibtex
-@INPROCEEDINGS{10888640,
-  author={Le, Khanh and Ho, Tuan Vu and Tran, Dung and Chau, Duc Thanh},
-  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
-  title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription},
-  year={2025},
-  volume={},
-  number={},
-  pages={1-5},
-  keywords={Scalability;Memory management;Graphics processing units;Signal processing;Performance gain;Hardware;Resource management;Speech processing;Standards;Context modeling;chunkformer;masked batch;long-form transcription},
-  doi={10.1109/ICASSP49660.2025.10888640}}
-}
-```
----
-<a name = "contact"></a>
-## Contact
-- [email protected]
-- [![GitHub](https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white)](https://github.com/khanld)
 - [![LinkedIn](https://img.shields.io/badge/linkedin-%230077B5.svg?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/khanhld257/)

+---
+language: vie
+datasets:
+- legacy-datasets/common_voice
+- vlsp2020_vinai_100h
+- AILAB-VNUHCM/vivos
+- doof-ferb/vlsp2020_vinai_100h
+- doof-ferb/fpt_fosd
+- doof-ferb/infore1_25hours
+- linhtran92/viet_bud500
+- doof-ferb/LSVSC
+- doof-ferb/vais1000
+- doof-ferb/VietMed_labeled
+- NhutP/VSV-1100
+- doof-ferb/Speech-MASSIVE_vie
+- doof-ferb/BibleMMS_vie
+- capleaf/viVoice
+metrics:
+- wer
+pipeline_tag: automatic-speech-recognition
+tags:
+- transcription
+- audio
+- speech
+- chunkformer
+- asr
+- automatic-speech-recognition
+license: cc-by-nc-4.0
+model-index:
+- name: ChunkFormer Large Vietnamese
+  results:
+  - task:
+      name: Speech Recognition
+      type: automatic-speech-recognition
+    dataset:
+      name: common-voice-vietnamese
+      type: common_voice
+      args: vi
+    metrics:
+    - name: Test WER
+      type: wer
+      value: 6.66
+    source:
+      name: Common Voice Vi Leaderboard
+      url: https://paperswithcode.com/sota/speech-recognition-on-common-voice-vi
+  - task:
+      name: Speech Recognition
+      type: automatic-speech-recognition
+    dataset:
+      name: VIVOS
+      type: vivos
+      args: vi
+    metrics:
+    - name: Test WER
+      type: wer
+      value: 4.18
+    source:
+      name: Vivos Leaderboard
+      url: https://paperswithcode.com/sota/speech-recognition-on-vivos
+  - task:
+      name: Speech Recognition
+      type: automatic-speech-recognition
+    dataset:
+      name: VLSP - Task 1
+      type: vlsp
+      args: vi
+    metrics:
+    - name: Test WER
+      type: wer
+      value: 14.09
+---
+# **ChunkFormer-Large-Vie: Large-Scale Pretrained ChunkFormer for Vietnamese Automatic Speech Recognition**
+<style>
+img {
+ display: inline;
+}
+</style>
+[![Ranked #1: Speech Recognition on Common Voice Vi](https://img.shields.io/badge/Ranked%20%231%3A%20Speech%20Recognition%20on%20Common%20Voice%20Vi-%F0%9F%8F%86%20SOTA-blueviolet?style=for-the-badge&logo=paperswithcode&logoColor=white)](https://paperswithcode.com/sota/speech-recognition-on-common-voice-vi)
+[![Ranked #1: Speech Recognition on VIVOS](https://img.shields.io/badge/Ranked%20%231%3A%20Speech%20Recognition%20on%20VIVOS-%F0%9F%8F%86%20SOTA-blueviolet?style=for-the-badge&logo=paperswithcode&logoColor=white)](https://paperswithcode.com/sota/speech-recognition-on-vivos)
+[![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)
+[![GitHub](https://img.shields.io/badge/GitHub-ChunkFormer-blue)](https://github.com/khanld/chunkformer)
+[![Paper](https://img.shields.io/badge/Paper-ICASSP%202025-green)](https://arxiv.org/abs/2502.14673)
+[![Model size](https://img.shields.io/badge/Params-110M-lightgrey#model-badge)](#description)
+---
+<a name = "citation" ></a>
+## Citation
+If you use this work in your research, please cite:
+```bibtex
+@INPROCEEDINGS{10888640,
+  author={Le, Khanh and Ho, Tuan Vu and Tran, Dung and Chau, Duc Thanh},
+  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
+  title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription},
+  year={2025},
+  volume={},
+  number={},
+  pages={1-5},
+  keywords={Scalability;Memory management;Graphics processing units;Signal processing;Performance gain;Hardware;Resource management;Speech processing;Standards;Context modeling;chunkformer;masked batch;long-form transcription},
+  doi={10.1109/ICASSP49660.2025.10888640}}
+}
+```
+---
+<a name = "contact"></a>
+## Contact
+- [email protected]
+- [![GitHub](https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white)](https://github.com/khanld)
 - [![LinkedIn](https://img.shields.io/badge/linkedin-%230077B5.svg?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/khanhld257/)