# Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition

## Efficient Conformer Encoder

Inspired by previous work in Automatic Speech Recognition and Computer Vision, the Efficient Conformer encoder is composed of three encoder stages, each comprising a number of Conformer blocks that use grouped attention. The encoded sequence is progressively downsampled and projected to wider feature dimensions, lowering the amount of computation while achieving better performance. Grouped multi-head attention reduces attention complexity by grouping neighbouring time elements along the feature dimension before applying scaled dot-product attention.
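The grouping idea can be sketched in PyTorch as follows. This is a minimal illustration, not the repository's implementation: the class name, the use of `nn.MultiheadAttention`, and the dimensions are all assumptions made for clarity.

```python
import torch
import torch.nn as nn

class GroupedMultiHeadAttention(nn.Module):
    """Sketch of grouped multi-head self-attention: neighbouring time steps
    are concatenated along the feature dimension (group size g) before
    scaled dot-product attention, shrinking the T x T attention map to
    (T/g) x (T/g)."""

    def __init__(self, d_model: int, num_heads: int, group_size: int):
        super().__init__()
        self.g = group_size
        # attention operates on the widened (D * g) features
        self.attn = nn.MultiheadAttention(d_model * group_size, num_heads,
                                          batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D), with T divisible by the group size g
        B, T, D = x.shape
        xg = x.reshape(B, T // self.g, D * self.g)  # group neighbouring frames
        out, _ = self.attn(xg, xg, xg)              # attention over T/g steps
        return out.reshape(B, T, D)                 # ungroup back to (B, T, D)
```

With group size g, the quadratic attention cost drops by roughly a factor of g², which is why the earlier (longer) encoder stages benefit most from grouping.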

## Installation

Clone the GitHub repository and set up the environment:

```bash
git clone https://github.com/nguyenthienhy/EfficientConformerVietnamese.git
cd EfficientConformerVietnamese
pip install -r requirements.txt
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
pip install protobuf==4.25
```
Then install ctcdecode.
## Prepare dataset and training pipeline

Datasets used to train this mini version:
- Vivos
- Vietbud_500
- VLSP2020, VLSP2021, VLSP2022
- VietMed_labeled
- Google Fleurs
Steps:
- Prepare a dataset folder that includes the data domains you want to train on, for example: ASRDataset/VLSP2020, ASRDataset/VLSP2021. Inside each VLSP2020 folder, there should be corresponding .wav and .txt files.
- Add noise to the audio using add_noise.py.
- Change the speaking speed using speed_permutation.py.
- Extract audio length and BPE tokens using prepare_dataset.py.
- Filter audio by the specified maximum length using filter_max_length.py, and save the list of audio files used for training in a .txt file, for example: data/train_wav_names.txt.
- Train the model using train.py (please read the parameters carefully).
- Prepare an lm_corpus.txt file and train an n-gram BPE language model using train_lm.py.
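As an illustration of the length-filtering step above, here is a minimal sketch. The function name and arguments are hypothetical and do not reproduce the actual filter_max_length.py; it only shows the idea of keeping files under a duration cap and writing their paths to a list file.

```python
import glob
import os
import wave

def filter_max_length(dataset_dir: str, max_seconds: float, out_list: str) -> list:
    """Keep .wav files no longer than max_seconds and save their paths
    to a text file (one path per line). Illustrative sketch only."""
    kept = []
    for path in sorted(glob.glob(os.path.join(dataset_dir, "**", "*.wav"),
                                 recursive=True)):
        with wave.open(path, "rb") as f:
            duration = f.getnframes() / f.getframerate()  # seconds
        if duration <= max_seconds:
            kept.append(path)
    with open(out_list, "w", encoding="utf-8") as f:
        f.write("\n".join(kept))
    return kept
```

Filtering out overly long utterances before training keeps batch padding (and therefore GPU memory use) bounded.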
## Evaluation

Please read test.py carefully before running:

```bash
bash test.sh
```
## Monitor training

```bash
tensorboard --logdir callback_path
```

## Vietnamese Performance

| Model | Gigaspeech_test (Greedy / n-gram Beam Search) | VLSP2023_pb_test (Greedy / n-gram Beam Search) | VLSP2023_pr_test (Greedy / n-gram Beam Search) |
|---|---|---|---|
| EC-Small-CTC | 19.61 / 17.47 | 23.06 / 20.83 | 23.17 / 21.15 |
For the competition organized by VLSP, I used the Efficient Conformer Large architecture with approximately 127 million parameters. Detailed results are in the technical report: https://www.overleaf.com/read/nhqjtcpktjyc#3b472e
## Reference

- Maxime Burchi @burchim