Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition Paper

Efficient Conformer Encoder

Inspired by previous work in Automatic Speech Recognition and Computer Vision, the Efficient Conformer encoder is composed of three encoder stages, each comprising a number of Conformer blocks that use grouped attention. The encoded sequence is progressively downsampled and projected to wider feature dimensions, lowering the amount of computation while achieving better performance. Grouped multi-head attention reduces attention complexity by grouping neighbouring time elements along the feature dimension before applying scaled dot-product attention.
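To make the grouping step concrete, here is a minimal pure-Python sketch of how neighbouring time frames can be concatenated along the feature dimension, and how that changes the cost of the attention score matrix. The function names `group_frames` and `attention_score_cost`, the zero-padding of the tail, and the toy sizes are assumptions made for illustration; this is not the repository's implementation.

```python
def group_frames(frames, group_size):
    """Concatenate each run of `group_size` neighbouring frames along the feature axis.

    `frames` is a list of n feature vectors (lists of length d). The result has
    ceil(n / group_size) vectors of length group_size * d.
    Assumption: a short tail is padded with all-zero frames.
    """
    if group_size < 1:
        raise ValueError("group_size must be >= 1")
    if not frames:
        return []
    dim = len(frames[0])
    padded = list(frames)
    remainder = len(frames) % group_size
    if remainder:
        padded += [[0.0] * dim for _ in range(group_size - remainder)]
    grouped = []
    for start in range(0, len(padded), group_size):
        merged = []
        for frame in padded[start:start + group_size]:
            merged.extend(frame)  # stack neighbours side by side in the feature axis
        grouped.append(merged)
    return grouped


def attention_score_cost(seq_len, feature_dim):
    """Multiply-adds for the seq_len x seq_len score matrix Q K^T."""
    return seq_len * seq_len * feature_dim


frames = [[float(t), float(t) + 0.5] for t in range(8)]  # n = 8 frames, d = 2
grouped = group_frames(frames, group_size=2)             # 4 frames, 4 features each

full_cost = attention_score_cost(len(frames), len(frames[0]))
grouped_cost = attention_score_cost(len(grouped), len(grouped[0]))
```

With a group size g, the sequence length shrinks to n / g while the feature size grows to g * d, so the score matrix costs (n / g)^2 * (g * d) = n^2 * d / g multiply-adds. In the toy example above that is 64 instead of 128.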

Installation

Clone the GitHub repository and set up the environment

git clone https://github.com/nguyenthienhy/EfficientConformerVietnamese.git
cd EfficientConformerVietnamese
pip install -r requirements.txt
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
pip install protobuf==4.25

Install ctcdecode

Prepare dataset and training pipeline

Datasets used to train this mini version:

Vivos
Vietbud_500
VLSP2020, VLSP2021, VLSP2022
VietMed_labeled
Google Fleurs

Steps:

Prepare a dataset folder that contains one subfolder per data domain you want to train on, for example: ASRDataset/VLSP2020, ASRDataset/VLSP2021. Inside each domain folder (such as VLSP2020), every .wav file should have a corresponding .txt transcript file.
Add noise to the audio using add_noise.py.
Change the speaking speed using speed_permutation.py.
Extract audio length and BPE tokens using prepare_dataset.py.
Filter audio by the maximum length specified, using filter_max_length.py, and save the list of audio files used for training in a .txt file, for example: data/train_wav_names.txt.
Train the model using train.py (please read the parameters carefully).
Prepare an lm_corpus.txt file and train an n-gram BPE language model on it using train_lm.py.
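The folder layout and the length-filtering step described above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the code in filter_max_length.py: the helper names `wav_duration_seconds`, `collect_training_files`, and `write_file_list` are hypothetical, the audio is assumed to be uncompressed PCM .wav, and each transcript is assumed to sit next to its audio file with the same base name.

```python
import wave
from pathlib import Path


def wav_duration_seconds(wav_path):
    """Return the duration of a PCM .wav file in seconds."""
    with wave.open(str(wav_path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def collect_training_files(dataset_root, max_seconds):
    """List .wav files that have a matching .txt transcript and are short enough."""
    kept = []
    for wav_path in sorted(Path(dataset_root).rglob("*.wav")):
        if not wav_path.with_suffix(".txt").is_file():
            continue  # no transcript next to the audio, so skip it
        if wav_duration_seconds(wav_path) <= max_seconds:
            kept.append(wav_path)
    return kept


def write_file_list(paths, list_path):
    """Write one audio path per line, e.g. to data/train_wav_names.txt."""
    list_path = Path(list_path)
    list_path.parent.mkdir(parents=True, exist_ok=True)
    list_path.write_text("".join(f"{p}\n" for p in paths), encoding="utf-8")
```

A call such as `write_file_list(collect_training_files("ASRDataset", max_seconds=15.0), "data/train_wav_names.txt")` would then produce the training list; the 15-second limit is a placeholder, not a value taken from the repository.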

Evaluation

Please read the code in test.py carefully!

bash test.sh

Monitor training

tensorboard --logdir callback_path

Vietnamese Performance

Model	Gigaspeech_test (Greedy / n-gram Beam Search)	VLSP2023_pb_test (Greedy / n-gram Beam Search)	VLSP2023_pr_test (Greedy / n-gram Beam Search)
EC-Small-CTC	19.61 / 17.47	23.06 / 20.83	23.17 / 21.15
PhoWhisper-Tiny	20.45	33.21	33.02
PhoWhisper-Base	18.78	29.25	28.29
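The first number in each EC-Small-CTC cell comes from greedy decoding. As a reminder of what that means for a CTC model, here is a minimal sketch of the standard greedy CTC rule: take the best token per frame, merge consecutive repeats, then drop blanks. The function name `ctc_greedy_decode`, the blank index 0, and the toy scores are assumptions for illustration; the repository's test.py may differ in its details.

```python
def ctc_greedy_decode(frame_scores, blank_id=0):
    """Greedy CTC decoding over per-frame score vectors.

    Assumption: index `blank_id` is the CTC blank token.
    """
    decoded = []
    previous = None
    for scores in frame_scores:
        # Index of the highest-scoring token for this frame.
        best = max(range(len(scores)), key=scores.__getitem__)
        # Emit only when the token changes and is not the blank.
        if best != previous and best != blank_id:
            decoded.append(best)
        previous = best
    return decoded


# Toy scores over a 3-token vocabulary (0 = blank); per-frame argmax is 1,1,0,1,2,2,0.
scores = [
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.90, 0.05, 0.05],
    [0.10, 0.60, 0.30],
    [0.10, 0.20, 0.70],
    [0.10, 0.30, 0.60],
    [0.80, 0.10, 0.10],
]
token_ids = ctc_greedy_decode(scores)  # [1, 1, 2]
```

The blank frame between the two runs of token 1 is what keeps them as two separate output tokens. Beam search with an n-gram language model instead keeps several candidate prefixes per frame and rescores them, which is where the second number in each cell comes from.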

In the competition organized by VLSP, I used the Efficient Conformer Large architecture with approximately 127 million parameters. You can find the detailed results in the technical report below: https://www.overleaf.com/read/nhqjtcpktjyc#3b472e

Reference

Maxime Burchi, Valentin Vielzeuf. Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition.

Maxime Burchi @burchim