ChunkFormer-Large-En-Libri-960h: ChunkFormer-Large pretrained on 960 hours of the LibriSpeech dataset
Model Description
ChunkFormer-Large-En-Libri-960h is an English Automatic Speech Recognition (ASR) model based on the ChunkFormer architecture, introduced at ICASSP 2025. The model has been fine-tuned on 960 hours of LibriSpeech, a widely used dataset for ASR research.
Documentation and Implementation
The documentation and implementation of ChunkFormer are publicly available.
Benchmark Results
We evaluate the models using Word Error Rate (WER, %; lower is better). To ensure a fair comparison, all models are trained exclusively with the WeNet framework.
| No. | Model | Test-Clean | Test-Other | Avg. |
|---|---|---|---|---|
| 1 | ChunkFormer | 2.69 | 6.91 | 4.80 |
| 2 | Efficient Conformer | 2.71 | 6.95 | 4.83 |
| 3 | Conformer | 2.77 | 6.93 | 4.85 |
| 4 | Squeezeformer | 2.87 | 7.16 | 5.02 |
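For reference, WER counts word-level substitutions, deletions, and insertions against the reference transcript, normalized by the number of reference words. Below is a minimal sketch of the metric using the open-source jiwer package; it is an illustration only, not the benchmarking setup used above.

```python
# Minimal WER illustration using jiwer (assumed installed via `pip install jiwer`).
import jiwer

reference = "this is a transcription example"
hypothesis = "this is transcription example"

# WER = (substitutions + deletions + insertions) / number of reference words
print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")  # 1 deletion over 5 words -> 0.200
```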
Quick Usage
To use the ChunkFormer model for English Automatic Speech Recognition, follow these steps:
Option 1: Install from PyPI (Recommended)
```bash
pip install chunkformer
```
Option 2: Install from source
```bash
git clone https://github.com/khanld/chunkformer.git
cd chunkformer
pip install -e .
```
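Either way, you can sanity-check the installation with a quick import. A minimal check; it only verifies that the package and its model class are importable:

```python
# Minimal installation check: verify the package and its model class can be imported.
from chunkformer import ChunkFormerModel

print("chunkformer installed:", ChunkFormerModel is not None)
```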
Python API Usage
```python
from chunkformer import ChunkFormerModel

# Load the English model from Hugging Face
model = ChunkFormerModel.from_pretrained("khanhld/chunkformer-large-en-libri-960h")

# For single long-form audio transcription
transcription = model.endless_decode(
    audio_path="path/to/long_audio.wav",
    chunk_size=64,
    left_context_size=128,
    right_context_size=128,
    total_batch_duration=14400,  # in seconds
    return_timestamps=True
)
print(transcription)

# For batch processing of multiple audio files
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
transcriptions = model.batch_decode(
    audio_paths=audio_files,
    chunk_size=64,
    left_context_size=128,
    right_context_size=128,
    total_batch_duration=1800  # Total batch duration in seconds
)

for i, transcription in enumerate(transcriptions):
    print(f"Audio {i+1}: {transcription}")
```
Command Line Usage
After installation, you can use the command line interface:
```bash
chunkformer-decode \
    --model_checkpoint khanhld/chunkformer-large-en-libri-960h \
    --long_form_audio path/to/audio.wav \
    --total_batch_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128
```
Example Output:
```text
[00:00:01.200] - [00:00:02.400]: this is a transcription example
[00:00:02.500] - [00:00:03.700]: testing the long-form audio
```
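If you need the segments programmatically, the timestamped lines above can be parsed with a little Python. A minimal sketch, assuming the output format shown in the example (bracketed HH:MM:SS.mmm ranges followed by the text):

```python
import re

# Parse lines like "[00:00:01.200] - [00:00:02.400]: this is a transcription example"
# into (start_seconds, end_seconds, text) tuples. Format assumed from the example above.
LINE_RE = re.compile(r"\[(\d+):(\d+):(\d+\.\d+)\] - \[(\d+):(\d+):(\d+\.\d+)\]: (.*)")

def to_seconds(h: str, m: str, s: str) -> float:
    return int(h) * 3600 + int(m) * 60 + float(s)

def parse_segments(output: str):
    segments = []
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            h1, m1, s1, h2, m2, s2, text = match.groups()
            segments.append((to_seconds(h1, m1, s1), to_seconds(h2, m2, s2), text))
    return segments

example = "[00:00:01.200] - [00:00:02.400]: this is a transcription example"
print(parse_segments(example))  # [(1.2, 2.4, 'this is a transcription example')]
```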
Advanced Usage can be found HERE
Citation
If you use this work in your research, please cite:
```bibtex
@INPROCEEDINGS{10888640,
  author={Le, Khanh and Ho, Tuan Vu and Tran, Dung and Chau, Duc Thanh},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription},
  year={2025},
  volume={},
  number={},
  pages={1-5},
  keywords={Scalability;Memory management;Graphics processing units;Signal processing;Performance gain;Hardware;Resource management;Speech processing;Standards;Context modeling;chunkformer;masked batch;long-form transcription},
  doi={10.1109/ICASSP49660.2025.10888640}
}
```
Contact