AVSRCocktail: Audio-Visual Speech Recognition for Cocktail Party Scenarios
Official implementation of "Cocktail-Party Audio-Visual Speech Recognition" (Interspeech 2025).
A robust audio-visual speech recognition system designed for multi-speaker environments and noisy cocktail party scenarios. The model combines lip reading and audio processing to achieve superior performance in challenging acoustic conditions with background noise and speaker interference.
Getting Started
Sections
1. Installation
Follow these steps:
# Clone the baseline code repo
git clone https://github.com/nguyenvulebinh/AVSRCocktail.git
cd AVSRCocktail
# Create Conda environment
conda create --name AVSRCocktail python=3.11
conda activate AVSRCocktail
# Install FFmpeg, if it's not already installed.
conda install ffmpeg
# Install dependencies
pip install -r requirements.txt
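A quick sanity check of the environment (a minimal, hypothetical snippet, not part of the repo; it assumes requirements.txt pulls in PyTorch, which the torchrun-based training below relies on):
import shutil

import torch  # assumed to be installed via requirements.txt

# FFmpeg must be reachable on PATH for the audio/video preprocessing steps.
assert shutil.which("ffmpeg") is not None, "ffmpeg not found on PATH"
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())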
2. Evaluation
The evaluation script script/evaluation.py provides comprehensive evaluation capabilities for the AVSR Cocktail model on multiple datasets with various noise conditions and interference scenarios.
Quick Start
Basic evaluation on LRS2 test set:
python script/evaluation.py --model_type avsr_cocktail --dataset_name lrs2 --set_id test
Evaluation on AVCocktail dataset:
python script/evaluation.py --model_type avsr_cocktail --dataset_name AVCocktail --set_id video_0
Supported Datasets
1. LRS2 Dataset
Evaluate on the LRS2 dataset with various noise conditions:
Available test sets:
- test: Clean test set
- test_snr_n5_interferer_1: SNR -5dB with 1 interferer
- test_snr_n5_interferer_2: SNR -5dB with 2 interferers
- test_snr_0_interferer_1: SNR 0dB with 1 interferer
- test_snr_0_interferer_2: SNR 0dB with 2 interferers
- test_snr_5_interferer_1: SNR 5dB with 1 interferer
- test_snr_5_interferer_2: SNR 5dB with 2 interferers
- test_snr_10_interferer_1: SNR 10dB with 1 interferer
- test_snr_10_interferer_2: SNR 10dB with 2 interferers
- *: Evaluate on all test sets and report the average WER
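If you prefer to script individual runs instead of using "*", the set_id names above follow a simple pattern and can be generated programmatically (a minimal sketch; the loop simply shells out to the evaluation command shown in this section):
import subprocess

# Build the set_id list following the naming pattern above
# ("n" prefix marks a negative SNR, e.g. test_snr_n5_interferer_1).
set_ids = ["test"] + [
    f"test_snr_{'n' if snr < 0 else ''}{abs(snr)}_interferer_{k}"
    for snr in (-5, 0, 5, 10)
    for k in (1, 2)
]

for set_id in set_ids:
    subprocess.run(
        ["python", "script/evaluation.py",
         "--model_type", "avsr_cocktail",
         "--dataset_name", "lrs2",
         "--set_id", set_id],
        check=True,
    )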
Example:
# Evaluate on clean test set
python script/evaluation.py --model_type avsr_cocktail --dataset_name lrs2 --set_id test
# Evaluate on noisy conditions
python script/evaluation.py --model_type avsr_cocktail --dataset_name lrs2 --set_id test_snr_0_interferer_1
# Evaluate on all conditions
python script/evaluation.py --model_type avsr_cocktail --dataset_name lrs2 --set_id "*"
2. AVCocktail Dataset
Evaluate on the AVCocktail cocktail party dataset:
Available video sets:
- video_0 to video_50: Individual video sessions
- *: Evaluate on all video sessions and report the average WER
The evaluation reports WER for three different chunking strategies:
- asd_chunk: Chunks based on Active Speaker Detection
- fixed_chunk: Fixed-duration chunks
- gold_chunk: Ground-truth optimal chunks
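For intuition, fixed_chunk simply splits a session into windows of at most --max_length seconds. A conceptual sketch (not the repo's implementation):
def fixed_chunks(duration_s: float, max_length_s: float = 15.0):
    """Yield (start, end) boundaries of fixed-duration chunks covering the recording."""
    start = 0.0
    while start < duration_s:
        end = min(start + max_length_s, duration_s)
        yield (start, end)
        start = end

# Example: a 40-second session with the default 15-second window
print(list(fixed_chunks(40.0)))  # [(0.0, 15.0), (15.0, 30.0), (30.0, 40.0)]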
Example:
# Evaluate on specific video
python script/evaluation.py --model_type avsr_cocktail --dataset_name AVCocktail --set_id video_0
# Evaluate on all videos
python script/evaluation.py --model_type avsr_cocktail --dataset_name AVCocktail --set_id "*"
Configuration Options
Model Configuration
- --model_type: Model architecture to use (use avsr_cocktail for the AVSR Cocktail model)
- --checkpoint_path: Path to a custom model checkpoint (default: uses the pretrained nguyenvulebinh/AVSRCocktail)
- --cache_dir: Directory to cache downloaded models (default: ./model-bin)
Processing Parameters
- --max_length: Maximum length of video segments in seconds (default: 15)
- --beam_size: Beam size for beam search decoding (default: 3)
Dataset Parameters
- --dataset_name: Dataset to evaluate on (lrs2 or AVCocktail)
- --set_id: Specific subset to evaluate (see dataset-specific options above)
Output Options
- --verbose: Enable verbose output during processing
- --output_dir_name: Name of the output directory for session processing (default: output)
Advanced Usage
Custom model checkpoint:
python script/evaluation.py \
--model_type avsr_cocktail \
--dataset_name lrs2 \
--set_id test \
--checkpoint_path ./model-bin/my_custom_model \
--cache_dir ./custom_cache
Optimized inference settings:
python script/evaluation.py \
--model_type avsr_cocktail \
--dataset_name AVCocktail \
--set_id "*" \
--max_length 10 \
--beam_size 5 \
--verbose
Output Format
The evaluation script outputs Word Error Rate (WER) scores:
LRS2 evaluation output:
WER test: 0.1234
AVCocktail evaluation output:
WER video_0 asd_chunk: 0.1234
WER video_0 fixed_chunk: 0.1456
WER video_0 gold_chunk: 0.1123
When using --set_id "*", the script reports both individual and average WER scores across all test conditions.
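If you need these scores in downstream tooling, the printed lines are easy to parse. A small sketch (hypothetical helper; eval.log is an assumed capture of the script's stdout, not something the repo produces):
import re

# Parse lines such as "WER test_snr_0_interferer_1: 0.1234" or "WER video_0 asd_chunk: 0.1234"
pattern = re.compile(r"^WER\s+(.+):\s+([0-9.]+)$")

wers = {}
with open("eval.log") as f:
    for line in f:
        m = pattern.match(line.strip())
        if m:
            wers[m.group(1)] = float(m.group(2))

print("average WER:", sum(wers.values()) / len(wers))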
3. Training
Model Architecture
- Encoder: Pre-trained AV-HuBERT large model (nguyenvulebinh/avhubert_encoder_large_noise_pt_noise_ft_433h)
- Decoder: Transformer decoder with CTC/Attention joint training
- Tokenization: SentencePiece unigram tokenizer with 5000 vocabulary units
- Input: Video frames are cropped to a 96 × 96 mouth region of interest, and the audio is sampled at 16 kHz
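A minimal sketch of the input format described above (assumes torchaudio and a pre-computed mouth-center location; the actual pipeline uses a face/landmark detector to locate the mouth region):
import torch
import torchaudio

def prepare_audio(waveform: torch.Tensor, sample_rate: int) -> torch.Tensor:
    # Resample to the 16 kHz rate expected by the AV-HuBERT encoder.
    if sample_rate != 16000:
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
    return waveform

def crop_mouth_roi(frame, center_xy, size=96):
    # Crop a size x size (96 x 96) patch around the mouth center.
    # `frame` is an H x W (x C) array; `center_xy` is assumed to come from a
    # landmark detector and to lie far enough from the image border.
    x, y = center_xy
    half = size // 2
    return frame[y - half:y + half, x - half:x + half]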
Training Data
The model is trained on multiple large-scale datasets that have been preprocessed and are ready for the training pipeline. All datasets are hosted on Hugging Face at nguyenvulebinh/AVYT and include:
| Dataset | Size |
|---|---|
| LRS2 | ~145k samples |
| VoxCeleb2 | ~540k samples |
| AVYT | ~717k samples |
| AVYT-mix | ~483k samples |
More details about these datasets can be found in the Cocktail-Party Audio-Visual Speech Recognition paper.
Dataset Features:
- Preprocessed: All audio-visual data is pre-processed and ready for direct input to the training pipeline
- Multi-modal: Each sample contains synchronized audio and video (mouth crop) data
- Labeled: Text transcriptions for supervised learning
The training pipeline handles dataset loading automatically and can load data in streaming mode. However, for faster and more stable training, it is recommended to download all datasets before running the training pipeline; storing all of them requires approximately 1.46 TB.
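A minimal sketch of loading one of the corpora with the Hugging Face datasets library in streaming mode (the split name here is an assumption; check the nguyenvulebinh/AVYT repository for the actual configurations and splits):
from datasets import load_dataset

# Stream samples instead of downloading the full ~1.46 TB up front.
# "train" is an assumed split name; adjust to the splits actually published.
dataset = load_dataset("nguyenvulebinh/AVYT", split="train", streaming=True)

for sample in dataset.take(1):
    print(sample.keys())  # expect synchronized audio, mouth-crop video, and a transcription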
Training Process
The training script is available at script/train.py.
Multi-GPU Distributed Training:
# Set environment variables for distributed training
export NCCL_DEBUG=WARN
export OMP_NUM_THREADS=1
export CUDA_VISIBLE_DEVICES=0,1,2,3
# Run with torchrun for multi-GPU training (using default parameters)
torchrun --nproc_per_node 4 script/train.py
# Run with custom parameters
torchrun --nproc_per_node 4 script/train.py \
--streaming_dataset \
--batch_size 6 \
--max_steps 400000 \
--gradient_accumulation_steps 2 \
--save_steps 2000 \
--eval_steps 2000 \
--learning_rate 1e-4 \
--warmup_steps 4000 \
--checkpoint_name avsr_avhubert_ctcattn \
--model_name_or_path ./model-bin/avsr_cocktail \
--output_dir ./model-bin
Model Output:
The trained model is saved by default in model-bin/{checkpoint_name}/ (default: model-bin/avsr_avhubert_ctcattn/).
Configuration Options
You can customize training parameters using command line arguments:
Dataset Options:
- --streaming_dataset: Use streaming mode for datasets (default: False)
Training Parameters:
- --batch_size: Batch size per device (default: 6)
- --max_steps: Total training steps (default: 400000)
- --learning_rate: Initial learning rate (default: 1e-4)
- --warmup_steps: Learning rate warmup steps (default: 4000)
- --gradient_accumulation_steps: Gradient accumulation steps (default: 2)
Checkpoint and Logging:
- --save_steps: Checkpoint saving frequency (default: 2000)
- --eval_steps: Evaluation frequency (default: 2000)
- --log_interval: Logging frequency (default: 25)
- --checkpoint_name: Name for the checkpoint directory (default: "avsr_avhubert_ctcattn")
- --resume_from_checkpoint: Resume training from the last checkpoint (default: False)
Model and Output:
- --model_name_or_path: Path to the pretrained model (default: "./model-bin/avsr_cocktail")
- --output_dir: Output directory for checkpoints (default: "./model-bin")
- --report_to: Logging backend, "wandb" or "none" (default: "none")
Hardware Requirements:
- GPU Memory: The default training configuration is designed to fit within 24GB GPU memory
- Training Time: With 2x NVIDIA Titan RTX 24GB GPUs, training takes approximately 56 hours per epoch
- Convergence: 200,000 steps (total batch size 24) is typically sufficient for model convergence
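As a quick sanity check on the numbers above, the effective (total) batch size is the per-device batch size × number of GPUs × gradient accumulation steps:
batch_size = 6                      # per-device default
num_gpus = 2                        # e.g. 2x NVIDIA Titan RTX
gradient_accumulation_steps = 2     # default

total_batch_size = batch_size * num_gpus * gradient_accumulation_steps
print(total_batch_size)             # 24, matching the convergence note above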
Acknowledgement
This repository is built using the auto_avsr, espnet, and avhubert repositories.
Contact