AVSRCocktail: Audio-Visual Speech Recognition for Cocktail Party Scenarios

Official implementation of "Cocktail-Party Audio-Visual Speech Recognition" (Interspeech 2025).

An audio-visual speech recognition system designed for multi-speaker environments and noisy cocktail party scenarios. The model combines lip reading and audio processing to remain robust in challenging acoustic conditions with background noise and speaker interference.

Getting Started

Sections

  1. Installation
  2. Evaluation
  3. Training

1. Installation

Follow these steps:

# Clone the baseline code repo
git clone https://github.com/nguyenvulebinh/AVSRCocktail.git
cd AVSRCocktail

# Create Conda environment
conda create --name AVSRCocktail python=3.11
conda activate AVSRCocktail

# Install FFmpeg, if it's not already installed.
conda install ffmpeg

# Install dependencies
pip install -r requirements.txt
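
To quickly confirm that the environment is usable before running evaluation or training, a short Python sanity check along these lines can help (this script is not part of the repository):

# check_env.py: verify FFmpeg is on PATH and PyTorch can see a GPU (illustrative only)
import shutil
import torch

assert shutil.which("ffmpeg") is not None, "ffmpeg not found on PATH"
print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())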

2. Evaluation

The evaluation script script/evaluation.py evaluates the AVSR Cocktail model on multiple datasets under various noise conditions and interference scenarios.

Quick Start

Basic evaluation on LRS2 test set:

python script/evaluation.py --model_type avsr_cocktail --dataset_name lrs2 --set_id test

Evaluation on AVCocktail dataset:

python script/evaluation.py --model_type avsr_cocktail --dataset_name AVCocktail --set_id video_0

Supported Datasets

1. LRS2 Dataset

Evaluate on the LRS2 dataset with various noise conditions:

Available test sets:

  • test: Clean test set
  • test_snr_n5_interferer_1: SNR -5dB with 1 interferer
  • test_snr_n5_interferer_2: SNR -5dB with 2 interferers
  • test_snr_0_interferer_1: SNR 0dB with 1 interferer
  • test_snr_0_interferer_2: SNR 0dB with 2 interferers
  • test_snr_5_interferer_1: SNR 5dB with 1 interferer
  • test_snr_5_interferer_2: SNR 5dB with 2 interferers
  • test_snr_10_interferer_1: SNR 10dB with 1 interferer
  • test_snr_10_interferer_2: SNR 10dB with 2 interferers
  • *: Evaluate on all test sets and report average WER

Example:

# Evaluate on clean test set
python script/evaluation.py --model_type avsr_cocktail --dataset_name lrs2 --set_id test

# Evaluate on noisy conditions
python script/evaluation.py --model_type avsr_cocktail --dataset_name lrs2 --set_id test_snr_0_interferer_1

# Evaluate on all conditions
python script/evaluation.py --model_type avsr_cocktail --dataset_name lrs2 --set_id "*"
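
If you prefer to run each noise condition separately and aggregate the scores yourself rather than passing --set_id "*", a small driver script like the following can work. It is only a sketch: it assumes the script prints results in the "WER <set_id>: <score>" format shown in the Output Format section below and that it is launched from the repository root.

# run_lrs2_conditions.py: run each LRS2 condition and collect WERs (illustrative sketch)
import re
import subprocess

SET_IDS = [
    "test",
    "test_snr_n5_interferer_1", "test_snr_n5_interferer_2",
    "test_snr_0_interferer_1", "test_snr_0_interferer_2",
    "test_snr_5_interferer_1", "test_snr_5_interferer_2",
    "test_snr_10_interferer_1", "test_snr_10_interferer_2",
]

results = {}
for set_id in SET_IDS:
    out = subprocess.run(
        ["python", "script/evaluation.py", "--model_type", "avsr_cocktail",
         "--dataset_name", "lrs2", "--set_id", set_id],
        capture_output=True, text=True, check=True,
    ).stdout
    match = re.search(rf"WER {set_id}: ([0-9.]+)", out)
    results[set_id] = float(match.group(1)) if match else None

scores = [v for v in results.values() if v is not None]
print(results)
print("average WER:", sum(scores) / len(scores))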

2. AVCocktail Dataset

Evaluate on the AVCocktail cocktail party dataset:

Available video sets:

  • video_0 to video_50: Individual video sessions
  • *: Evaluate on all video sessions and report average WER

The evaluation reports WER for three different chunking strategies:

  • asd_chunk: Chunks based on Active Speaker Detection
  • fixed_chunk: Fixed-duration chunks
  • gold_chunk: Ground truth optimal chunks
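
To make the fixed_chunk strategy concrete, here is a minimal sketch of fixed-duration chunking. The repository's evaluation code implements its own chunking logic; the 15-second window below simply mirrors the default --max_length and is an assumption, not the exact implementation.

# Illustrative fixed-duration chunking: split a recording into fixed-length windows
def fixed_chunks(total_duration_s: float, chunk_length_s: float = 15.0):
    """Yield (start, end) times in seconds covering the whole recording."""
    start = 0.0
    while start < total_duration_s:
        end = min(start + chunk_length_s, total_duration_s)
        yield (start, end)
        start = end

print(list(fixed_chunks(40.0)))  # [(0.0, 15.0), (15.0, 30.0), (30.0, 40.0)]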

Example:

# Evaluate on specific video
python script/evaluation.py --model_type avsr_cocktail --dataset_name AVCocktail --set_id video_0

# Evaluate on all videos
python script/evaluation.py --model_type avsr_cocktail --dataset_name AVCocktail --set_id "*"

Configuration Options

Model Configuration

  • --model_type: Model architecture to use (use avsr_cocktail for the AVSR Cocktail model)
  • --checkpoint_path: Path to custom model checkpoint (default: uses pretrained nguyenvulebinh/AVSRCocktail)
  • --cache_dir: Directory to cache downloaded models (default: ./model-bin)
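
If you want to fetch the pretrained weights ahead of time (for example, on a machine without network access at evaluation time), a minimal sketch using the Hugging Face Hub client is shown below. Whether evaluation.py picks the files up from this exact location is an assumption; point --checkpoint_path or --cache_dir at the downloaded directory as needed.

# Pre-download the pretrained checkpoint (illustrative; target directory is an assumption)
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="nguyenvulebinh/AVSRCocktail",
    local_dir="./model-bin/AVSRCocktail",
)
print("Model files downloaded to:", local_path)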

Processing Parameters

  • --max_length: Maximum length of video segments in seconds (default: 15)
  • --beam_size: Beam size for beam search decoding (default: 3)

Dataset Parameters

  • --dataset_name: Dataset to evaluate on (lrs2 or AVCocktail)
  • --set_id: Specific subset to evaluate (see dataset-specific options above)

Output Options

  • --verbose: Enable verbose output during processing
  • --output_dir_name: Name of output directory for session processing (default: output)

Advanced Usage

Custom model checkpoint:

python script/evaluation.py \
    --model_type avsr_cocktail \
    --dataset_name lrs2 \
    --set_id test \
    --checkpoint_path ./model-bin/my_custom_model \
    --cache_dir ./custom_cache

Optimized inference settings:

python script/evaluation.py \
    --model_type avsr_cocktail \
    --dataset_name AVCocktail \
    --set_id "*" \
    --max_length 10 \
    --beam_size 5 \
    --verbose

Output Format

The evaluation script outputs Word Error Rate (WER) scores:

LRS2 evaluation output:

WER test: 0.1234

AVCocktail evaluation output:

WER video_0 asd_chunk: 0.1234
WER video_0 fixed_chunk: 0.1456  
WER video_0 gold_chunk: 0.1123

When using --set_id "*", the script reports both individual and average WER scores across all test conditions.
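
For reference, WER figures like the ones above can be reproduced offline with the jiwer library; the evaluation script may use its own metric implementation, and the numbers below are purely illustrative.

# Computing WER with jiwer (illustrative; not the repository's own metric code)
import jiwer

references = ["the quick brown fox", "hello world"]
hypotheses = ["the quick brown box", "hello word"]

print("WER:", jiwer.wer(references, hypotheses))  # corpus-level WER over all segments

# Averaging per-condition WERs, e.g. when aggregating results across test conditions
per_condition = {"test_snr_0_interferer_1": 0.21, "test_snr_0_interferer_2": 0.27}
print("avg WER:", sum(per_condition.values()) / len(per_condition))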

3. Training

Model Architecture

  • Encoder: Pre-trained AV-HuBERT large model (nguyenvulebinh/avhubert_encoder_large_noise_pt_noise_ft_433h)
  • Decoder: Transformer decoder with CTC/Attention joint training
  • Tokenization: SentencePiece unigram tokenizer with 5000 vocabulary units
  • Input: Video frames are cropped to the mouth region of interest using a 96 × 96 bounding box, while the audio is sampled at a 16 kHz rate
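
For orientation, a minimal PyTorch sketch of a joint CTC/attention objective is shown below. The 0.1 CTC weight, tensor shapes, and random inputs are placeholders for illustration, not values taken from this repository.

# Joint CTC/attention loss sketch (weight and shapes are illustrative assumptions)
import torch
import torch.nn.functional as F

ctc_weight = 0.1
vocab_size, T, U, B = 5000, 120, 40, 2  # vocab units, encoder frames, target length, batch

ctc_log_probs = F.log_softmax(torch.randn(T, B, vocab_size), dim=-1)  # (T, B, V) from encoder
decoder_logits = torch.randn(B, U, vocab_size)                        # (B, U, V) from decoder
targets = torch.randint(1, vocab_size, (B, U))                        # token ids (0 = blank)

ctc_loss = F.ctc_loss(
    ctc_log_probs, targets,
    input_lengths=torch.full((B,), T), target_lengths=torch.full((B,), U),
)
att_loss = F.cross_entropy(decoder_logits.reshape(-1, vocab_size), targets.reshape(-1))
loss = ctc_weight * ctc_loss + (1 - ctc_weight) * att_loss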

Training Data

The model is trained on multiple large-scale datasets that have been preprocessed and are ready for the training pipeline. All datasets are hosted on Hugging Face at nguyenvulebinh/AVYT and include:

Dataset      Size
LRS2         ~145k samples
VoxCeleb2    ~540k samples
AVYT         ~717k samples
AVYT-mix     ~483k samples

More details about these datasets can be found in the Cocktail-Party Audio-Visual Speech Recognition paper.

Dataset Features:

  • Preprocessed: All audio-visual data is pre-processed and ready for direct input to the training pipeline
  • Multi-modal: Each sample contains synchronized audio and video (mouth crop) data
  • Labeled: Text transcriptions for supervised learning

The training pipeline handles dataset loading automatically and can stream data directly from Hugging Face. However, for faster and more stable training, it is recommended to download all datasets before running the training pipeline; storing them requires approximately 1.46 TB.
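
A minimal sketch of the two options is shown below: streaming the data directly from the Hub, or downloading everything first. The exact configuration names, splits, and per-sample fields are assumptions to verify against the dataset card at nguyenvulebinh/AVYT.

# Option 1: stream the training data (avoids the ~1.46 TB download, but slower and less stable)
from datasets import load_dataset

ds = load_dataset("nguyenvulebinh/AVYT", split="train", streaming=True)  # split name is an assumption
print(next(iter(ds)).keys())

# Option 2: pre-download the full dataset for faster, more stable training
from huggingface_hub import snapshot_download

snapshot_download(repo_id="nguyenvulebinh/AVYT", repo_type="dataset", local_dir="./data-bin/AVYT")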

Training Process

The training script is available at script/train.py.

Multi-GPU Distributed Training:

# Set environment variables for distributed training
export NCCL_DEBUG=WARN
export OMP_NUM_THREADS=1
export CUDA_VISIBLE_DEVICES=0,1,2,3

# Run with torchrun for multi-GPU training (using default parameters)
torchrun --nproc_per_node 4 script/train.py

# Run with custom parameters
torchrun --nproc_per_node 4 script/train.py \
    --streaming_dataset \
    --batch_size 6 \
    --max_steps 400000 \
    --gradient_accumulation_steps 2 \
    --save_steps 2000 \
    --eval_steps 2000 \
    --learning_rate 1e-4 \
    --warmup_steps 4000 \
    --checkpoint_name avsr_avhubert_ctcattn \
    --model_name_or_path ./model-bin/avsr_cocktail \
    --output_dir ./model-bin

Model Output: The trained model will be saved by default in model-bin/{checkpoint_name}/ (default: model-bin/avsr_avhubert_ctcattn/).

Configuration Options

You can customize training parameters using command line arguments:

Dataset Options:

  • --streaming_dataset: Use streaming mode for datasets (default: False)

Training Parameters:

  • --batch_size: Batch size per device (default: 6)
  • --max_steps: Total training steps (default: 400000)
  • --learning_rate: Initial learning rate (default: 1e-4)
  • --warmup_steps: Learning rate warmup steps (default: 4000)
  • --gradient_accumulation_steps: Gradient accumulation (default: 2)

Checkpoint and Logging:

  • --save_steps: Checkpoint saving frequency (default: 2000)
  • --eval_steps: Evaluation frequency (default: 2000)
  • --log_interval: Logging frequency (default: 25)
  • --checkpoint_name: Name for the checkpoint directory (default: "avsr_avhubert_ctcattn")
  • --resume_from_checkpoint: Resume training from last checkpoint (default: False)

Model and Output:

  • --model_name_or_path: Path to pretrained model (default: "./model-bin/avsr_cocktail")
  • --output_dir: Output directory for checkpoints (default: "./model-bin")
  • --report_to: Logging backend, "wandb" or "none" (default: "none")

Hardware Requirements:

  • GPU Memory: The default training configuration is designed to fit within 24GB GPU memory
  • Training Time: With 2x NVIDIA Titan RTX 24GB GPUs, training takes approximately 56 hours per epoch
  • Convergence: 200,000 steps (total batch size 24) is typically sufficient for model convergence
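
As a quick sanity check of the batch-size arithmetic: with the default per-device batch size of 6, gradient accumulation of 2, and 2 GPUs, the effective batch size is 6 × 2 × 2 = 24, matching the total batch size quoted above. With 4 GPUs (as in the torchrun example), the same settings give an effective batch size of 48.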

Acknowledgement

This repository is built using the auto_avsr, espnet, and avhubert repositories.

Contact

[email protected]
