Dolphin: Audio-Visual Speech Separation Model

Dolphin is a state-of-the-art audio-visual speech separation model that combines the audio signal with visual cues from the speaker's lip movements to separate the target speech from background noise and interfering speakers.

Model Description

This model implements the Dolphin architecture for audio-visual speech separation, combining:

  • Audio encoder for processing audio signals
  • Video encoder for processing visual lip movements
  • Multi-modal fusion mechanism
  • Transformer-based separator with global and local attention blocks
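
The fusion step listed above is the least self-explanatory component. The snippet below is a minimal, hypothetical sketch of one way audio features could attend to the visual stream via cross-attention; the CrossModalFusion class, its dimensions, and the residual design are assumptions for illustration, not the released Dolphin code.

import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Hypothetical fusion: audio frames query the visual stream via cross-attention."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_feats: torch.Tensor, video_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, T_audio, dim); video_feats: (batch, T_video, dim)
        fused, _ = self.attn(query=audio_feats, key=video_feats, value=video_feats)
        return self.norm(audio_feats + fused)  # residual keeps the audio stream dominant


# Example: 50 audio frames fused with 25 video frames, both already projected to dim=256.
fusion = CrossModalFusion(dim=256, num_heads=4)
print(fusion(torch.randn(1, 50, 256), torch.randn(1, 25, 256)).shape)  # torch.Size([1, 50, 256])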

Usage

from huggingface_hub import PyTorchModelHubMixin
import torch

# Dolphin is the model class shipped with this repository's code; it subclasses
# PyTorchModelHubMixin, so the weights load directly from the Hugging Face Hub.
model = Dolphin.from_pretrained("JusperLee/Dolphin")
model.eval()

# Example inputs
audio_input = torch.randn(1, 16000)          # 1 second of audio at 16 kHz
video_input = torch.randn(1, 25, 1, 88, 88)  # 25 frames of 88x88 grayscale lip crops

# Perform speech separation
with torch.no_grad():
    separated_audio = model(audio_input, video_input)
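
To run the model on a real recording rather than random tensors, the mixture can be loaded and the result written back with torchaudio. This is a hedged sketch continuing from the snippet above: "mixture.wav" and "separated.wav" are placeholder paths, and it assumes the model returns a single waveform tensor of shape (batch, samples) at 16 kHz.

import torch
import torchaudio

waveform, sr = torchaudio.load("mixture.wav")      # (channels, samples)
waveform = waveform.mean(dim=0, keepdim=True)      # down-mix to mono -> (1, samples)
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)

with torch.no_grad():
    separated_audio = model(waveform, video_input)  # video_input: lip frames as above

# Write the separated track to disk, treating the batch dimension as the channel dimension.
torchaudio.save("separated.wav", separated_audio.cpu(), 16000)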

Model Architecture

  • Audio Encoder: Processes raw audio waveforms
  • Video Encoder: Processes lip movement sequences
  • Feature Projector: Projects audio features to the embedding dimension expected by the separator
  • Separator: Multi-stage transformer with global and local attention
  • Audio Decoder: Reconstructs separated audio waveform
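
The skeleton below is a structural illustration of how these five components could compose in a forward pass, using a simple cross-attention fusion as a stand-in for Dolphin's fusion mechanism. Every module choice, kernel size, and dimension is a placeholder assumption rather than the released Dolphin implementation; it only demonstrates the encode, project, fuse, separate, decode flow with the input shapes from the Usage section.

import torch
import torch.nn as nn


class DolphinSketch(nn.Module):
    """Placeholder composition of the five components; not the released model."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.audio_encoder = nn.Conv1d(1, dim, kernel_size=16, stride=8)      # waveform -> frame features
        self.video_encoder = nn.Linear(88 * 88, dim)                          # per-frame lip embedding
        self.projector = nn.Conv1d(dim, dim, kernel_size=1)                   # match the separator width
        self.fusion = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.separator = nn.TransformerEncoder(                               # stand-in for global/local blocks
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.decoder = nn.ConvTranspose1d(dim, 1, kernel_size=16, stride=8)   # features -> waveform

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        a = self.projector(self.audio_encoder(audio.unsqueeze(1))).transpose(1, 2)  # (B, T_a, dim)
        v = self.video_encoder(video.flatten(2))                                    # (B, T_v, dim)
        fused, _ = self.fusion(a, v, v)                   # audio frames attend to the visual stream
        est = self.separator(a + fused).transpose(1, 2)   # (B, dim, T_a)
        return self.decoder(est).squeeze(1)               # (B, samples)


# Shape check with the inputs from the Usage section.
sketch = DolphinSketch()
print(sketch(torch.randn(1, 16000), torch.randn(1, 25, 1, 88, 88)).shape)  # torch.Size([1, 16000])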

Training Data

The model was trained on audio-visual speech separation datasets containing mixtures of overlapping speakers and background noise.

Citation

If you use this model in your research, please cite the original Dolphin paper.

License

This model is released under the Apache-2.0 License.
