Dolphin: Audio-Visual Speech Separation Model

Dolphin is a state-of-the-art audio-visual speech separation model that combines the audio signal with visual cues from the speaker's lip movements to separate the target speech from background noise and interfering speakers.

Model Description

This model implements the Dolphin architecture for audio-visual speech separation, combining:

  • Audio encoder for processing audio signals
  • Video encoder for processing visual lip movements
  • Multi-modal fusion mechanism
  • Transformer-based separator with global and local attention blocks
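
The fusion step listed above is the least self-explanatory component. The snippet below is a minimal, hypothetical sketch of one way audio features could attend to the visual stream via cross-attention; the CrossModalFusion class, its dimensions, and the residual design are assumptions for illustration, not the released Dolphin code.

import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Hypothetical fusion: audio frames query the visual stream via cross-attention."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_feats: torch.Tensor, video_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, T_audio, dim); video_feats: (batch, T_video, dim)
        fused, _ = self.attn(query=audio_feats, key=video_feats, value=video_feats)
        return self.norm(audio_feats + fused)  # residual keeps the audio stream dominant


# Example: 50 audio frames fused with 25 video frames, both already projected to dim=256.
fusion = CrossModalFusion(dim=256, num_heads=4)
print(fusion(torch.randn(1, 50, 256), torch.randn(1, 25, 256)).shape)  # torch.Size([1, 50, 256])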

Usage

from huggingface_hub import PyTorchModelHubMixin
import torch

# Dolphin is the model class shipped with this repository's code; it subclasses
# PyTorchModelHubMixin, so the weights load directly from the Hugging Face Hub.
model = Dolphin.from_pretrained("JusperLee/Dolphin")
model.eval()

# Example inputs
audio_input = torch.randn(1, 16000)          # 1 second of audio at 16 kHz
video_input = torch.randn(1, 25, 1, 88, 88)  # 25 frames of 88x88 grayscale lip crops

# Perform speech separation
with torch.no_grad():
    separated_audio = model(audio_input, video_input)
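
To run the model on a real recording rather than random tensors, the mixture can be loaded and the result written back with torchaudio. This is a hedged sketch continuing from the snippet above: "mixture.wav" and "separated.wav" are placeholder paths, and it assumes the model returns a single waveform tensor of shape (batch, samples) at 16 kHz.

import torch
import torchaudio

waveform, sr = torchaudio.load("mixture.wav")      # (channels, samples)
waveform = waveform.mean(dim=0, keepdim=True)      # down-mix to mono -> (1, samples)
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)

with torch.no_grad():
    separated_audio = model(waveform, video_input)  # video_input: lip frames as above

# Write the separated track to disk, treating the batch dimension as the channel dimension.
torchaudio.save("separated.wav", separated_audio.cpu(), 16000)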

Model Architecture

  • Audio Encoder: Processes raw audio waveforms
  • Video Encoder: Processes lip movement sequences
  • Feature Projector: Projects audio features to the embedding dimension expected by the separator
  • Separator: Multi-stage transformer with global and local attention
  • Audio Decoder: Reconstructs separated audio waveform
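
The skeleton below is a structural illustration of how these five components could compose in a forward pass, using a simple cross-attention fusion as a stand-in for Dolphin's fusion mechanism. Every module choice, kernel size, and dimension is a placeholder assumption rather than the released Dolphin implementation; it only demonstrates the encode, project, fuse, separate, decode flow with the input shapes from the Usage section.

import torch
import torch.nn as nn


class DolphinSketch(nn.Module):
    """Placeholder composition of the five components; not the released model."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.audio_encoder = nn.Conv1d(1, dim, kernel_size=16, stride=8)      # waveform -> frame features
        self.video_encoder = nn.Linear(88 * 88, dim)                          # per-frame lip embedding
        self.projector = nn.Conv1d(dim, dim, kernel_size=1)                   # match the separator width
        self.fusion = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.separator = nn.TransformerEncoder(                               # stand-in for global/local blocks
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.decoder = nn.ConvTranspose1d(dim, 1, kernel_size=16, stride=8)   # features -> waveform

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        a = self.projector(self.audio_encoder(audio.unsqueeze(1))).transpose(1, 2)  # (B, T_a, dim)
        v = self.video_encoder(video.flatten(2))                                    # (B, T_v, dim)
        fused, _ = self.fusion(a, v, v)                   # audio frames attend to the visual stream
        est = self.separator(a + fused).transpose(1, 2)   # (B, dim, T_a)
        return self.decoder(est).squeeze(1)               # (B, samples)


# Shape check with the inputs from the Usage section.
sketch = DolphinSketch()
print(sketch(torch.randn(1, 16000), torch.randn(1, 25, 1, 88, 88)).shape)  # torch.Size([1, 16000])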

Training Data

The model was trained on audio-visual speech separation datasets containing mixtures of overlapping speakers and background noise.

Citation

If you use this model in your research, please cite the original Dolphin paper.

License

This model is released under the Apache-2.0 License.
