# Dolphin: Audio-Visual Speech Separation Model
Dolphin is a state-of-the-art audio-visual speech separation model that leverages both audio and visual information to separate target speech from background noise and other speakers.
## Model Description
This model implements the Dolphin architecture for audio-visual speech separation, combining:
- Audio encoder for processing audio signals
- Video encoder for processing visual lip movements
- Multi-modal fusion mechanism
- Transformer-based separator with global and local attention blocks
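
As one concrete illustration of how the fusion in the list above can work, the snippet below combines placeholder audio and video features with cross-attention. The tensor shapes and the use of `nn.MultiheadAttention` are assumptions chosen for illustration, not the exact Dolphin mechanism.

```python
import torch
import torch.nn as nn

# Placeholder features; in the real model these come from the audio and video encoders
audio_feats = torch.randn(1, 1999, 256)   # (batch, audio frames, feature dim)
video_feats = torch.randn(1, 25, 256)     # (batch, video frames, feature dim)

# Audio frames query the visual stream; the result is blended back into the audio path
cross_attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
attended, _ = cross_attn(query=audio_feats, key=video_feats, value=video_feats)
fused = audio_feats + attended            # residual fusion keeps the audio detail
```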
## Usage
```python
import torch

# The Dolphin class is the model implementation that ships with this repository
# (a torch.nn.Module using huggingface_hub.PyTorchModelHubMixin), e.g.:
# from dolphin import Dolphin

# Load the pretrained weights directly from the Hugging Face Hub
model = Dolphin.from_pretrained("your-username/dolphin-model")
model.eval()

# Dummy inputs: 1 second of audio at 16 kHz and 25 frames of 88x88 grayscale video
audio_input = torch.randn(1, 16000)
video_input = torch.randn(1, 25, 1, 88, 88)

# Perform speech separation
with torch.no_grad():
    separated_audio = model(audio_input, video_input)
```
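
For real recordings rather than random tensors, the mixture waveform can be prepared with torchaudio. This is a minimal sketch that assumes the model expects mono 16 kHz audio; the file name is a hypothetical placeholder.

```python
import torchaudio

# Load a mixture recording (hypothetical path) and prepare it as model input
waveform, sample_rate = torchaudio.load("mixture.wav")
waveform = waveform.mean(dim=0, keepdim=True)  # downmix to mono -> shape (1, num_samples)
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sample_rate, new_freq=16000)

audio_input = waveform  # replaces the random tensor in the example above
```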
## Model Architecture
- Audio Encoder: Processes raw audio waveforms
- Video Encoder: Processes lip movement sequences
- Feature Projector: Projects audio features to appropriate dimensions
- Separator: Multi-stage transformer with global and local attention
- Audio Decoder: Reconstructs separated audio waveform
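
The components above compose into an encode, fuse, separate, decode pipeline. The skeleton below is a hedged illustration of that data flow with made-up layer choices and dimensions; it mirrors the structure of the list, not the actual Dolphin implementation.

```python
import torch
import torch.nn as nn

class DolphinSketch(nn.Module):
    """Illustrative stand-in showing how the five stages fit together."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.audio_encoder = nn.Conv1d(1, dim, kernel_size=16, stride=8)   # waveform -> frame features
        self.video_encoder = nn.Linear(88 * 88, dim)                       # per-frame lip features (stand-in)
        self.projector = nn.Linear(dim, dim)                               # align video features with the audio dim
        self.separator = nn.TransformerEncoder(                            # stand-in for global/local attention blocks
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=2)
        self.audio_decoder = nn.ConvTranspose1d(dim, 1, kernel_size=16, stride=8)

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        a = self.audio_encoder(audio.unsqueeze(1)).transpose(1, 2)         # (B, T_audio, dim)
        v = self.projector(self.video_encoder(video.flatten(2)))           # (B, T_video, dim)
        v = nn.functional.interpolate(v.transpose(1, 2), size=a.size(1)).transpose(1, 2)  # match frame rates
        mask = torch.sigmoid(self.separator(a + v))                        # fused features -> separation mask
        return self.audio_decoder((a * mask).transpose(1, 2)).squeeze(1)   # (B, num_samples)

est = DolphinSketch()(torch.randn(1, 16000), torch.randn(1, 25, 1, 88, 88))  # -> shape (1, 16000)
```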
## Training Data
The model was trained on audio-visual speech separation datasets in which the target speaker's speech is mixed with interfering speakers and background noise.
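
Training mixtures for this kind of task are typically created by summing two utterances at a chosen signal-to-noise ratio. The following is an illustrative sketch of that idea, not a description of the exact training pipeline used for this model.

```python
import torch

def mix_at_snr(target: torch.Tensor, interferer: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Scale the interfering speech so the mixture has the requested SNR, then sum."""
    target_power = target.pow(2).mean()
    interferer_power = interferer.pow(2).mean()
    scale = torch.sqrt(target_power / (interferer_power * 10 ** (snr_db / 10)))
    return target + scale * interferer

# Two 1-second utterances at 16 kHz (random placeholders)
mixture = mix_at_snr(torch.randn(16000), torch.randn(16000), snr_db=0.0)
```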
## Citation
If you use this model in your research, please cite the original Dolphin paper.
## License
This model is released under the Apache-2.0 License.