---
license: apache-2.0
datasets:
- alibabasglab/VoxCeleb2-mix
language:
- en
tags:
- speech
pipeline_tag: audio-to-audio
---

# Dolphin: Audio-Visual Speech Separation Model

Dolphin is a state-of-the-art audio-visual speech separation model that leverages both audio and visual information to separate target speech from background noise and competing speakers.

## Model Description

This model implements the Dolphin architecture for audio-visual speech separation, combining:

- An audio encoder for processing audio signals
- A video encoder for processing visual lip movements
- A multi-modal fusion mechanism
- A transformer-based separator with global and local attention blocks

## Usage

```python
import torch
from huggingface_hub import PyTorchModelHubMixin  # Dolphin inherits from this mixin

# The Dolphin class must be importable from the project's model code.
# Load the weights directly from the Hugging Face Hub
# (replace the repo id placeholder with the actual repository).
model = Dolphin.from_pretrained("your-username/dolphin-model")
model.eval()

# Example inputs
audio_input = torch.randn(1, 16000)          # 1 second of audio at 16 kHz
video_input = torch.randn(1, 25, 1, 88, 88)  # 25 frames of 88x88 grayscale video (lip region)

# Perform speech separation
with torch.no_grad():
    separated_audio = model(audio_input, video_input)
```

## Model Architecture

- **Audio Encoder**: Processes raw audio waveforms
- **Video Encoder**: Processes lip-movement sequences
- **Feature Projector**: Projects audio features to the separator's dimensions
- **Separator**: Multi-stage transformer with global and local attention
- **Audio Decoder**: Reconstructs the separated audio waveform

## Training Data

The model was trained on audio-visual speech separation datasets with mixed-speech scenarios (see the VoxCeleb2-mix dataset listed in the metadata above).

## Citation

If you use this model in your research, please cite the original Dolphin paper.

## License

This model is released under the Apache-2.0 License.
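
## Loading via `PyTorchModelHubMixin`

The usage snippet above assumes that the `Dolphin` class inherits from `huggingface_hub.PyTorchModelHubMixin`, which is what provides `from_pretrained` on the class. The sketch below is a hypothetical stub, not the actual Dolphin implementation: the class name `DolphinStub`, the `hidden_dim` argument, and the placeholder layer are illustrative only. It shows how the mixin wires a plain PyTorch module into Hub-style loading and saving.

```python
import torch
import torch.nn as nn
from huggingface_hub import PyTorchModelHubMixin


class DolphinStub(nn.Module, PyTorchModelHubMixin):
    """Hypothetical stand-in for the real Dolphin separator.

    Inheriting from PyTorchModelHubMixin adds from_pretrained(),
    save_pretrained(), and push_to_hub() to the class.
    """

    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        # Placeholder layer; the real model contains the audio/video
        # encoders, fusion module, separator, and audio decoder.
        self.filter = nn.Conv1d(1, 1, kernel_size=3, padding=1)

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # The real model fuses audio and video features; this stub only
        # filters the audio so the example stays runnable.
        return self.filter(audio.unsqueeze(1)).squeeze(1)


if __name__ == "__main__":
    model = DolphinStub()
    audio = torch.randn(1, 16000)          # 1 second of audio at 16 kHz
    video = torch.randn(1, 25, 1, 88, 88)  # 25 frames of 88x88 grayscale video
    out = model(audio, video)              # shape: (1, 16000)

    # Save weights (and, on recent huggingface_hub versions, a config.json) locally,
    # then reload them the same way from_pretrained works against the Hub.
    model.save_pretrained("./dolphin-stub")
    restored = DolphinStub.from_pretrained("./dolphin-stub")
```

`save_pretrained` and `from_pretrained` accept a local directory or a Hub repo id, and `push_to_hub` uploads the same files to the Hub, which is how a published checkpoint becomes loadable with the one-liner in the Usage section.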