---
datasets:
- alibabasglab/VoxCeleb2-mix
language:
- en
library_name: pytorch
license: apache-2.0
pipeline_tag: audio-to-audio
tags:
- audio-visual
- speech-separation
- cocktail-party
- multimodal
- lip-reading
- audio-processing
---
# Dolphin: Efficient Audio-Visual Speech Separation
<p align="center">
<img src="https://github.com/JusperLee/Dolphin/raw/main/assets/icon.png" alt="Dolphin Logo" width="120"/>
</p>
## Model Overview
**Dolphin** is an efficient audio-visual speech separation model that extracts target speech from noisy environments by combining acoustic and visual (lip movement) cues. It achieves **state-of-the-art performance** while being **6× faster** and using **50% fewer parameters** than previous methods.
**Links**: [Paper](https://arxiv.org/abs/2509.23610) | [Code](https://github.com/JusperLee/Dolphin) | [Demo](https://huggingface.co/spaces/JusperLee/Dolphin) | [Project Page](https://cslikai.cn/Dolphin)
## Key Features
- **Balanced Quality & Efficiency**: SOTA separation quality without iterative refinement
- **DP-LipCoder**: Lightweight video encoder with discrete audio-aligned semantic tokens
- **Global-Local Attention**: Multi-scale attention for long-range context and fine-grained details
- **Edge-Friendly**: >50% parameter reduction, >2.4× lower MACs, >6× faster inference
## Performance
**VoxCeleb2 Benchmark:**
| Metric | Value |
|--------|-------|
| SI-SNRi | **16.1 dB** |
| SDRi | **16.3 dB** |
| PESQ | **3.45** |
| ESTOI | **0.93** |
| Parameters | **51.3M** (vs 112M in IIANet) |
| MACs | **417G** (vs 1009G in IIANet) |
| Inference Speed | **0.015 s per 4 s clip** (vs 0.100 s in IIANet) |
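
The inference-speed figure can be sanity-checked on your own hardware with a simple timing loop like the one below. This is an illustrative sketch only: it assumes a `model` instance loaded as in the Quick Start section, uses dummy tensors in place of real inputs, and reports wall-clock time that will vary with device and batch size.

```python
import time
import torch

# Illustrative benchmark: assumes `model` is a loaded Dolphin instance
# (see Quick Start below). Dummy tensors stand in for real preprocessed inputs.
audio = torch.randn(1, 64000)            # 4 s of 16 kHz audio
video = torch.randn(1, 100, 1, 88, 88)   # 4 s of 25 fps grayscale 88x88 lip crops

with torch.no_grad():
    model(audio, video)                  # warm-up pass (excludes one-time setup cost)
    runs = 10
    start = time.perf_counter()
    for _ in range(runs):
        model(audio, video)
    elapsed = (time.perf_counter() - start) / runs

print(f"Average time per 4-second clip: {elapsed:.3f} s")
```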
## Quick Start
### Installation
```bash
pip install torch torchvision torchaudio
pip install huggingface_hub
```
### Inference Example
```python
import torch
from huggingface_hub import hf_hub_download
import yaml
# Download model and config
config_path = hf_hub_download(repo_id="JusperLee/Dolphin", filename="conf.yml")
model_path = hf_hub_download(repo_id="JusperLee/Dolphin", filename="best_model.pth")
# Load the model (the Dolphin class must be imported from the GitHub repo)
with open(config_path) as f:
    config = yaml.safe_load(f)
model = Dolphin(**config['model'])
model.load_state_dict(torch.load(model_path, map_location='cpu'))
model.eval()
# Prepare inputs
# audio: [batch, samples] - 16kHz audio
# video: [batch, frames, 1, height, width] - grayscale lip frames
audio_mixture = torch.randn(1, 64000) # 4 seconds at 16kHz
video_frames = torch.randn(1, 100, 1, 88, 88) # 4s at 25fps, 88x88 resolution
# Separate speech
with torch.no_grad():
    separated_audio = model(audio_mixture, video_frames)
```
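
The separated waveform can then be written to disk, for example with `torchaudio`. This is a small sketch; depending on the checkpoint the output may carry an extra speaker dimension, so the indexing below is an assumption to adapt as needed.

```python
import torchaudio

# Take the first item in the batch; if the model returns one waveform per
# speaker, also select a speaker index here (assumption, shapes may differ).
waveform = separated_audio[0].detach().cpu()
if waveform.dim() == 1:
    waveform = waveform.unsqueeze(0)     # torchaudio expects [channels, samples]
torchaudio.save("separated.wav", waveform, sample_rate=16000)
```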
### Complete Pipeline with Video Input
For end-to-end video processing with face detection and tracking, see our [inference script](https://github.com/JusperLee/Dolphin/blob/main/inference.py):
```bash
git clone https://github.com/JusperLee/Dolphin.git
cd Dolphin
python inference.py \
--input video.mp4 \
--output ./output \
--speakers 2 \
--config checkpoints/vox2/conf.yml
```
## Model Architecture
### Components
1. **DP-LipCoder** (Video Encoder)
   - Dual-path architecture: visual compression + semantic encoding
   - Vector quantization for discrete lip semantic tokens
   - Knowledge distillation from AV-HuBERT
   - Only **8.5M parameters**
2. **Audio Encoder**
   - Convolutional encoder for time-frequency representation
   - Extracts multi-scale acoustic features
3. **Global-Local Attention Separator**
   - Single-pass TDANet-based architecture
   - **Global Attention (GA)**: Coarse-grained self-attention for long-range dependencies
   - **Local Attention (LA)**: Heat diffusion attention for noise suppression
   - No iterative refinement needed
4. **Audio Decoder**
   - Reconstructs separated waveform from enhanced features
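
Putting the components above together, the overall forward pass can be sketched as follows. This is an illustrative outline only; module names and call signatures are assumptions and do not mirror the actual classes in the repository.

```python
import torch.nn as nn

class DolphinSketch(nn.Module):
    """Illustrative data flow of Dolphin; not the repository's implementation."""

    def __init__(self, audio_encoder, lip_coder, separator, audio_decoder):
        super().__init__()
        self.audio_encoder = audio_encoder   # convolutional time-frequency encoder
        self.lip_coder = lip_coder           # DP-LipCoder video encoder
        self.separator = separator           # global-local attention separator (single pass)
        self.audio_decoder = audio_decoder   # waveform reconstruction

    def forward(self, mixture, lip_frames):
        acoustic = self.audio_encoder(mixture)      # multi-scale acoustic features
        visual = self.lip_coder(lip_frames)         # discrete lip semantic tokens
        mask = self.separator(acoustic, visual)     # fuse cues, no iterative refinement
        return self.audio_decoder(acoustic * mask)  # separated target waveform
```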
### Input/Output Specifications
**Inputs:**
- `audio`: Mixed audio waveform, shape `[batch, samples]`, 16kHz sampling rate
- `video`: Grayscale lip region frames, shape `[batch, frames, 1, 88, 88]`, 25fps
**Output:**
- `separated_audio`: Separated target speech, shape `[batch, samples]`, 16kHz
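
A minimal preprocessing sketch that produces inputs with these shapes is shown below. It assumes the lip region has already been cropped (in practice, face detection and tracking are handled by the repository's `inference.py`); file names and frame counts are placeholders.

```python
import torch
import torchaudio
import torchvision.transforms.functional as TF

# Audio: load, resample to 16 kHz, downmix to mono, keep a batch dimension.
waveform, sr = torchaudio.load("mixture.wav")
waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16000)
audio = waveform.mean(dim=0, keepdim=True)                # [1, samples]

# Video: pre-cropped RGB lip frames at 25 fps -> grayscale 88x88.
rgb_frames = torch.rand(100, 3, 96, 96)                   # placeholder [frames, 3, H, W]
gray_frames = torch.stack(
    [TF.resize(TF.rgb_to_grayscale(f), [88, 88]) for f in rgb_frames]
)
video = gray_frames.unsqueeze(0)                          # [1, frames, 1, 88, 88]
```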
## Training Details
- **Dataset**: VoxCeleb2 (2-speaker mixtures at 0dB SNR)
- **Training**: ~200K steps with Adam optimizer
- **Augmentation**: Random mixing, noise addition, video frame dropout
- **Loss**: SI-SNR (Scale-Invariant Signal-to-Noise Ratio)
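
For reference, a standard SI-SNR implementation is sketched below; whether the official training code adds extras such as clipping or permutation-invariant training is not shown here and should be checked against the repository.

```python
import torch

def si_snr(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SNR in dB for [batch, samples] tensors (higher is better)."""
    # Remove the mean so the scaling projection is well defined.
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # Project the estimate onto the target to obtain the scaled reference signal.
    scale = (estimate * target).sum(-1, keepdim=True) / (target.pow(2).sum(-1, keepdim=True) + eps)
    s_target = scale * target
    e_noise = estimate - s_target
    return 10 * torch.log10(s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps) + eps)

# Training minimizes the negative SI-SNR:
# loss = -si_snr(model_output, clean_reference).mean()
```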
## Use Cases
- **Hearing Aids**: Camera-based speech enhancement
- **Video Conferencing**: Noise suppression with visual context
- **In-Car Assistants**: Driver speech extraction
- **AR/VR**: Immersive communication in noisy environments
- **Edge Devices**: Efficient deployment on mobile/embedded systems
## Limitations
- Requires frontal or near-frontal face view for optimal performance
- Works best with 25fps video input
- Trained on English speech (may need fine-tuning for other languages)
- Performance degrades with severe occlusions or low lighting
## Citation
```bibtex
@misc{li2025dolphin,
  title={Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention},
  author={Kai Li and Kejun Gao and Xiaolin Hu},
  year={2025},
  eprint={2509.23610},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2509.23610}
}
```
## License
Apache-2.0 License. See [LICENSE](https://github.com/JusperLee/Dolphin/blob/main/LICENSE) for details.
## Acknowledgments
Built with inspiration from IIANet and SepReformer. Thanks to the Hugging Face team for hosting!
## Contact
- Email: [email protected]
- Issues: [GitHub Issues](https://github.com/JusperLee/Dolphin/issues)
- Discussions: [GitHub Discussions](https://github.com/JusperLee/Dolphin/discussions)
---
**Developed by the Audio and Speech Group at Tsinghua University**