---
datasets:
- alibabasglab/VoxCeleb2-mix
language:
- en
library_name: pytorch
license: apache-2.0
pipeline_tag: audio-to-audio
tags:
- audio-visual
- speech-separation
- cocktail-party
- multimodal
- lip-reading
- audio-processing
---

# Dolphin: Efficient Audio-Visual Speech Separation

<p align="center">
  <img src="https://github.com/JusperLee/Dolphin/raw/main/assets/icon.png" alt="Dolphin Logo" width="120"/>
</p>


## Model Overview

**Dolphin** is an efficient audio-visual speech separation model that extracts target speech from noisy environments by combining acoustic and visual (lip movement) cues. It achieves **state-of-the-art performance** while being **6× faster** and using **50% fewer parameters** than previous methods.

🔗 **Links**: [📄 Paper](https://arxiv.org/abs/2509.23610) | [💻 Code](https://github.com/JusperLee/Dolphin) | [🎮 Demo](https://huggingface.co/spaces/JusperLee/Dolphin) | [🌐 Project Page](https://cslikai.cn/Dolphin)

## Key Features

- 🎯 **Balanced Quality & Efficiency**: SOTA separation quality without iterative refinement
- 🔬 **DP-LipCoder**: Lightweight video encoder with discrete audio-aligned semantic tokens
- 🌐 **Global-Local Attention**: Multi-scale attention for long-range context and fine-grained details
- 🚀 **Edge-Friendly**: >50% parameter reduction, >2.4× lower MACs, >6× faster inference

## Performance

**VoxCeleb2 Benchmark:**

| Metric | Value |
|--------|-------|
| SI-SNRi | **16.1 dB** |
| SDRi | **16.3 dB** |
| PESQ | **3.45** |
| ESTOI | **0.93** |
| Parameters | **51.3M** (vs 112M in IIANet) |
| MACs | **417G** (vs 1009G in IIANet) |
| Inference Speed | **0.015s/4s-clip** (vs 0.100s in IIANet) |

## Quick Start

### Installation

```bash
pip install torch torchvision torchaudio
pip install huggingface_hub pyyaml
```

### Inference Example

```python
import torch
from huggingface_hub import hf_hub_download
import yaml

# Download model and config
config_path = hf_hub_download(repo_id="JusperLee/Dolphin", filename="conf.yml")
model_path = hf_hub_download(repo_id="JusperLee/Dolphin", filename="best_model.pth")

# Load model (the Dolphin class must be imported from the repository code first)
with open(config_path) as f:
    config = yaml.safe_load(f)

model = Dolphin(**config['model'])
model.load_state_dict(torch.load(model_path, map_location='cpu'))
model.eval()

# Prepare inputs
# audio: [batch, samples] - 16kHz audio
# video: [batch, frames, 1, height, width] - grayscale lip frames
audio_mixture = torch.randn(1, 64000)  # 4 seconds at 16kHz
video_frames = torch.randn(1, 100, 1, 88, 88)  # 4s at 25fps, 88x88 resolution

# Separate speech
with torch.no_grad():
    separated_audio = model(audio_mixture, video_frames)
```
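
To listen to the result, the separated tensor from the snippet above can be written back to disk. A minimal sketch using `torchaudio` follows; the file name and the exact output-shape handling are assumptions, since the output layout depends on the model configuration:

```python
import torchaudio

# The output layout depends on the model configuration; commonly it is
# [batch, samples] for a single separated target. Drop the batch dimension
# and make sure the tensor is [channels, samples] before saving.
waveform = separated_audio.squeeze(0).cpu()
if waveform.dim() == 1:
    waveform = waveform.unsqueeze(0)
torchaudio.save("separated.wav", waveform, sample_rate=16000)  # file name is illustrative
```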

### Complete Pipeline with Video Input

For end-to-end video processing with face detection and tracking, see our [inference script](https://github.com/JusperLee/Dolphin/blob/main/inference.py):

```bash
git clone https://github.com/JusperLee/Dolphin.git
cd Dolphin
python inference.py \
    --input video.mp4 \
    --output ./output \
    --speakers 2 \
    --config checkpoints/vox2/conf.yml
```

## Model Architecture

### Components

1.  **DP-LipCoder** (Video Encoder)
    -   Dual-path architecture: visual compression + semantic encoding
    -   Vector quantization for discrete lip semantic tokens
    -   Knowledge distillation from AV-HuBERT
    -   Only **8.5M parameters**

2.  **Audio Encoder**
    -   Convolutional encoder for time-frequency representation
    -   Extracts multi-scale acoustic features

3.  **Global-Local Attention Separator**
    -   Single-pass TDANet-based architecture
    -   **Global Attention (GA)**: Coarse-grained self-attention for long-range dependencies
    -   **Local Attention (LA)**: Heat diffusion attention for noise suppression
    -   No iterative refinement needed

4.  **Audio Decoder**
    -   Reconstructs separated waveform from enhanced features
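
These four blocks compose as a straightforward feed-forward pipeline. The sketch below only illustrates that composition; the class and module names are hypothetical, not the repository's actual API:

```python
import torch.nn as nn

class DolphinPipelineSketch(nn.Module):
    """Illustrative composition of the four components; names are placeholders."""

    def __init__(self, video_encoder, audio_encoder, separator, audio_decoder):
        super().__init__()
        self.video_encoder = video_encoder   # DP-LipCoder: lip frames -> discrete semantic tokens
        self.audio_encoder = audio_encoder   # waveform -> time-frequency features
        self.separator = separator           # single-pass global-local attention separator
        self.audio_decoder = audio_decoder   # enhanced features -> waveform

    def forward(self, mixture, lip_frames):
        visual = self.video_encoder(lip_frames)    # visual/semantic cues from the lips
        acoustic = self.audio_encoder(mixture)     # acoustic features of the mixture
        enhanced = self.separator(acoustic, visual)
        return self.audio_decoder(enhanced)        # separated target waveform
```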

### Input/Output Specifications

**Inputs:**
-   `audio`: Mixed audio waveform, shape `[batch, samples]`, 16kHz sampling rate
-   `video`: Grayscale lip region frames, shape `[batch, frames, 1, 88, 88]`, 25fps

**Output:**
-   `separated_audio`: Separated target speech, shape `[batch, samples]`, 16kHz
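
If your source material is not already in this format, it has to be converted first. A minimal preprocessing sketch follows; these helper functions are assumptions for illustration, not part of the repository, and use `torchaudio` and `torchvision`:

```python
import torch
import torchaudio
import torchvision.transforms.functional as TF

def prepare_audio(waveform: torch.Tensor, orig_sr: int) -> torch.Tensor:
    """[channels, samples] at any rate -> [1, samples] mono at 16 kHz."""
    waveform = waveform.mean(dim=0, keepdim=True)  # downmix to mono
    if orig_sr != 16000:
        waveform = torchaudio.functional.resample(waveform, orig_sr, 16000)
    return waveform

def prepare_video(rgb_frames: torch.Tensor) -> torch.Tensor:
    """[frames, 3, H, W] RGB lip crops at 25 fps -> [1, frames, 1, 88, 88]."""
    gray = TF.rgb_to_grayscale(rgb_frames)                        # [frames, 1, H, W]
    gray = torch.nn.functional.interpolate(
        gray, size=(88, 88), mode="bilinear", align_corners=False
    )
    return gray.unsqueeze(0)                                       # add batch dimension
```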

## Training Details

-   **Dataset**: VoxCeleb2 (2-speaker mixtures at 0dB SNR)
-   **Training**: ~200K steps with Adam optimizer
-   **Augmentation**: Random mixing, noise addition, video frame dropout
-   **Loss**: SI-SNR (Scale-Invariant Signal-to-Noise Ratio); see the sketch below
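
For reference, SI-SNR has a simple closed form. A minimal PyTorch implementation of the metric is sketched here as an illustration, not the repository's actual training code:

```python
import torch

def si_snr(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-Invariant SNR in dB for [batch, samples] tensors (higher is better)."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)  # remove DC offset
    target = target - target.mean(dim=-1, keepdim=True)
    # Project the estimate onto the target to get its scaled "clean" component.
    dot = (estimate * target).sum(dim=-1, keepdim=True)
    s_target = dot * target / (target.pow(2).sum(dim=-1, keepdim=True) + eps)
    e_noise = estimate - s_target
    ratio = s_target.pow(2).sum(dim=-1) / (e_noise.pow(2).sum(dim=-1) + eps)
    return 10 * torch.log10(ratio + eps)

# Training typically minimizes the negative SI-SNR:
# loss = -si_snr(model_output, clean_reference).mean()
```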

## Use Cases

-   🎧 **Hearing Aids**: Camera-based speech enhancement
-   💼 **Video Conferencing**: Noise suppression with visual context
-   🚗 **In-Car Assistants**: Driver speech extraction
-   🥽 **AR/VR**: Immersive communication in noisy environments
-   📱 **Edge Devices**: Efficient deployment on mobile/embedded systems

## Limitations

-   Requires frontal or near-frontal face view for optimal performance
-   Works best with 25fps video input
-   Trained on English speech (may need fine-tuning for other languages)
-   Performance degrades with severe occlusions or low lighting

## Citation

```bibtex
@misc{li2025dolphin,
  title={Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention}, 
  author={Kai Li and Kejun Gao and Xiaolin Hu},
  year={2025},
  eprint={2509.23610},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2509.23610}
}
```

## License

Apache-2.0 License. See [LICENSE](https://github.com/JusperLee/Dolphin/blob/main/LICENSE) for details.

## Acknowledgments

Built with inspiration from IIANet and SepReformer. Thanks to the Hugging Face team for hosting!

## Contact

-   📧 Email: [email protected]
-   🐛 Issues: [GitHub Issues](https://github.com/JusperLee/Dolphin/issues)
-   💬 Discussions: [GitHub Discussions](https://github.com/JusperLee/Dolphin/discussions)

---

**Developed by the Audio and Speech Group at Tsinghua University** 🎓