---
license: apache-2.0
datasets:
- alibabasglab/VoxCeleb2-mix
language:
- en
tags:
- speech
pipeline_tag: audio-to-audio
---

# Dolphin: Audio-Visual Speech Separation Model

Dolphin is a state-of-the-art audio-visual speech separation model that leverages both audio and visual information to separate target speech from background noise and competing speakers.

## Model Description

This model implements the Dolphin architecture for audio-visual speech separation, combining:

- An audio encoder for processing audio signals
- A video encoder for processing visual lip movements
- A multi-modal fusion mechanism
- A transformer-based separator with global and local attention blocks

## Usage

```python
import torch
from huggingface_hub import PyTorchModelHubMixin  # Dolphin inherits from this mixin

# The Dolphin class must be importable from the project's model code.
# Load the weights directly from the Hugging Face Hub
# (replace the repo id placeholder with the actual repository).
model = Dolphin.from_pretrained("your-username/dolphin-model")
model.eval()

# Example inputs
audio_input = torch.randn(1, 16000)          # 1 second of audio at 16 kHz
video_input = torch.randn(1, 25, 1, 88, 88)  # 25 frames of 88x88 grayscale video (lip region)

# Perform speech separation
with torch.no_grad():
    separated_audio = model(audio_input, video_input)
```

## Model Architecture

- **Audio Encoder**: Processes raw audio waveforms
- **Video Encoder**: Processes lip-movement sequences
- **Feature Projector**: Projects audio features to the separator's dimensions
- **Separator**: Multi-stage transformer with global and local attention
- **Audio Decoder**: Reconstructs the separated audio waveform

## Training Data

The model was trained on audio-visual speech separation datasets with mixed-speech scenarios (see the VoxCeleb2-mix dataset listed in the metadata above).

## Citation

If you use this model in your research, please cite the original Dolphin paper.

## License

This model is released under the Apache-2.0 License.
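
## Loading via `PyTorchModelHubMixin`

The usage snippet above assumes that the `Dolphin` class inherits from `huggingface_hub.PyTorchModelHubMixin`, which is what provides `from_pretrained` on the class. The sketch below is a hypothetical stub, not the actual Dolphin implementation: the class name `DolphinStub`, the `hidden_dim` argument, and the placeholder layer are illustrative only. It shows how the mixin wires a plain PyTorch module into Hub-style loading and saving.

```python
import torch
import torch.nn as nn
from huggingface_hub import PyTorchModelHubMixin


class DolphinStub(nn.Module, PyTorchModelHubMixin):
    """Hypothetical stand-in for the real Dolphin separator.

    Inheriting from PyTorchModelHubMixin adds from_pretrained(),
    save_pretrained(), and push_to_hub() to the class.
    """

    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        # Placeholder layer; the real model contains the audio/video
        # encoders, fusion module, separator, and audio decoder.
        self.filter = nn.Conv1d(1, 1, kernel_size=3, padding=1)

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # The real model fuses audio and video features; this stub only
        # filters the audio so the example stays runnable.
        return self.filter(audio.unsqueeze(1)).squeeze(1)


if __name__ == "__main__":
    model = DolphinStub()
    audio = torch.randn(1, 16000)          # 1 second of audio at 16 kHz
    video = torch.randn(1, 25, 1, 88, 88)  # 25 frames of 88x88 grayscale video
    out = model(audio, video)              # shape: (1, 16000)

    # Save weights (and, on recent huggingface_hub versions, a config.json) locally,
    # then reload them the same way from_pretrained works against the Hub.
    model.save_pretrained("./dolphin-stub")
    restored = DolphinStub.from_pretrained("./dolphin-stub")
```

`save_pretrained` and `from_pretrained` accept a local directory or a Hub repo id, and `push_to_hub` uploads the same files to the Hub, which is how a published checkpoint becomes loadable with the one-liner in the Usage section.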