Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation
Abstract
Omnimodal Referring Audio-Visual Segmentation (OmniAVS) and Omnimodal Instructed Segmentation Assistant (OISA) advance audio-visual segmentation by integrating complex multimodal expressions and leveraging a multimodal large language model (MLLM) for reasoning-based segmentation.
Referring audio-visual segmentation (RAVS) has recently seen significant advancements, yet challenges remain in integrating multimodal information and in deeply understanding and reasoning about audiovisual content. To extend the boundaries of RAVS and facilitate future research in this field, we propose Omnimodal Referring Audio-Visual Segmentation (OmniAVS), a new dataset containing 2,098 videos and 59,458 multimodal referring expressions. OmniAVS stands out with three key innovations: (1) 8 types of multimodal expressions that flexibly combine text, speech, sound, and visual cues; (2) an emphasis on understanding audio content beyond merely detecting its presence; and (3) the inclusion of complex reasoning and world knowledge in expressions. Furthermore, we introduce the Omnimodal Instructed Segmentation Assistant (OISA) to address the challenges of multimodal reasoning and fine-grained understanding of audiovisual content in OmniAVS. OISA uses an MLLM to comprehend complex cues and perform reasoning-based segmentation. Extensive experiments show that OISA outperforms existing methods on OmniAVS and achieves competitive results on other related tasks.
Community
OmniAVS: a dataset and method for Omnimodal Referring Audio-Visual Segmentation.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Object-centric Video Question Answering with Visual Grounding and Referring (2025)
- LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance (2025)
- Bridging Audio and Vision: Zero-Shot Audiovisual Segmentation by Connecting Pretrained Models (2025)
- InterRVOS: Interaction-aware Referring Video Object Segmentation (2025)
- Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder (2025)
- FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation (2025)
- Revisiting Audio-Visual Segmentation with Vision-Centric Transformer (2025)