Abstract
A survey of multimodal referring segmentation techniques, covering advancements in convolutional neural networks, transformers, and large language models for segmenting objects in images, videos, and 3D scenes based on text or audio instructions.
Multimodal referring segmentation aims to segment target objects in visual scenes, such as images, videos, and 3D scenes, based on referring expressions given in text or audio form. This task plays a crucial role in practical applications that require accurate object perception from user instructions. Over the past decade, it has gained significant attention in the multimodal community, driven by advances in convolutional neural networks, transformers, and large language models, all of which have substantially improved multimodal perception capabilities. This paper provides a comprehensive survey of multimodal referring segmentation. We begin by introducing the field's background, including problem definitions and commonly used datasets. Next, we summarize a unified meta architecture for referring segmentation and review representative methods across three primary visual scenes: images, videos, and 3D scenes. We further discuss Generalized Referring Expression (GREx) methods that address the challenges of real-world complexity, along with related tasks and practical applications. Extensive performance comparisons on standard benchmarks are also provided. We continually track related works at https://github.com/henghuiding/Awesome-Multimodal-Referring-Segmentation.
Community
A survey on multimodal referring segmentation, covering referring segmentation in image, video, audio-visual, and 3D scenes.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation (2025)
- Object-centric Video Question Answering with Visual Grounding and Referring (2025)
- A Comprehensive Survey on Video Scene Parsing: Advances, Challenges, and Prospects (2025)
- Fine-grained Spatiotemporal Grounding on Egocentric Videos (2025)
- SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025)
- FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation (2025)
- MOVE: Motion-Guided Few-Shot Video Object Segmentation (2025)