DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency
Abstract
Given a single labeled example, in-context segmentation aims to segment the corresponding objects in new inputs. This setting, known as one-shot segmentation in few-shot learning, probes a segmentation model's generalization ability and has been applied to various vision tasks, including scene understanding and image/video editing. While recent Segment Anything Models (SAM and SAM2) achieve state-of-the-art results in interactive segmentation, they are not directly applicable to in-context segmentation. In this work, we propose Dual Consistency SAM (DC-SAM), a prompt-tuning method that adapts SAM and SAM2 for in-context segmentation of both images and videos. Our key insight is to enhance SAM's prompt encoder for segmentation by providing high-quality visual prompts. When generating a mask prior, we fuse SAM features to better align with the prompt encoder. We then design a cycle-consistent cross-attention over the fused features and initial visual prompts, and adopt a dual-branch design that uses discriminative positive and negative prompts in the prompt encoder. Furthermore, we design a simple mask-tube training strategy that applies the proposed dual-consistency method to mask tubes. Although DC-SAM is primarily designed for images, it can be seamlessly extended to the video domain with the support of SAM2. Given the absence of in-context segmentation benchmarks in the video domain, we manually curate and construct the first such benchmark from existing video segmentation datasets, named In-Context Video Object Segmentation (IC-VOS), to better assess the in-context capability of models. Extensive experiments demonstrate that our method achieves 55.5 (+1.4) mIoU on COCO-20i, 73.0 (+1.1) mIoU on PASCAL-5i, and a J&F score of 71.52 on the proposed IC-VOS benchmark. Our source code and benchmark are available at https://github.com/zaplm/DC-SAM.
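To make the mechanism described above concrete, below is a minimal PyTorch sketch of a cycle-consistent cross-attention step between fused image features and initial visual prompt tokens. The module name, tensor shapes, and the round-trip consistency check are illustrative assumptions; this is not the released DC-SAM implementation.

```python
# Minimal sketch of cycle-consistent cross-attention (assumptions, not the
# authors' released code). Prompt tokens attend to fused image features,
# features attend back to prompts, and feature locations whose round trip
# falls outside a coarse foreground prior are suppressed before the final
# refinement of the prompts.
import torch
import torch.nn as nn


class CycleConsistentCrossAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # A single attention module is shared across directions for brevity.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, prompts: torch.Tensor, feats: torch.Tensor, fg_prior: torch.Tensor):
        # prompts:  (B, P, C) initial visual prompt tokens
        # feats:    (B, N, C) fused SAM image features (flattened H*W)
        # fg_prior: (B, N)    coarse mask prior over feature locations, values in [0, 1]
        _, a_p2f = self.attn(prompts, feats, feats)      # (B, P, N) prompt -> feature weights
        _, a_f2p = self.attn(feats, prompts, prompts)    # (B, N, P) feature -> prompt weights
        p_star = a_f2p.argmax(dim=-1)                    # prompt each location attends to most
        n_star = a_p2f.argmax(dim=-1)                    # location each prompt attends to most
        n_round = torch.gather(n_star, 1, p_star)        # (B, N) round-trip location of each n
        consistent = torch.gather(fg_prior, 1, n_round)  # keep n only if its round trip is foreground
        masked_feats = feats * consistent.unsqueeze(-1)  # suppress ambiguous locations
        refined, _ = self.attn(prompts, masked_feats, masked_feats)
        return refined                                   # (B, P, C) refined visual prompts


if __name__ == "__main__":
    m = CycleConsistentCrossAttention(dim=256)
    prompts = torch.randn(2, 4, 256)                # 4 hypothetical prompt tokens
    feats = torch.randn(2, 32 * 32, 256)            # 32x32 feature map, flattened
    prior = (torch.rand(2, 32 * 32) > 0.5).float()  # dummy mask prior
    print(m(prompts, feats, prior).shape)           # torch.Size([2, 4, 256])
```

In this sketch the consistency check simply masks feature locations before a final refinement pass; the actual paper may weight or filter attention differently.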
Community
The main contributions of this work are:
- We propose a novel prompt-consistency method based on SAM, called Dual-Consistency SAM (DC-SAM), tailored for one-shot segmentation tasks. It exploits both positive and negative features of the visual prompts, leading to high-quality prompts for in-context segmentation. Furthermore, this design can be easily extended to video tasks by combining it with SAM2 and a new mask-tube design (see the sketch after this list).
- We introduce a novel cycle-consistent cross-attention mechanism that ensures the generated prompts focus on the key regions requiring prompting. When combined with SAM, this mechanism effectively filters out potentially ambiguous components in the features, further enhancing the accuracy and specificity of in-context segmentation.
- We collect a new video in-context segmentation benchmark, IC-VOS (In-Context Video Object Segmentation), featuring manually curated examples sourced from existing video benchmarks. In addition, we benchmark several representative works in IC-VOS.
- With extensive experiments and ablation studies, the proposed method achieves state-of-the-art performance on various datasets and on our newly proposed in-context video segmentation benchmark. DC-SAM achieves 55.5 (+1.4) mIoU on COCO-20i, 73.0 (+1.1) mIoU on PASCAL-5i, and a J&F score of 71.52 on the IC-VOS benchmark.
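As a rough illustration of the dual-branch idea in the first contribution, the sketch below derives one positive and one negative prompt embedding from the support (in-context) example via masked average pooling over foreground and background. In DC-SAM these seeds would be further refined (e.g., by the cycle-consistent cross-attention above) before being passed to SAM's prompt encoder and mask decoder; the function name and shapes here are hypothetical.

```python
# Hypothetical sketch of dual-branch prompt generation (not the released code):
# the positive branch pools support features inside the in-context mask, the
# negative branch pools the background, yielding discriminative prompt seeds.
import torch


def dual_branch_prompts(support_feats: torch.Tensor, support_mask: torch.Tensor, eps: float = 1e-6):
    # support_feats: (B, C, H, W) image features of the support (in-context) example
    # support_mask:  (B, 1, H, W) binary mask of the target object, same spatial size
    fg = support_mask
    bg = 1.0 - support_mask
    pos = (support_feats * fg).sum(dim=(2, 3)) / (fg.sum(dim=(2, 3)) + eps)  # (B, C)
    neg = (support_feats * bg).sum(dim=(2, 3)) / (bg.sum(dim=(2, 3)) + eps)  # (B, C)
    return pos, neg


if __name__ == "__main__":
    feats = torch.randn(1, 256, 64, 64)
    mask = (torch.rand(1, 1, 64, 64) > 0.7).float()
    pos, neg = dual_branch_prompts(feats, mask)
    print(pos.shape, neg.shape)  # torch.Size([1, 256]) torch.Size([1, 256])
```

For videos, one would apply the resulting prompts per frame and supervise the predictions jointly over a short clip of masks (a "mask tube"), which is the spirit of the mask-tube training strategy mentioned in the abstract.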
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- DSV-LFS: Unifying LLM-Driven Semantic Cues with Visual Features for Robust Few-Shot Segmentation (2025)
- DINOv2-powered Few-Shot Semantic Segmentation: A Unified Framework via Cross-Model Distillation and 4D Correlation Mining (2025)
- Reducing Annotation Burden: Exploiting Image Knowledge for Few-Shot Medical Video Object Segmentation via Spatiotemporal Consistency Relearning (2025)
- Intrinsic Saliency Guided Trunk-Collateral Network for Unsupervised Video Object Segmentation (2025)
- CMaP-SAM: Contraction Mapping Prior for SAM-driven Few-shot Segmentation (2025)
- TSAL: Few-shot Text Segmentation Based on Attribute Learning (2025)
- Find First, Track Next: Decoupling Identification and Propagation in Referring Video Object Segmentation (2025)