Papers
arxiv:2507.05948

Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation

Published on Jul 8
Authors:
,
,
,

Abstract

Geometric awareness through monocular depth estimation enhances video instance segmentation robustness, with the Expanding Depth Channel method achieving state-of-the-art results on the OVIS benchmark.

AI-generated summary

Video Instance Segmentation (VIS) fundamentally struggles with pervasive challenges including object occlusions, motion blur, and appearance variations during temporal association. To overcome these limitations, this work introduces geometric awareness to enhance VIS robustness by strategically leveraging monocular depth estimation. We systematically investigate three distinct integration paradigms. Expanding Depth Channel (EDC) method concatenates the depth map as input channel to segmentation networks; Sharing ViT (SV) designs a uniform ViT backbone, shared between depth estimation and segmentation branches; Depth Supervision (DS) makes use of depth prediction as an auxiliary training guide for feature learning. Though DS exhibits limited effectiveness, benchmark evaluations demonstrate that EDC and SV significantly enhance the robustness of VIS. When with Swin-L backbone, our EDC method gets 56.2 AP, which sets a new state-of-the-art result on OVIS benchmark. This work conclusively establishes depth cues as critical enablers for robust video understanding.

Community

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2507.05948 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2507.05948 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2507.05948 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.