arXiv:2507.05240

StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling

Published on Jul 7, 2025 · Submitted by taiwang on Jul 9, 2025

Abstract

Vision-and-Language Navigation (VLN) in real-world settings requires agents to process continuous visual streams and generate actions with low latency, grounded in language instructions. While Video-based Large Language Models (Video-LLMs) have driven recent progress, current Video-LLM-based VLN methods often face trade-offs among fine-grained visual understanding, long-term context modeling, and computational efficiency. We introduce StreamVLN, a streaming VLN framework that employs a hybrid slow-fast context modeling strategy to support multi-modal reasoning over interleaved vision, language, and action inputs. The fast-streaming dialogue context facilitates responsive action generation through a sliding window of active dialogues, while the slow-updating memory context compresses historical visual states using a 3D-aware token pruning strategy. With this slow-fast design, StreamVLN achieves coherent multi-turn dialogue through efficient KV cache reuse, supporting long video streams with bounded context size and inference cost. Experiments on VLN-CE benchmarks demonstrate state-of-the-art performance with stable low latency, ensuring robustness and efficiency in real-world deployment. The project page is: https://streamvln.github.io/
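
To make the slow-fast design concrete, here is a minimal Python sketch of the control flow the abstract describes: a bounded sliding window of active dialogue turns (the fast context) and a fixed-budget memory that absorbs compressed visual tokens from evicted turns (the slow context). This is an illustration only, not the paper's implementation: the class and parameter names (SlowFastContext, window_size, memory_budget, saliency) are hypothetical, and the saliency-based top-k pruning here is a stand-in for the paper's 3D-aware token pruning; real KV cache reuse inside the model is also not shown.

from collections import deque

class SlowFastContext:
    """Toy sketch of slow-fast context management (hypothetical names)."""

    def __init__(self, window_size=8, memory_budget=256):
        self.window = deque(maxlen=window_size)  # fast: active dialogue turns
        self.memory = []                         # slow: compressed visual history
        self.memory_budget = memory_budget       # bounds total context size

    def add_turn(self, turn):
        # Before the oldest turn falls out of the sliding window,
        # compress its visual tokens into the slow memory.
        if len(self.window) == self.window.maxlen:
            evicted = self.window[0]
            self._compress(evicted["visual_tokens"])
        self.window.append(turn)

    def _compress(self, tokens):
        # Stand-in for 3D-aware pruning: keep only the highest-saliency
        # tokens within the fixed budget, so memory never grows unbounded.
        self.memory.extend(tokens)
        self.memory.sort(key=lambda t: t["saliency"], reverse=True)
        del self.memory[self.memory_budget:]

    def build_context(self):
        # Model input = slow memory (long-term history) + fast window
        # (recent turns); in the real system the window's KV cache is reused.
        return {"memory": list(self.memory), "window": list(self.window)}

# Usage: context stays bounded no matter how long the stream runs.
ctx = SlowFastContext(window_size=4, memory_budget=16)
for step in range(12):
    ctx.add_turn({
        "instruction": "go to the kitchen",
        "visual_tokens": [{"id": step, "saliency": (step * 7) % 10}],
    })
state = ctx.build_context()
print(len(state["window"]), len(state["memory"]))  # 4 turns, <= 16 tokens

The point of the sketch is the invariant: both the window and the memory have fixed capacity, which is why inference cost stays bounded as the video stream grows.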

