arXiv:2510.26909

NaviTrace: Evaluating Embodied Navigation of Vision-Language Models

Published on Oct 30 · Submitted by Moritz Reuss on Nov 4

Abstract

Vision-language models demonstrate unprecedented performance and generalization across a wide range of tasks and scenarios. Integrating these foundation models into robotic navigation systems opens pathways toward building general-purpose robots. Yet, evaluating these models' navigation capabilities remains constrained by costly real-world trials, overly simplified simulations, and limited benchmarks. We introduce NaviTrace, a high-quality Visual Question Answering benchmark where a model receives an instruction and embodiment type (human, legged robot, wheeled robot, bicycle) and must output a 2D navigation trace in image space. Across 1000 scenarios and more than 3000 expert traces, we systematically evaluate eight state-of-the-art VLMs using a newly introduced semantic-aware trace score. This metric combines Dynamic Time Warping distance, goal endpoint error, and embodiment-conditioned penalties derived from per-pixel semantics, and it correlates with human preferences. Our evaluation reveals a consistent gap to human performance caused by poor spatial grounding and goal localization. NaviTrace establishes a scalable and reproducible benchmark for real-world robotic navigation. The benchmark and leaderboard can be found at https://leggedrobotics.github.io/navitrace_webpage/.

AI-generated summary

NaviTrace is a Visual Question Answering benchmark for evaluating robotic navigation capabilities using a semantic-aware trace score across various scenarios and embodiment types.
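
The semantic-aware trace score described in the abstract combines three terms: a Dynamic Time Warping (DTW) distance between the predicted and expert traces, the error at the goal endpoint, and an embodiment-conditioned penalty derived from per-pixel semantics. The sketch below only illustrates how such a score could be assembled; the equal weighting, the length normalization, and the penalty_table mapping class ids to embodiment-specific traversal costs are assumptions, not the paper's actual metric.

    import numpy as np

    def dtw_distance(pred: np.ndarray, gt: np.ndarray) -> float:
        """Length-normalized Dynamic Time Warping distance between two 2D traces."""
        n, m = len(pred), len(gt)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = np.linalg.norm(pred[i - 1] - gt[j - 1])
                cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
        return float(cost[n, m]) / (n + m)

    def trace_score(pred, gt, semantics, penalty_table, weights=(1.0, 1.0, 1.0)):
        """Lower is better: DTW term + goal endpoint error + semantic penalty.

        pred, gt:      (N, 2) and (M, 2) arrays of (x, y) pixel coordinates
        semantics:     (H, W) per-pixel class-id map of the scene image
        penalty_table: class id -> traversal cost for the chosen embodiment,
                       e.g. tall grass cheap for a legged robot, costly for a bicycle
        """
        pred, gt = np.asarray(pred, float), np.asarray(gt, float)
        dtw = dtw_distance(pred, gt)
        endpoint = float(np.linalg.norm(pred[-1] - gt[-1]))

        # Look up the semantic class under each predicted waypoint (clipped to the image).
        h, w = semantics.shape
        xs = np.clip(pred[:, 0].astype(int), 0, w - 1)
        ys = np.clip(pred[:, 1].astype(int), 0, h - 1)
        classes = semantics[ys, xs]
        semantic = float(np.mean([penalty_table.get(int(c), 0.0) for c in classes]))

        w_dtw, w_end, w_sem = weights
        return w_dtw * dtw + w_end * endpoint + w_sem * semantic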

Community


NaviTrace is a novel VQA benchmark for evaluating vision–language models (VLMs) on their embodiment-specific understanding of navigation across diverse real-world scenarios. Given a natural-language instruction and an embodiment type (human, legged robot, wheeled robot, bicycle), a model must output a 2D navigation path in image space, which we call a trace.

The benchmark includes 1000 scenarios with 3000+ expert traces, divided into:

  • Validation split (50%) for experimentation and model fine-tuning.
  • Test split (50%) with hidden ground truths for a fair leaderboard evaluation.

The dataset is available on Hugging Face. We provide ready-to-use evaluation scripts for API-based model inference and scoring, along with a leaderboard where users can compute scores on the test split and optionally submit their models for public comparison.
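
For reference, pulling the data with the Hugging Face datasets library might look like the following minimal example. The repository ID and field names are assumptions for illustration; the actual identifiers are documented on the dataset card and project page.

    from datasets import load_dataset

    # Repository ID assumed for illustration; see the dataset card for the real one.
    val = load_dataset("leggedrobotics/NaviTrace", split="validation")

    sample = val[0]
    # Expected per-sample content (field names assumed): the scene image, the
    # natural-language instruction, the embodiment type, and the expert trace(s).
    print(sample.keys())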
