arXiv:2510.26909

NaviTrace: Evaluating Embodied Navigation of Vision-Language Models

Published on Oct 30 · Submitted by Moritz Reuss on Nov 4

Abstract

Vision-language models demonstrate unprecedented performance and generalization across a wide range of tasks and scenarios. Integrating these foundation models into robotic navigation systems opens pathways toward building general-purpose robots. Yet, evaluating these models' navigation capabilities remains constrained by costly real-world trials, overly simplified simulations, and limited benchmarks. We introduce NaviTrace, a high-quality Visual Question Answering benchmark where a model receives an instruction and embodiment type (human, legged robot, wheeled robot, bicycle) and must output a 2D navigation trace in image space. Across 1000 scenarios and more than 3000 expert traces, we systematically evaluate eight state-of-the-art VLMs using a newly introduced semantic-aware trace score. This metric combines Dynamic Time Warping distance, goal endpoint error, and embodiment-conditioned penalties derived from per-pixel semantics, and it correlates with human preferences. Our evaluation reveals a consistent gap to human performance caused by poor spatial grounding and goal localization. NaviTrace establishes a scalable and reproducible benchmark for real-world robotic navigation. The benchmark and leaderboard can be found at https://leggedrobotics.github.io/navitrace_webpage/.

AI-generated summary

NaviTrace is a Visual Question Answering benchmark for evaluating robotic navigation capabilities using a semantic-aware trace score across various scenarios and embodiment types.
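
The semantic-aware trace score described in the abstract combines three terms: a Dynamic Time Warping (DTW) distance between the predicted and expert traces, the error at the goal endpoint, and an embodiment-conditioned penalty derived from per-pixel semantics. The sketch below only illustrates how such a score could be assembled; the equal weighting, the length normalization, and the penalty_table mapping class ids to embodiment-specific traversal costs are assumptions, not the paper's actual metric.

    import numpy as np

    def dtw_distance(pred: np.ndarray, gt: np.ndarray) -> float:
        """Length-normalized Dynamic Time Warping distance between two 2D traces."""
        n, m = len(pred), len(gt)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = np.linalg.norm(pred[i - 1] - gt[j - 1])
                cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
        return float(cost[n, m]) / (n + m)

    def trace_score(pred, gt, semantics, penalty_table, weights=(1.0, 1.0, 1.0)):
        """Lower is better: DTW term + goal endpoint error + semantic penalty.

        pred, gt:      (N, 2) and (M, 2) arrays of (x, y) pixel coordinates
        semantics:     (H, W) per-pixel class-id map of the scene image
        penalty_table: class id -> traversal cost for the chosen embodiment,
                       e.g. tall grass cheap for a legged robot, costly for a bicycle
        """
        pred, gt = np.asarray(pred, float), np.asarray(gt, float)
        dtw = dtw_distance(pred, gt)
        endpoint = float(np.linalg.norm(pred[-1] - gt[-1]))

        # Look up the semantic class under each predicted waypoint (clipped to the image).
        h, w = semantics.shape
        xs = np.clip(pred[:, 0].astype(int), 0, w - 1)
        ys = np.clip(pred[:, 1].astype(int), 0, h - 1)
        classes = semantics[ys, xs]
        semantic = float(np.mean([penalty_table.get(int(c), 0.0) for c in classes]))

        w_dtw, w_end, w_sem = weights
        return w_dtw * dtw + w_end * endpoint + w_sem * semantic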

Community


NaviTrace is a novel VQA benchmark for evaluating vision–language models (VLMs) on their embodiment-specific understanding of navigation across diverse real-world scenarios. Given a natural-language instruction and an embodiment type (human, legged robot, wheeled robot, bicycle), a model must output a 2D navigation path in image space, which we call a trace.

The benchmark includes 1000 scenarios with 3000+ expert traces, divided into:

  • Validation split (50%) for experimentation and model fine-tuning.
  • Test split (50%) with hidden ground truths for a fair leaderboard evaluation.

The dataset is available on Hugging Face. We provide ready-to-use evaluation scripts for API-based model inference and scoring, along with a leaderboard where users can compute scores on the test split and optionally submit their models for public comparison.
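
For reference, pulling the data with the Hugging Face datasets library might look like the following minimal example. The repository ID and field names are assumptions for illustration; the actual identifiers are documented on the dataset card and project page.

    from datasets import load_dataset

    # Repository ID assumed for illustration; see the dataset card for the real one.
    val = load_dataset("leggedrobotics/NaviTrace", split="validation")

    sample = val[0]
    # Expected per-sample content (field names assumed): the scene image, the
    # natural-language instruction, the embodiment type, and the expert trace(s).
    print(sample.keys())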
