arxiv:2506.06205

Astra: Toward General-Purpose Mobile Robots via Hierarchical Multimodal Learning

Published on Jun 6

· Submitted by

sc-bd on Jun 10

Upvote

Authors:

Sheng Chen ,

Abstract

Astra, a dual-model architecture for mobile robot navigation, uses a multimodal LLM for global localization and a multitask network for local path planning and odometry estimation, achieving high success rates in diverse indoor environments.

AI-generated summary

Modern robot navigation systems encounter difficulties in diverse and complex indoor environments. Traditional approaches rely on multiple modules with small models or rule-based systems and thus lack adaptability to new environments. To address this, we developed Astra, a comprehensive dual-model architecture, Astra-Global and Astra-Local, for mobile robot navigation. Astra-Global, a multimodal LLM, processes vision and language inputs to perform self and goal localization using a hybrid topological-semantic graph as the global map, and outperforms traditional visual place recognition methods. Astra-Local, a multitask network, handles local path planning and odometry estimation. Its 4D spatial-temporal encoder, trained through self-supervised learning, generates robust 4D features for downstream tasks. The planning head utilizes flow matching and a novel masked ESDF loss to minimize collision risks for generating local trajectories, and the odometry head integrates multi-sensor inputs via a transformer encoder to predict the relative pose of the robot. Deployed on real in-house mobile robots, Astra achieves high end-to-end mission success rate across diverse indoor environments.