arxiv:2509.15490

SmolRGPT: Efficient Spatial Reasoning for Warehouse Environments with 600M Parameters

Published on Sep 18, 2025

AI-generated summary

SmolRGPT, a compact vision-language model with 600M parameters, achieves competitive performance on spatial reasoning tasks in resource-constrained environments by integrating RGB and depth cues.

Abstract

Recent advances in vision-language models (VLMs) have enabled powerful multimodal reasoning, but state-of-the-art approaches typically rely on extremely large models with prohibitive computational and memory requirements. This makes their deployment challenging in resource-constrained environments such as warehouses, robotics, and industrial applications, where both efficiency and robust spatial understanding are critical. In this work, we present SmolRGPT, a compact vision-language architecture that explicitly incorporates region-level spatial reasoning by integrating both RGB and depth cues. SmolRGPT employs a three-stage curriculum that progressively aligns visual and language features, enables spatial relationship understanding, and adapts to task-specific datasets. We demonstrate that with only 600M parameters, SmolRGPT achieves competitive results on challenging warehouse spatial reasoning benchmarks, matching or exceeding the performance of much larger alternatives. These findings highlight the potential for efficient, deployable multimodal intelligence in real-world settings without sacrificing core spatial reasoning capabilities. The code for the experiments will be available at: https://github.com/abtraore/SmolRGPT
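The abstract does not specify how the RGB and depth streams are combined. As a rough, hypothetical sketch of what region-level RGB-depth fusion could look like in a compact VLM (the module name, dimensions, and concatenation scheme are assumptions for illustration, not the paper's design):

```python
# Hypothetical sketch (not the authors' code): project RGB and depth
# region features into a shared language-model token space.
import torch
import torch.nn as nn

class RGBDRegionConnector(nn.Module):
    """Maps RGB and depth region features to LM input embeddings.

    All shapes and dimensions here are illustrative; the actual
    SmolRGPT architecture may differ.
    """
    def __init__(self, vision_dim: int = 768, lm_dim: int = 576):
        super().__init__()
        # Separate projections let the model weight appearance (RGB)
        # and geometry (depth) cues independently.
        self.rgb_proj = nn.Linear(vision_dim, lm_dim)
        self.depth_proj = nn.Linear(vision_dim, lm_dim)

    def forward(self, rgb_feats: torch.Tensor,
                depth_feats: torch.Tensor) -> torch.Tensor:
        # rgb_feats, depth_feats: (batch, num_regions, vision_dim)
        rgb_tokens = self.rgb_proj(rgb_feats)
        depth_tokens = self.depth_proj(depth_feats)
        # Concatenate along the sequence axis so the language model
        # can attend jointly to both modalities per region.
        return torch.cat([rgb_tokens, depth_tokens], dim=1)

connector = RGBDRegionConnector()
rgb = torch.randn(1, 4, 768)    # e.g. 4 region crops from a vision encoder
depth = torch.randn(1, 4, 768)  # matching depth-map crops
tokens = connector(rgb, depth)
print(tokens.shape)  # torch.Size([1, 8, 576])
```

The region tokens would then be interleaved with text tokens before being fed to the 600M-parameter language backbone.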
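The three-stage curriculum is described only at the level of goals (feature alignment, spatial relationship understanding, task adaptation). Below is a minimal sketch of how such a schedule is commonly implemented by freezing and unfreezing parameter groups; which modules train at each stage is an assumption, not a detail from the paper:

```python
# Hypothetical three-stage training schedule. Stage goals follow the
# abstract; the choice of trainable modules per stage is an assumption.
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    """Enable or disable gradients for every parameter in a module."""
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(stage: int, connector: nn.Module, lm: nn.Module) -> None:
    if stage == 1:
        # Stage 1: align visual and language features -- train only the
        # connector while the pretrained language model stays frozen.
        set_trainable(connector, True)
        set_trainable(lm, False)
    elif stage == 2:
        # Stage 2: spatial relationship understanding -- unfreeze the LM
        # so it can reason over the region tokens.
        set_trainable(connector, True)
        set_trainable(lm, True)
    else:
        # Stage 3: adapt to task-specific datasets (e.g. warehouse
        # spatial QA), typically with a lower learning rate.
        set_trainable(connector, True)
        set_trainable(lm, True)

# Usage: call configure_stage(n, connector, lm) before each stage's
# training loop, swapping in that stage's dataset.
```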
