arXiv:2505.19080

ReFineVLA: Reasoning-Aware Teacher-Guided Transfer Fine-Tuning

Published on May 25, 2025

Abstract

AI-generated summary: ReFineVLA enhances VLA models by incorporating expert-generated reasoning rationales, improving their interpretability, generalization, and multimodal understanding in robotic manipulation tasks.

Vision-Language-Action (VLA) models have attracted considerable attention from the research community for their ability to translate multimodal observations and linguistic instructions into robotic actions. Despite recent advances, VLAs often learn only the functional input-action mapping and overlook explicit reasoning, omitting logical steps that are crucial for interpretability and generalization in complex, long-horizon manipulation tasks. In this work, we propose ReFineVLA, a multimodal reasoning-aware framework that fine-tunes VLAs with teacher-guided rationales. We first augment robotic datasets with reasoning rationales generated by an expert teacher model, guiding VLA models to learn to reason about their actions. We then use ReFineVLA to fine-tune pre-trained VLAs on the reasoning-enriched datasets, preserving their inherent generalization abilities while boosting their reasoning capabilities. In addition, we visualize attention maps to analyze the alignment among visual attention, linguistic prompts, and the actions to be executed, showcasing ReFineVLA's ability to focus on task-relevant regions and actions. This analysis shows that ReFineVLA-trained models shift attention meaningfully toward relevant objects, highlighting enhanced multimodal understanding and improved generalization. Evaluated across manipulation tasks, ReFineVLA outperforms state-of-the-art baselines: it achieves an average success-rate increase of 5.0% on SimplerEnv WidowX Robot tasks, and improves by an average of 8.6% in the variant aggregation setting and 1.7% in the visual matching setting on SimplerEnv Google Robot tasks. The source code will be publicly available.
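
Since the source code has not yet been released, the following is a minimal sketch of the two-stage recipe the abstract describes: augment each demonstration with a teacher-generated rationale, then supervise the VLA on the rationale tokens before the action tokens. All names here (`Sample`, `augment_with_rationale`, `reasoning_aware_loss`, the prompt wording, and the `alpha` weighting) are illustrative assumptions, not the paper's actual API.

```python
# Hypothetical sketch of ReFineVLA's two-stage recipe, per the abstract.
from dataclasses import dataclass

import torch
import torch.nn.functional as F


@dataclass
class Sample:
    image: object             # raw visual observation (e.g. a PIL image)
    instruction: str          # language command
    action_ids: torch.Tensor  # discretized action tokens


def augment_with_rationale(sample, teacher):
    """Stage 1: ask an expert teacher model to explain the demonstrated
    action. `teacher` is any callable VLM wrapper returning a string;
    the prompt wording is an assumption, not the paper's template."""
    prompt = (
        f"Instruction: {sample.instruction}\n"
        "Explain step by step why the demonstrated action is correct."
    )
    return sample, teacher(sample.image, prompt)


def reasoning_aware_loss(logits, rationale_ids, action_ids, alpha=0.5):
    """Stage 2: supervise both rationale and action tokens.

    `logits` has shape (T, V) and is aligned with the concatenated
    target sequence [rationale_ids ; action_ids]. The fixed `alpha`
    weighting between the two terms is an assumption."""
    targets = torch.cat([rationale_ids, action_ids])
    per_token = F.cross_entropy(logits, targets, reduction="none")
    n = rationale_ids.numel()  # rationale tokens come first
    return alpha * per_token[:n].mean() + (1 - alpha) * per_token[n:].mean()
```

The key design point this sketch illustrates is that the rationale becomes part of the training target rather than an auxiliary input, so the fine-tuned model is pushed to produce the reasoning before committing to an action.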
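
To make the attention-shift analysis concrete, here is a minimal visualization sketch assuming a ViT-style backbone from which a text-to-patch cross-attention tensor can be extracted; the tensor layout `(heads, text_tokens, num_patches)` and the function name are assumptions, since the abstract does not specify the model's internals.

```python
import torch
import matplotlib.pyplot as plt


def plot_text_to_patch_attention(attn, grid_hw, image, title=""):
    """Average a (heads, text_tokens, num_patches) cross-attention tensor
    over heads and text tokens, reshape it to the ViT patch grid, and
    overlay it on the input image (assumed to be a PIL image)."""
    h, w = grid_hw
    heat = attn.mean(dim=(0, 1)).reshape(h, w)   # patch-level saliency
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
    plt.imshow(image)
    plt.imshow(heat.cpu().numpy(), cmap="jet", alpha=0.5,
               extent=(0, image.width, image.height, 0))
    plt.title(title)
    plt.axis("off")
    plt.show()
```

Comparing such overlays before and after fine-tuning is one straightforward way to check for the attention shift toward task-relevant objects that the abstract reports.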
