Paper title and link
The model was presented in the paper From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models.
Paper abstract
The abstract of the paper is the following:
One promise that Vision-Language-Action (VLA) models hold over traditional imitation learning for robotics is to leverage the broad generalization capabilities of large Vision-Language Models (VLMs) to produce versatile, "generalist" robot policies. However, current evaluations of VLAs remain insufficient. Traditional imitation learning benchmarks are unsuitable due to the lack of language instructions. Emerging benchmarks for VLAs that incorporate language often come with limited evaluation tasks and do not intend to investigate how much VLM pretraining truly contributes to the generalization capabilities of the downstream robotic policy. Meanwhile, much research relies on real-world robot setups designed in isolation by different institutions, which creates a barrier for reproducibility and accessibility. To address this gap, we introduce a unified probing suite of 50 simulation-based tasks across 10 subcategories spanning language instruction, vision, and objects. We systematically evaluate several state-of-the-art VLA architectures on this suite to understand their generalization capability. Our results show that while VLM backbones endow VLAs with robust perceptual understanding and high level planning, which we refer to as good intentions, this does not reliably translate into precise motor execution: when faced with out-of-distribution observations, policies often exhibit coherent intentions, but falter in action execution. Moreover, finetuning on action data can erode the original VLM's generalist reasoning abilities. We release our task suite and evaluation code to serve as a standardized benchmark for future VLAs and to drive research on closing the perception-to-action gap. More information, including the source code, can be found at this https URL
SpatialVLA Fine-Tuned on fractal & bridge
This model was produced by fine-tuning the SpatialVLA model on the bridge dataset for the SimplerEnv benchmark.
Model Details
Model Description
- Developed by: The SpatialVLA team, consisting of researchers from Shanghai AI Laboratory, ShanghaiTech University, and TeleAI.
- Model type: Vision-language-action (language, image => robot actions)
- Language(s) (NLP): en
- License: MIT
- Finetuned from model: paligemma2-3b-pt-224
- Pretraining Dataset: Open X-Embodiment and RH20T
- Repository: https://github.com/SpatialVLA/SpatialVLA
- Paper: SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model
- Project Page & Videos: https://spatialvla.github.io/
- Project Page (INT-ACT): https://ai4ce.github.io/INT-ACT/
Uses
SpatialVLA relies solely on HuggingFace Transformers 🤗, which makes deployment straightforward. If your environment supports transformers >= 4.47.0, you can directly use the following code to load the model and run inference (requires about 8.5 GB of GPU memory).
Direct Use
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_name_or_path = "IPEC-COMMUNITY/spatialvla-4b-224-pt"

# Load the processor and the model (bfloat16 inference needs ~8.5 GB of GPU memory).
processor = AutoProcessor.from_pretrained(model_name_or_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name_or_path, trust_remote_code=True, torch_dtype=torch.bfloat16).eval().cuda()

# Prepare a single RGB observation and a language instruction.
image = Image.open("example.png").convert("RGB")
prompt = "What action should the robot take to pick the cup?"
inputs = processor(images=[image], text=prompt, return_tensors="pt")

# Predict action tokens and de-normalize them with the statistics of the target dataset.
generation_outputs = model.predict_action(inputs)
actions = processor.decode_actions(generation_outputs, unnorm_key="bridge_orig/1.0.0")
print(actions)
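The exact structure of the decoded output depends on the processor implementation; the sketch below is only an illustration of how one might inspect it, assuming it exposes per-step 7-DoF end-effector commands (delta xyz, delta rpy, gripper), the usual bridge convention. Check the processor code for the actual format.
# Hedged sketch, not part of the official example: unpack the decoded actions,
# assuming 7-DoF end-effector commands [dx, dy, dz, droll, dpitch, dyaw, gripper].
import numpy as np

action_array = np.asarray(actions["actions"] if isinstance(actions, dict) else actions)
for step, a in enumerate(np.atleast_2d(action_array)):
    translation, rotation, gripper = a[:3], a[3:6], a[6]
    print(f"step {step}: dxyz={translation}, drpy={rotation}, gripper={gripper:.2f}")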
Out-of-Scope Use
SpatialVLA models do not zero-shot generalize to new (unseen) robot embodiments or setups that are not represented in the pretraining mix; in these cases, we suggest collecting a dataset of demonstrations on the desired setup and fine-tuning SpatialVLA models instead.
How to Get Your Hands Dirty with the Model
If you want to use the model for fine-tuning or pre-training, you need to clone the official repository first.
git clone https://github.com/SpatialVLA/SpatialVLA.git
Then install the required packages and download the model from the Hugging Face model hub. The VLM backbone of SpatialVLA is PaliGemma 2, which requires transformers >= 4.47.0. Hence, create a Python environment with Python >= 3.10.
conda create -n spatialvla python=3.10
conda activate spatialvla
Install the packages from the requirements.txt file. Note that we use a customised dlimp to support seed setting for reproducibility. If you run into any problems, please manually install dlimp from the dlimp_custom repository.
pip install -r requirements.txt
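As a quick, optional sanity check after installation (not part of the official setup), you can verify that the installed transformers version meets the PaliGemma 2 requirement:
# Optional sanity check: confirm transformers is new enough for the PaliGemma 2 backbone.
from packaging.version import Version
import transformers

assert Version(transformers.__version__) >= Version("4.47.0"), (
    f"transformers {transformers.__version__} found; SpatialVLA needs >= 4.47.0"
)
print("transformers version OK:", transformers.__version__)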
Train from Scratch
SpatialVLA is pre-trained on 1.1 million real-robot demonstrations from the OXE and RH20T datasets on a cluster of 64 A100 GPUs for about 10 days, using a batch size of 2048. You can pre-train the model from scratch using the following command.
# torchrun
bash scripts/spatialvla_4b_pretrain/torchrun_pretrain.sh
# or in a slurm cluster
bash scripts/spatialvla_4b_pretrain/slurm_pretrain.sh
Fine-tuning
Most of our fine-tuning experiments are conducted with LoRA on 4 or 8 A100 GPUs. You can use the following scripts for full-parameter or LoRA fine-tuning; for real-world experiments with small datasets, we prefer LoRA fine-tuning (see the hedged sketch after the commands below).
# full fine-tuning
bash scripts/spatialvla_4b_finetune/finetune_full.sh
# LoRA fine-tuning
bash scripts/spatialvla_4b_finetune/finetune_lora.sh
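For reference only, the snippet below sketches what LoRA adaptation of the SpatialVLA backbone could look like with the Hugging Face peft library. The official finetune_lora.sh script is the supported path; the rank, alpha, and target_modules values here are illustrative assumptions, not the repository's actual configuration.
# Hedged illustration only: the official LoRA path is finetune_lora.sh.
# Hyperparameters and target_modules below are assumptions for demonstration.
import torch
from transformers import AutoModel, AutoProcessor
from peft import LoraConfig, get_peft_model

model_name_or_path = "IPEC-COMMUNITY/spatialvla-4b-224-pt"
processor = AutoProcessor.from_pretrained(model_name_or_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name_or_path, trust_remote_code=True, torch_dtype=torch.bfloat16)

# Attach LoRA adapters to the attention projections of the language backbone.
lora_config = LoraConfig(
    r=32,                      # assumed rank
    lora_alpha=32,             # assumed scaling
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed module names
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# ...then train with your own dataloader of (image, instruction, action) tuples.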
Evaluation
- SimplerEnv evaluation on Google Robot tasks.
[Table 1]
- SimplerEnv evaluation on WidowX Robot tasks.
[Table 2]
- Zero-shot Robot Control Evaluation on WidowX Robot.
[Image 1]
- Spatial Understanding Capability Evaluation.
[Image 2]
Citation
BibTeX:
@misc{qu2025spatialvlaexploringspatialrepresentations,
title={SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model},
author={Delin Qu and Haoming Song and Qizhi Chen and Yuanqi Yao and Xinyi Ye and Yan Ding and Zhigang Wang and JiaYuan Gu and Bin Zhao and Dong Wang and Xuelong Li},
year={2025},
eprint={2501.15830},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2501.15830},
}
Note: [Table 1] and [Table 2] refer to the tables present in the original model card. [Image 1] and [Image 2] refer to the images. These are not recreated here due to their length and complexity.