SpatialVLA

SpatialVLA is a spatial-enhanced vision-language-action model trained on 1.1 million real robot episodes. The code is purely HuggingFace-based, concise, and efficient to run.

All SpatialVLA checkpoints, as well as our training codebase, are released under the MIT License.

For full details, please read our paper and see our project page.

Model Details

Model Description

Uses

SpatialVLA relies solely on HuggingFace Transformers 🤗, which makes deployment straightforward. If your environment supports transformers >= 4.47.0, you can directly use the following code to load the model and perform inference (this requires about 8.5 GB of GPU memory).

Direct Use

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_name_or_path = "IPEC-COMMUNITY/spatialvla-4b-224-pt"

# Load the processor and model; the custom modeling code shipped with the
# checkpoint requires trust_remote_code=True.
processor = AutoProcessor.from_pretrained(model_name_or_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name_or_path, trust_remote_code=True, torch_dtype=torch.bfloat16).eval().cuda()

# Prepare a single observation image and a language instruction.
image = Image.open("example.png").convert("RGB")
prompt = "What action should the robot take to pick the cpu?"
inputs = processor(images=[image], text=prompt, return_tensors="pt")

# Predict action tokens and decode them into a continuous robot action,
# un-normalized with the statistics of the selected dataset.
generation_outputs = model.predict_action(inputs)
actions = processor.decode_actions(generation_outputs, unnorm_key="bridge_orig/1.0.0")
print(actions)
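The unnorm_key argument selects which dataset's action statistics are used to un-normalize the predicted actions (here bridge_orig/1.0.0, i.e. the Bridge data from the pretraining mix); pick the key that matches your robot setup. In practice the model is queried in a closed loop: grab the latest camera frame, predict an action, decode it, and execute it. The sketch below reuses the model and processor from the snippet above; get_camera_frame() and robot.execute() are hypothetical placeholders for your own camera and robot interfaces, not part of this repository.

# Closed-loop rollout sketch. `get_camera_frame` and `robot` are hypothetical
# placeholders for your own camera and robot interfaces.
def run_episode(model, processor, robot, prompt, unnorm_key="bridge_orig/1.0.0", max_steps=100):
    for _ in range(max_steps):
        image = get_camera_frame()  # hypothetical: returns a PIL.Image of the current observation
        inputs = processor(images=[image], text=prompt, return_tensors="pt")
        generation_outputs = model.predict_action(inputs)
        actions = processor.decode_actions(generation_outputs, unnorm_key=unnorm_key)
        robot.execute(actions)  # hypothetical: sends the decoded action to the robot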

Out-of-Scope Use

SpatialVLA models do not zero-shot generalize to new (unseen) robot embodiments or to setups that are not represented in the pretraining mix; in these cases, we suggest collecting a dataset of demonstrations on the desired setup and fine-tuning SpatialVLA instead.

How to Get Hands Dirty with the Model

If you want to use the model for fine-tuning or pre-training, you need to clone the official repository first.

git clone https://github.com/SpatialVLA/SpatialVLA.git

Then install the required packages and download the model from the Hugging Face model hub. The VLM backbone of SpatialVLA is PaliGemma 2, which requires transformers >= 4.47.0, so create a Python environment with Python >= 3.10.

conda create -n spatialvla python=3.10
conda activate spatialvla

Install the packages from the requirements.txt file. Note that we use a customised dlimp to support seed setting for reproducibility. If you encounter any problems, please manually install dlimp from dlimp_custom.

pip install -r requirements.txt

Train from Scratch

SpatialVLA is pre-trained with 1.1 million real-robot demonstrations from the OXE and RH20T datasets on a cluster of 64 A100 GPUs for about 10 days, using a batch size of 2048. You can pre-train the model from scratch using the following commands.

# torchrun
bash scripts/spatialvla_4b_pretrain/torchrun_pretrain.sh

# or in a slurm cluster
bash scripts/spatialvla_4b_pretrain/slurm_pretrain.sh

Fine-tuning

Most of our fine-tuning experiments are conducted using LoRA on 4 or 8 A100 GPUs. You can use the following scripts for full-parameter or LoRA fine-tuning. For real-world experiments with small datasets, we recommend LoRA fine-tuning.

# full fine-tuning
bash scripts/spatialvla_4b_finetune/finetune_full.sh

# LoRA fine-tuning
bash scripts/spatialvla_4b_finetune/finetune_lora.sh
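If you just want to prototype parameter-efficient fine-tuning outside the provided scripts, a minimal sketch with HuggingFace PEFT might look like the following. The LoRA rank, alpha, and target module names are illustrative assumptions, not the settings used in finetune_lora.sh; consult that script for the actual configuration.

import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModel, AutoProcessor

model_name_or_path = "IPEC-COMMUNITY/spatialvla-4b-224-pt"
processor = AutoProcessor.from_pretrained(model_name_or_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name_or_path, trust_remote_code=True, torch_dtype=torch.bfloat16
)

# Wrap the attention projections with LoRA adapters; rank/alpha/target modules
# here are assumptions for illustration, not the official hyper-parameters.
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA parameters remain trainable

# From here, plug `model` into your own training loop or a transformers Trainer
# on your demonstration dataset.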

Evaluation

SimplerEnv evaluation on Google Robot tasks (VM = Visual Matching, VA = Variant Aggregation).

| Model | Pick Coke Can (VM) | Move Near (VM) | Open/Close Drawer (VM) | #Average (VM) | Pick Coke Can (VA) | Move Near (VA) | Open/Close Drawer (VA) | #Average (VA) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RT-1 (Begin) | 2.7% | 5.0% | 13.9% | 6.8% | 2.2% | 4.0% | 6.9% | 4.2% |
| RT-1 (15%) | 71.0% | 35.4% | 56.5% | 60.2% | 81.3% | 44.6% | 26.7% | 56.2% |
| RT-1 (Converged) | 85.7% | 44.2% | 73.0% | 74.6% | 89.8% | 50.0% | 32.3% | 63.3% |
| HPT | 56.0% | 60.0% | 24.0% | 46.0% | -- | -- | 31.0% | 45.0% |
| TraceVLA | 28.0% | 53.7% | 57.0% | 42.0% | 60.0% | 56.4% | 29.4% | 39.6% |
| RT-1-X | 56.7% | 31.7% | 59.7% | 53.4% | 49.0% | 32.3% | 35.3% | 64.3% |
| RT-2-X | 78.7% | 77.9% | 25.0% | 60.7% | 82.3% | 79.2% | -- | -- |
| Octo-Base | 17.0% | 4.2% | 22.7% | 16.8% | 0.6% | 3.1% | 1.1% | 1.1% |
| OpenVLA | 16.3% | 46.2% | 35.6% | 27.7% | 54.5% | 47.7% | 17.7% | 39.8% |
| RoboVLM (zero-shot) | 72.7% | 66.3% | 26.8% | 56.3% | 68.3% | 56.0% | 8.5% | 46.3% |
| RoboVLM (fine-tuning) | 77.3% | 61.7% | 43.5% | 63.4% | 75.6% | 60.0% | 10.6% | 51.3% |
| SpatialVLA (zero-shot) | 81.0% | 69.6% | 59.3% | 71.9% | 89.5% | 71.7% | 36.2% | 68.8% |
| SpatialVLA (fine-tuning) | 86.0% | 77.9% | 57.4% | 75.1% | 88.0% | 72.7% | 41.8% | 70.7% |
SimplerEnv evaluation on WidowX Robot tasks.

| Model | Put Spoon on Towel (Grasp Spoon) | Put Spoon on Towel (Success) | Put Carrot on Plate (Grasp Carrot) | Put Carrot on Plate (Success) | Stack Green Block on Yellow Block (Grasp Green Block) | Stack Green Block on Yellow Block (Success) | Put Eggplant in Yellow Basket (Grasp Eggplant) | Put Eggplant in Yellow Basket (Success) | #Overall Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RT-1-X | 16.7% | 0.0% | 20.8% | 4.2% | 8.3% | 0.0% | 0.0% | 0.0% | 1.1% |
| Octo-Base | 34.7% | 12.5% | 52.8% | 8.3% | 31.9% | 0.0% | 66.7% | 43.1% | 16.0% |
| Octo-Small | 77.8% | 47.2% | 27.8% | 9.7% | 40.3% | 4.2% | 87.5% | 56.9% | 30.0% |
| OpenVLA | 4.1% | 0.0% | 33.3% | 0.0% | 12.5% | 0.0% | 8.3% | 4.1% | 1.0% |
| RoboVLM (zero-shot) | 37.5% | 20.8% | 33.3% | 25.0% | 8.3% | 8.3% | 0.0% | 0.0% | 13.5% |
| RoboVLM (fine-tuning) | 54.2% | 29.2% | 25.0% | 25.0% | 45.8% | 12.5% | 58.3% | 58.3% | 31.3% |
| SpatialVLA (zero-shot) | 25.0% | 20.8% | 41.7% | 20.8% | 58.3% | 25.0% | 79.2% | 70.8% | 34.4% |
| SpatialVLA (fine-tuning) | 20.8% | 16.7% | 29.2% | 25.0% | 62.5% | 29.2% | 100.0% | 100.0% | 42.7% |
LIBERO Simulation Benchmark Results.

| Model | LIBERO-Spatial SR (↑) | LIBERO-Spatial Rank (↓) | LIBERO-Object SR (↑) | LIBERO-Object Rank (↓) | LIBERO-Goal SR (↑) | LIBERO-Goal Rank (↓) | LIBERO-Long SR (↑) | LIBERO-Long Rank (↓) | Average SR (↑) | Average Rank (↓) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Diffusion Policy from scratch | 78.3 ± 1.1% | 5 | 92.5 ± 0.7% | 1 | 68.3 ± 1.2% | 5 | 50.5 ± 1.3% | 5 | 72.4 ± 0.7% | 5 |
| Octo fine-tuned | 78.9 ± 1.0% | 4 | 85.7 ± 0.9% | 4 | 84.6 ± 0.9% | 1 | 51.1 ± 1.3% | 4 | 75.1 ± 0.6% | 3 |
| OpenVLA fine-tuned | 84.7 ± 0.9% | 2 | 88.4 ± 0.8% | 3 | 79.2 ± 1.0% | 2 | 53.7 ± 1.3% | 3 | 76.5 ± 0.6% | 2 |
| TraceVLA fine-tuned | 84.6 ± 0.2% | 3 | 85.2 ± 0.4% | 5 | 75.1 ± 0.3% | 4 | 54.1 ± 1.0% | 2 | 74.8 ± 0.5% | 4 |
| SpatialVLA fine-tuned | 88.2 ± 0.5% | 1 | 89.9 ± 0.7% | 2 | 78.6 ± 0.6% | 3 | 55.5 ± 1.0% | 1 | 78.1 ± 0.7% | 1 |
Additional evaluations (reported as figures in the paper and on the project page):
- Zero-shot Robot Control Evaluation on WidowX Robot.
- Spatial Understanding Capability Evaluation.
- Adapting to New Robot Setups on Franka Robot.

Citation

BibTeX:

@misc{qu2025spatialvlaexploringspatialrepresentations,
      title={SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model}, 
      author={Delin Qu and Haoming Song and Qizhi Chen and Yuanqi Yao and Xinyi Ye and Yan Ding and Zhigang Wang and JiaYuan Gu and Bin Zhao and Dong Wang and Xuelong Li},
      year={2025},
      eprint={2501.15830},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2501.15830}, 
}