SPEAR-1 model card

SPEAR-1 is a Vision-Language-Action (VLA) model that matches or outperforms state-of-the-art models such as pi0-FAST and pi0.5 on multiple embodiments while being trained on 20x less robot data.

This model was developed by INSAIT, a special unit of Sofia University St. Kliment Ohridski, in Sofia, Bulgaria.

Code and model weights for SPEAR-1 models are free to use under the Gemma license.

This repo provides model weights fine-tuned for a Franka setup with one wrist and one external camera.

Model description

The key to SPEAR-1's data efficiency is SPEAR-VLM, a 3D-aware VLM. SPEAR-VLM extends PaliGemma with the MoGe depth encoder and is trained on 3D VQA tasks using primarily non-robot data sources, such as EgoExo-4D.

SPEAR-1's architecture combines SPEAR-VLM with a DiT action expert. The model is first pre-trained on a mixture of robot demonstration datasets from Open X-Embodiment and then fine-tuned for specific embodiments.

Use with 🤗 Transformers

We provide a fully AutoModel-compatible implementation of SPEAR-1 that can be used via transformers.

Environment setup

The current implementation requires the following additional dependencies: roma, timm, flash-attn.

Here is a snippet to set up a working environment for inference via uv:

Install uv:

wget -qO- https://github.com/astral-sh/uv/releases/download/0.7.5/uv-installer.sh | sh

Create virtualenv and resolve the dependencies:

uv venv --python 3.10.12
source .venv/bin/activate
uv pip install --torch-backend=cu126 roma==1.5.0 numpy==2.2.4 torch==2.6.0 torchvision==0.21.0 transformers==4.47.0 timm==1.0.15 
uv pip install --no-build-isolation setuptools psutil flash-attn==2.7.3
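
After installation, it can help to confirm that the environment matches the pinned versions above. The following is a minimal sketch, not part of the SPEAR-1 code base: the helper name `version_mismatches` is hypothetical, and it relies only on the standard-library `importlib.metadata`. Local version suffixes such as `+cu126` are ignored in the comparison.

```python
from importlib import metadata

# Versions copied from the install commands above.
PINNED = {
    "roma": "1.5.0",
    "numpy": "2.2.4",
    "torch": "2.6.0",
    "torchvision": "0.21.0",
    "transformers": "4.47.0",
    "timm": "1.0.15",
    "flash-attn": "2.7.3",
}


def version_mismatches(pinned):
    """Return {package: (wanted, found)} for missing or mismatched packages.

    `found` is None when the package is not installed.
    """
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Drop a local version suffix such as "+cu126" before comparing.
        if found is None or found.split("+")[0] != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in version_mismatches(PINNED).items():
        print(f"{name}: wanted {wanted}, found {found}")
```

An empty result means every pinned package is installed at the expected version.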

Example usage

from typing import Dict

import numpy as np
import torch
from PIL import Image
from transformers import AutoModel

model = AutoModel.from_pretrained("INSAIT-Institute/spear1-franka")
model = model.to(dtype=torch.bfloat16, device="cuda").eval()

main_image = np.asarray(Image.open("path/to/main_image.png"))
wrist_image = np.asarray(Image.open("path/to/wrist_image.png"))

ee_translation = np.array([0.36, 0.0, 0.56])
ee_rotation = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
gripper = np.array(1.0)

model_input: Dict[str, np.ndarray | str | Dict[str, np.ndarray]] = {
    "images": {
      "main": main_image, # (H, W, C)
      "wrist": wrist_image, # (H, W, C)
    },
    "ee_translation": ee_translation, # (3,)
    "ee_rotation": ee_rotation, # (3, 3)
    "gripper": gripper, # (1,)
    "language_instruction": "put the carrot on the blue plate",
    "dataset_name": "droid"
}

model_output: Dict[str, np.ndarray] = model.predict_action(model_input)

ctrl_translation: np.ndarray = model_output["translation"] # (S, 3)
ctrl_rotation: np.ndarray = model_output["rotation"] # (S, 3, 3)
ctrl_gripper: np.ndarray = model_output["gripper"] # (S, 1)
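
Before calling `predict_action`, a quick shape check on the input dictionary can catch common mistakes early. The sketch below is illustrative only: `check_model_input` is a hypothetical helper, not part of the released API. It checks the shapes given in the comments of the example above and additionally assumes that images are 3-channel RGB (a PNG with an alpha channel would need `Image.open(...).convert("RGB")` first) and that `ee_rotation` is a proper rotation matrix.

```python
import numpy as np


def check_model_input(model_input):
    """Raise ValueError if the input does not match the documented shapes.

    Assumptions (not confirmed by the model card): images have 3 channels
    and `ee_rotation` is orthonormal with determinant +1.
    """
    for name, image in model_input["images"].items():
        image = np.asarray(image)
        if image.ndim != 3 or image.shape[-1] != 3:
            raise ValueError(f"image '{name}' must be (H, W, 3), got {image.shape}")

    translation = np.asarray(model_input["ee_translation"])
    if translation.shape != (3,):
        raise ValueError(f"ee_translation must be (3,), got {translation.shape}")

    rotation = np.asarray(model_input["ee_rotation"], dtype=np.float64)
    if rotation.shape != (3, 3):
        raise ValueError(f"ee_rotation must be (3, 3), got {rotation.shape}")
    orthonormal = np.allclose(rotation @ rotation.T, np.eye(3), atol=1e-5)
    if not orthonormal or np.linalg.det(rotation) <= 0:
        raise ValueError("ee_rotation is not a valid rotation matrix")

    # Accept both a scalar array and a (1,) array for the gripper state.
    if np.asarray(model_input["gripper"]).size != 1:
        raise ValueError("gripper must contain exactly one value")

    instruction = model_input["language_instruction"]
    if not isinstance(instruction, str) or not instruction.strip():
        raise ValueError("language_instruction must be a non-empty string")
```

Calling `check_model_input(model_input)` right before `model.predict_action(model_input)` raises a descriptive error instead of failing inside the model.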

Action space

SPEAR-1 predicts action chunks of delta end-effector poses. Each step in the predicted action chunk is relative to the input state, not to the previous step.

Given the current end-effector pose [R, t] and a model prediction A_rel = [[R_1, t_1], ..., [R_n, t_n]], absolute end-effector pose commands can be computed as:

A_abs = [[R * R_1, t + t_1], ..., [R * R_n, t + t_n]]
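
The composition above can be sketched in NumPy as follows. This is a minimal illustration rather than code shipped with the model: `to_absolute_poses` is a hypothetical helper name, and the inputs are assumed to have the shapes shown in the usage example, (S, 3, 3) for rotations and (S, 3) for translations. Following the formula, rotations are composed by matrix multiplication and translations are added directly, without rotating them into the end-effector frame.

```python
import numpy as np


def to_absolute_poses(R, t, rel_rotations, rel_translations):
    """Turn a relative action chunk into absolute end-effector poses.

    R: (3, 3) current rotation, t: (3,) current translation,
    rel_rotations: (S, 3, 3), rel_translations: (S, 3).
    Returns (abs_rotations, abs_translations) with shapes (S, 3, 3), (S, 3).
    """
    R = np.asarray(R, dtype=np.float64)
    t = np.asarray(t, dtype=np.float64)
    rel_rotations = np.asarray(rel_rotations, dtype=np.float64)
    rel_translations = np.asarray(rel_translations, dtype=np.float64)

    # Matmul broadcasts over the chunk dimension: R @ R_i for every step i.
    abs_rotations = R @ rel_rotations
    # Every step is an offset from the same input translation t.
    abs_translations = t + rel_translations
    return abs_rotations, abs_translations


# Example: current pose rotated 90 degrees about z, two-step chunk.
rot_z_90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
current_t = np.array([0.36, 0.0, 0.56])
chunk_rotations = np.stack([np.eye(3), rot_z_90])
chunk_translations = np.array([[0.01, 0.0, 0.0], [0.02, 0.0, 0.0]])

abs_rotations, abs_translations = to_absolute_poses(
    rot_z_90, current_t, chunk_rotations, chunk_translations
)
```

With the outputs of `predict_action`, the call would be `to_absolute_poses(ee_rotation, ee_translation, ctrl_rotation, ctrl_translation)`.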

Community Feedback

We welcome feedback from the community to help improve SPEAR-1. If you have suggestions, encounter any issues, or have ideas for improvements, please contact us.

Summary

Model type: Vision-Language-Action with flow-matching action decoding
Contact: [email protected]
License: Gemma Terms of Use

Model size: 4B params
Tensor type: F32
