---
license: gemma
library_name: transformers
pipeline_tag: visual-question-answering
---

# SPEAR-1 model card

SPEAR-1 is a Vision-Language-Action (VLA) model that achieves performance __superior to or on par with state-of-the-art models such as pi0-FAST and pi0.5__ on multiple embodiments while being trained __on 20x less robot data__.

This model was developed by [INSAIT](https://insait.ai/), a special unit of Sofia University St. Kliment Ohridski, in Sofia, Bulgaria.

Code and model weights for SPEAR-1 models are free to use under the Gemma license.

This repo provides model weights fine-tuned for a Franka setup with one wrist camera and one external camera.

## Model description

The key to SPEAR-1's data efficiency is SPEAR-VLM, a 3D-aware VLM. SPEAR-VLM extends PaliGemma with the MoGe depth encoder and is trained on 3D VQA tasks using primarily non-robot data sources, such as EgoExo-4D.

SPEAR-1's architecture combines SPEAR-VLM with a DiT action expert. It is first pre-trained on a mixture of robot demonstration datasets from Open X Embodiment and then fine-tuned for specific embodiments.

## Use with 🤗 Transformers

We provide a fully `AutoModel`-compatible implementation of SPEAR-1 that can be used via `transformers`.

### Environment setup

The current implementation requires the following additional dependencies: `roma`, `timm`, `flash-attn`.

Here is a snippet to set up a working environment for inference via `uv`:

```
uv venv --python 3.10.12
source .venv/bin/activate
uv pip install --torch-backend=cu126 roma==1.5.0 numpy==2.2.4 torch==2.6.0 torchvision==0.21.0 transformers==4.47.0 timm==1.0.15
uv pip install --no-build-isolation setuptools psutil flash-attn==2.7.3
```

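Before loading the model, it can help to confirm that the extra dependencies listed above are actually importable in the active environment. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical and not part of SPEAR-1, and the module names assume that `flash-attn` is imported as `flash_attn`.

```python
import importlib.util

# Import names for the extra dependencies.
# Assumption: the `flash-attn` package is imported as `flash_attn`.
REQUIRED_MODULES = ("roma", "timm", "flash_attn")


def missing_modules(names=REQUIRED_MODULES):
    """Return the names that cannot be located in the current environment."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing dependencies:", ", ".join(missing))
    else:
        print("All extra dependencies are importable.")
```

This only checks that the modules can be found; it does not verify versions or that `flash-attn` was built against the installed `torch`.
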
### Example usage

```python
from typing import Dict

import numpy as np
import torch
from PIL import Image
from transformers import AutoModel

# Load the model and move it to the GPU in bfloat16 for inference.
model = AutoModel.from_pretrained("INSAIT-Institute/spear1-franka")
model = model.to(dtype=torch.bfloat16, device="cuda").eval()

# Camera observations as NumPy arrays.
main_image = np.asarray(Image.open("path/to/main_image.png"))
wrist_image = np.asarray(Image.open("path/to/wrist_image.png"))

# Current robot state: end-effector pose and gripper state.
ee_translation = np.array([0.36, 0.0, 0.56])
ee_rotation = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
gripper = np.array(1.0)

model_input: Dict[str, np.ndarray | str | Dict[str, np.ndarray]] = {
    "images": {
        "main": main_image,  # (H, W, C)
        "wrist": wrist_image,  # (H, W, C)
    },
    "ee_translation": ee_translation,  # (3,)
    "ee_rotation": ee_rotation,  # (3, 3)
    "gripper": gripper,  # (1,)
    "language_instruction": "put the carrot on the blue plate",
    "dataset_name": "droid",
}

# Predict a chunk of S relative actions.
model_output: Dict[str, np.ndarray] = model.predict_action(model_input)

ctrl_translation: np.ndarray = model_output["translation"]  # (S, 3)
ctrl_rotation: np.ndarray = model_output["rotation"]  # (S, 3, 3)
ctrl_gripper: np.ndarray = model_output["gripper"]  # (S, 1)
```

## Action space

SPEAR-1 predicts action chunks of delta end-effector poses. Each step in the predicted action chunk is relative to the input state, not to the previous step in the chunk.

Given the current end-effector pose `[R, t]` and a model prediction `A_rel = [[R_1, t_1], ..., [R_n, t_n]]`, absolute end-effector pose commands can be computed as:

```
A_abs = [[R * R_1, t + t_1], ..., [R * R_n, t + t_n]]
```

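The composition above can be sketched in NumPy as follows. This is an illustrative sketch, not code from the SPEAR-1 repository: the function name `to_absolute_poses` is hypothetical, and it assumes the relative rotations and translations have the shapes `(S, 3, 3)` and `(S, 3)` reported for the `rotation` and `translation` outputs in the usage example.

```python
import numpy as np


def to_absolute_poses(R, t, rel_rotation, rel_translation):
    """Compose the current pose [R, t] with a chunk of relative actions.

    R: (3, 3) current rotation, t: (3,) current translation.
    rel_rotation: (S, 3, 3), rel_translation: (S, 3).
    Returns absolute rotations (S, 3, 3) and translations (S, 3).
    """
    R = np.asarray(R, dtype=np.float64)
    t = np.asarray(t, dtype=np.float64)
    rel_rotation = np.asarray(rel_rotation, dtype=np.float64)
    rel_translation = np.asarray(rel_translation, dtype=np.float64)

    # R * R_i for every step: matmul broadcasts (3, 3) against (S, 3, 3).
    abs_rotation = R @ rel_rotation
    # t + t_i for every step: (3,) broadcasts against (S, 3).
    abs_translation = t + rel_translation
    return abs_rotation, abs_translation


# Toy data: the current pose is a 90-degree rotation about z.
R = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
t = np.array([0.36, 0.0, 0.56])
rel_rotation = np.stack([np.eye(3), R])  # identity, then another 90 degrees
rel_translation = np.array([[0.01, 0.0, 0.0], [0.02, 0.0, -0.01]])

abs_rotation, abs_translation = to_absolute_poses(R, t, rel_rotation, rel_translation)
```

Because every step is relative to the same input state, each absolute pose is computed independently from `[R, t]`; the relative actions are not accumulated across the chunk.
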
## Community Feedback

We welcome feedback from the community to help improve SPEAR-1. If you have suggestions, encounter any issues, or have ideas for improvements, please contact us.

## Summary

- __Model type__: Vision-Language-Action with flow-matching action decoding
- __Contact__: [email protected]
- __License__: Gemma Terms of Use