VLA-Adapter-Lite → GR00T G1 (BridgeAttention Head)
VLA-Adapter-Lite is a small, trainable BridgeAttention policy head that maps vision + language + state → action for the NVIDIA GR00T Teleop G1 humanoid dataset (43-D state and action vectors).
The vision (SigLIP) and language (Qwen) towers are frozen; only this adapter is trained.
This repo contains only the policy head weights and code. At inference, you load the frozen backbones from their own model hubs.
✨ What's inside
- `adapter.pt` / `adapter.safetensors` – PyTorch state dict for the policy head
- `policy_definition.py` – the `BridgeAttentionPolicy` class
- `config.json` – dimensions & training config (base-model IDs, dims, etc.)
Backbones (frozen at inference & training):
- Vision: `google/siglip-base-patch16-224`
- Language: `Qwen/Qwen2.5-0.5B-Instruct`
Target (GR00T G1):
- State: 43-D
- Action: 43-D
- Includes brief language prompts and videos per episode
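
For orientation, the keys that `config.json` is expected to expose (as read in the Quickstart below) look roughly like the sketch here; only the backbone IDs and the 43-D state/action dims are fixed by this card, and the remaining numbers are illustrative placeholders rather than the trained head's actual values:

```python
# Illustrative sketch of the config.json schema consumed by the Quickstart.
# The hyperparameter values below are placeholders, not the trained configuration.
example_cfg = {
    "vision_model_id": "google/siglip-base-patch16-224",
    "text_model_id": "Qwen/Qwen2.5-0.5B-Instruct",
    "state_dim": 43,            # GR00T G1 proprioceptive state
    "action_dim": 43,           # GR00T G1 action vector
    "policy_dim": 512,          # width of the BridgeAttention head (placeholder)
    "n_heads": 8,               # attention heads (placeholder)
    "policy_layers": 4,         # number of BridgeAttention layers (placeholder)
    "num_action_queries": 1,    # learned action queries (placeholder)
    "dropout": 0.1,             # (placeholder)
}
```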
🚀 Quickstart
```python
import json, torch
from PIL import Image
from transformers import SiglipVisionModel, SiglipImageProcessor, AutoTokenizer, AutoModelForCausalLM
from policy_definition import BridgeAttentionPolicy

# Load config & frozen backbones
with open("config.json") as f:
    cfg = json.load(f)
vision_model_id = cfg["vision_model_id"]
text_model_id = cfg["text_model_id"]

image_processor = SiglipImageProcessor.from_pretrained(vision_model_id)
vision = SiglipVisionModel.from_pretrained(vision_model_id, output_hidden_states=True).eval()

tokenizer = AutoTokenizer.from_pretrained(text_model_id, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
text = AutoModelForCausalLM.from_pretrained(text_model_id, output_hidden_states=True).eval()

# Build the policy head and load its weights
v_hidden = vision.config.hidden_size
t_hidden = text.config.hidden_size
policy = BridgeAttentionPolicy(
    v_hidden=v_hidden, t_hidden=t_hidden,
    state_dim=cfg["state_dim"], policy_dim=cfg["policy_dim"],
    n_heads=cfg["n_heads"], n_layers=cfg["policy_layers"],
    n_queries=cfg["num_action_queries"], action_dim=cfg["action_dim"],
    dropout=cfg["dropout"],
).eval()
sd = torch.load("adapter.pt", map_location="cpu")
policy.load_state_dict(sd, strict=True)

# ---- Example forward (single sample) ----
instruction = "Pick the apple from the table and place it into the basket."
state = torch.zeros(1, cfg["state_dim"])  # shape [1, 43]; replace with real proprioceptive state

# Vision: last 4 hidden states (first token dropped), as a list of tensors
img = Image.new("RGB", (256, 256), color=(200, 230, 255))  # replace with a real camera frame
v_inputs = image_processor(images=[img], return_tensors="pt")
with torch.no_grad():
    v_out = vision(**v_inputs, output_hidden_states=True)
v_feats_layers = [
    t[:, 1:, :].contiguous() if t.shape[1] >= 2 else t.contiguous()
    for t in v_out.hidden_states[-4:]
]

# Language: last 4 hidden states
t_inputs = tokenizer([instruction], return_tensors="pt", padding=True, truncation=True, max_length=64)
with torch.no_grad():
    t_out = text(**t_inputs, output_hidden_states=True)
t_feats_layers = [t.contiguous() for t in t_out.hidden_states[-4:]]

with torch.no_grad():
    action = policy(v_feats_layers, t_feats_layers, state)  # [1, 43]
print("Pred action:", action.shape)
```
Evals
- Eval split: 3 episodes × 64 frames from each task folder of nvidia/PhysicalAI-Robotics-GR00T-Teleop-G1 (768 frames total)
- Protocol: offline action reconstruction. For each frame we feed the ego-view image + instruction + 43-D state into the adapter and compare the predicted 43-D action against the teleop ground truth (MSE / MAE).
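
The reported numbers are plain MSE / MAE over the 43 action dimensions, computed overall and per joint group. A minimal sketch of that aggregation follows; the joint-group index ranges are illustrative assumptions, not taken from the dataset card:

```python
import torch

# Hypothetical joint-group slices over the 43-D action vector; the actual index
# layout is defined by the GR00T G1 dataset card and may differ.
SEGMENTS = {
    "left_leg": slice(0, 6),
    "right_leg": slice(6, 12),
    "waist": slice(12, 15),
    "left_arm": slice(15, 22),
    "left_hand": slice(22, 29),
    "right_arm": slice(29, 36),
    "right_hand": slice(36, 43),
}

def action_errors(pred: torch.Tensor, gt: torch.Tensor) -> dict:
    """pred, gt: [N, 43] predicted vs. teleop ground-truth actions."""
    err = pred - gt
    metrics = {"overall": {"mse": err.pow(2).mean().item(), "mae": err.abs().mean().item()}}
    for name, idx in SEGMENTS.items():
        seg = err[:, idx]
        metrics[name] = {"mse": seg.pow(2).mean().item(), "mae": seg.abs().mean().item()}
    return metrics
```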
Aggregate Metrics
- Overall MSE: 0.0622
- Overall MAE: 0.118
- Frames evaluated: 768
Overall per-joint-group error
Segment | MSE | MAE |
---|---|---|
left_leg | 0.0040 | 0.049 |
right_leg | 0.0055 | 0.047 |
waist | 0.0002 | 0.013 |
left_arm | 0.0455 | 0.157 |
left_hand | 0.1253 | 0.156 |
right_arm | 0.0878 | 0.184 |
right_hand | 0.1154 | 0.143 |
Per-Task Breakdown
Dataset | Samples | MSE | MAE | Arms MSE | Hands MSE |
---|---|---|---|---|---|
g1-pick-apple | 192 | 0.0399 | 0.087 | 0.0362 | 0.0850 |
g1-pick-pear | 192 | 0.0817 | 0.146 | 0.0645 | 0.1808 |
g1-pick-grapes | 192 | 0.0801 | 0.136 | 0.1249 | 0.1175 |
g1-pick-starfruit | 192 | 0.0473 | 0.105 | 0.0411 | 0.0981 |
g1-pick-apple segment error
Segment | MSE | MAE |
---|---|---|
left_leg | 0.0011 | 0.027 |
right_leg | 0.0016 | 0.028 |
waist | 0.0002 | 0.012 |
left_arm | 0.0610 | 0.177 |
left_hand | 0.1664 | 0.202 |
right_arm | 0.0113 | 0.083 |
right_hand | 0.0037 | 0.020 |
g1-pick-pear segment error
Segment | MSE | MAE |
---|---|---|
left_leg | 0.0069 | 0.071 |
right_leg | 0.0061 | 0.057 |
waist | 0.0001 | 0.010 |
left_arm | 0.0374 | 0.153 |
left_hand | 0.1331 | 0.165 |
right_arm | 0.0915 | 0.203 |
right_hand | 0.2285 | 0.262 |
g1-pick-grapes segment error
Segment | MSE | MAE |
---|---|---|
left_leg | 0.0030 | 0.045 |
right_leg | 0.0052 | 0.045 |
waist | 0.0002 | 0.012 |
left_arm | 0.0251 | 0.123 |
left_hand | 0.0058 | 0.022 |
right_arm | 0.2246 | 0.335 |
right_hand | 0.2292 | 0.273 |
g1-pick-starfruit segment error
Segment | MSE | MAE |
---|---|---|
left_leg | 0.0051 | 0.053 |
right_leg | 0.0092 | 0.058 |
waist | 0.0004 | 0.019 |
left_arm | 0.0584 | 0.177 |
left_hand | 0.1959 | 0.235 |
right_arm | 0.0238 | 0.114 |
right_hand | 0.0003 | 0.014 |
More evals coming soon.
📚 References
Core
- Wang, Y. et al. (2025). VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model. arXiv:2509.09372. [paper] · [project]
- Kim, M. J., Finn, C., Liang, P. (2025). Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success (OpenVLA-OFT). arXiv:2502.19645. [paper] · [site]
- Kim, M. J. et al. (2024). OpenVLA: An Open-Source Vision-Language-Action Model. arXiv:2406.09246. [paper]
Backbones & Dataset
- Zhai, X. et al. (2023). Sigmoid Loss for Language-Image Pre-Training (SigLIP). arXiv:2303.15343. [paper]
- Yang, A. et al. (2024/2025). Qwen2.5 Technical Report. arXiv:2412.15115. [paper]
- NVIDIA Physical AI (2025). PhysicalAI-Robotics-GR00T-Teleop-G1 (Humanoid teleop dataset). [dataset card]
Related Benchmarks / Corpora
- Liu, B. et al. (2023). LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning. arXiv:2306.03310. [paper]
- Walke, H. et al. (2023). BridgeData V2: A Dataset for Robot Learning at Scale. arXiv:2308.12952. [paper]
BibTeX
```bibtex
@article{wang2025vlaadapter,
title={VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model},
author={Wang, Yihao and Ding, Pengxiang and Li, Lingxiao and Cui, Can and Ge, Zirui and Tong, Xinyang and Song, Wenxuan and Zhao, Han and Zhao, Wei and Hou, Pengxu and Huang, Siteng and Tang, Yifan and Wang, Wenhui and Zhang, Ru and Liu, Jianyi and Wang, Donglin},
journal={arXiv preprint arXiv:2509.09372},
year={2025}
}
@article{kim2025oft,
title={Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success},
author={Kim, Moo Jin and Finn, Chelsea and Liang, Percy},
journal={arXiv preprint arXiv:2502.19645},
year={2025}
}
@article{kim2024openvla,
title={OpenVLA: An Open-Source Vision-Language-Action Model},
author={Kim, Moo Jin and others},
journal={arXiv preprint arXiv:2406.09246},
year={2024}
}
@article{zhai2023siglip,
title={Sigmoid Loss for Language-Image Pre-Training},
author={Zhai, Xiaohua and others},
journal={arXiv preprint arXiv:2303.15343},
year={2023}
}
@article{yang2024qwen25,
title={Qwen2.5 Technical Report},
author={Yang, An and others},
journal={arXiv preprint arXiv:2412.15115},
year={2024}
}
@dataset{nvidia2025gr00t,
title={PhysicalAI-Robotics-GR00T-Teleop-G1},
author={NVIDIA Physical AI},
year={2025},
howpublished={Hugging Face dataset card},
url={https://huggingface.co/datasets/nvidia/PhysicalAI-Robotics-GR00T-Teleop-G1}
}
@article{liu2023libero,
title={LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning},
author={Liu, Bo and others},
journal={arXiv preprint arXiv:2306.03310},
year={2023}
}
@article{walke2023bridgedatav2,
title={BridgeData V2: A Dataset for Robot Learning at Scale},
author={Walke, Homer and others},
journal={arXiv preprint arXiv:2308.12952},
year={2023}
}
```