VLA-Adapter-Lite – GR00T G1 (BridgeAttention Head)

VLA-Adapter-Lite is a small, trainable BridgeAttention policy head that maps vision + language + state → action for the NVIDIA GR00T Teleop G1 humanoid dataset (43-D state/actions).
The vision (SigLIP) and language (Qwen) towers are frozen; only this adapter is trained.

This repo contains only the policy head weights and code. At inference, you load the frozen backbones from their own model hubs.
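The head consumes per-layer hidden states from both frozen backbones plus the 43-D proprioceptive state and emits a 43-D action; the Quickstart below shows the exact call pattern. For orientation, here is a minimal sketch of what a BridgeAttention-style head with that interface could look like. It is illustrative only: the shipped policy_definition.py is authoritative, and every layer choice below is an assumption.

import torch
import torch.nn as nn

class BridgeAttentionPolicySketch(nn.Module):
    """Illustrative stand-in with the same constructor/forward interface as the Quickstart."""
    def __init__(self, v_hidden, t_hidden, state_dim, policy_dim,
                 n_heads, n_layers, n_queries, action_dim, dropout=0.1):
        super().__init__()
        self.v_proj = nn.Linear(v_hidden, policy_dim)    # project vision tokens
        self.t_proj = nn.Linear(t_hidden, policy_dim)    # project language tokens
        self.s_proj = nn.Linear(state_dim, policy_dim)   # project proprio state
        self.queries = nn.Parameter(torch.randn(n_queries, policy_dim) * 0.02)  # learned action queries
        layer = nn.TransformerDecoderLayer(d_model=policy_dim, nhead=n_heads,
                                           dropout=dropout, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.head = nn.Linear(policy_dim, action_dim)    # action regression head

    def forward(self, v_feats_layers, t_feats_layers, state):
        # Average the selected backbone layers, then concatenate modalities into one memory.
        v = self.v_proj(torch.stack(v_feats_layers).mean(0))          # [B, Nv, D]
        t = self.t_proj(torch.stack(t_feats_layers).mean(0))          # [B, Nt, D]
        s = self.s_proj(state).unsqueeze(1)                           # [B, 1, D]
        memory = torch.cat([v, t, s], dim=1)
        q = self.queries.unsqueeze(0).expand(state.shape[0], -1, -1)  # [B, Q, D]
        h = self.decoder(q, memory)                                   # queries cross-attend to memory
        return self.head(h.mean(dim=1))                               # [B, action_dim]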


✨ What's inside

  • adapter.pt / adapter.safetensors – PyTorch state dict for the policy head
  • policy_definition.py – the BridgeAttentionPolicy class
  • config.json – model dimensions and training configuration (base-model IDs, head dims, etc.)

Backbones (frozen at inference & training):

  • Vision: google/siglip-base-patch16-224
  • Language: Qwen/Qwen2.5-0.5B-Instruct

Target (GR00T G1):

  • State: 43-D
  • Action: 43-D
  • Each episode includes a brief language prompt and video
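config.json ties these pieces together. The snippet below is only a hedged illustration of its shape: the model IDs and the 43-D state/action sizes match this card, while policy_dim, n_heads, policy_layers, num_action_queries, and dropout are placeholder values, so defer to the file shipped in the repo.

{
  "vision_model_id": "google/siglip-base-patch16-224",
  "text_model_id": "Qwen/Qwen2.5-0.5B-Instruct",
  "state_dim": 43,
  "action_dim": 43,
  "policy_dim": 512,
  "n_heads": 8,
  "policy_layers": 4,
  "num_action_queries": 8,
  "dropout": 0.1
}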

🚀 Quickstart

import json, torch
from transformers import SiglipVisionModel, SiglipImageProcessor, AutoTokenizer, AutoModelForCausalLM
from policy_definition import BridgeAttentionPolicy

# Load config & backbones
with open("config.json") as f:
    cfg = json.load(f)

vision_model_id = cfg["vision_model_id"]
text_model_id   = cfg["text_model_id"]

image_processor = SiglipImageProcessor.from_pretrained(vision_model_id)
vision = SiglipVisionModel.from_pretrained(vision_model_id, output_hidden_states=True).eval()
tokenizer = AutoTokenizer.from_pretrained(text_model_id, use_fast=True)
if tokenizer.pad_token is None: tokenizer.pad_token = tokenizer.eos_token
text = AutoModelForCausalLM.from_pretrained(text_model_id, output_hidden_states=True).eval()

# Build policy head and load weights
v_hidden = vision.config.hidden_size
t_hidden = text.config.hidden_size

policy = BridgeAttentionPolicy(
    v_hidden=v_hidden, t_hidden=t_hidden,
    state_dim=cfg["state_dim"], policy_dim=cfg["policy_dim"],
    n_heads=cfg["n_heads"], n_layers=cfg["policy_layers"],
    n_queries=cfg["num_action_queries"], action_dim=cfg["action_dim"],
    dropout=cfg["dropout"]
).eval()

sd = torch.load("adapter.pt", map_location="cpu")
policy.load_state_dict(sd, strict=True)
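
# The same head weights also ship as adapter.safetensors; if the `safetensors`
# package is installed, an equivalent load is:
#   from safetensors.torch import load_file
#   policy.load_state_dict(load_file("adapter.safetensors"), strict=True)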

# ---- Example forward (single sample) ----
from PIL import Image

instruction = "Pick the apple from the table and place it into the basket."
state = torch.zeros(1, cfg["state_dim"])  # shape [1,43]; replace with real proprio

# Vision: last 4 hidden states (first token dropped), as a list of tensors
img = Image.new("RGB", (256, 256), color=(200, 230, 255))  # replace with a real frame
v_inputs = image_processor(images=[img], return_tensors="pt")
with torch.no_grad():
    v_out = vision(**v_inputs, output_hidden_states=True)
v_feats_layers = [t[:, 1:, :].contiguous() if t.shape[1] >= 2 else t.contiguous()
                  for t in v_out.hidden_states[-4:]]

# Language: last 4 hidden states
t_inputs = tokenizer([instruction], return_tensors="pt", padding=True, truncation=True, max_length=64)
with torch.no_grad():
    t_out = text(**t_inputs, output_hidden_states=True)
t_feats_layers = [t.contiguous() for t in t_out.hidden_states[-4:]]

with torch.no_grad():
    action = policy(v_feats_layers, t_feats_layers, state)  # [1,43]
print("Pred action:", action.shape)

Evals

  • Eval split: 3 episodes × 64 frames from each of the four task folders of nvidia/PhysicalAI-Robotics-GR00T-Teleop-G1 (768 frames total)
  • Protocol: offline action reconstruction. For each frame, the ego-view image, instruction, and 43-D state are fed into the adapter and the predicted 43-D action is compared against the teleop ground truth (MSE / MAE).
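Concretely, both metrics are plain elementwise reductions over the stacked prediction and ground-truth tensors. A minimal sketch (function and tensor names here are assumptions, not part of the released code):

import torch

def frame_metrics(pred: torch.Tensor, gt: torch.Tensor):
    # pred, gt: [N, 43] predicted and teleop ground-truth actions for N frames
    err = pred - gt
    mse = err.pow(2).mean().item()  # mean squared error over frames and joints
    mae = err.abs().mean().item()   # mean absolute error over frames and joints
    return mse, mae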

Aggregate Metrics

  • Overall MSE: 0.0622
  • Overall MAE: 0.118
  • Frames evaluated: 768

Overall per-joint-group error

Segment      MSE      MAE
left_leg     0.0040   0.049
right_leg    0.0055   0.047
waist        0.0002   0.013
left_arm     0.0455   0.157
left_hand    0.1253   0.156
right_arm    0.0878   0.184
right_hand   0.1154   0.143
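
The segment rows above come from slicing the 43-D error vector into joint groups. The exact index ranges are defined by the dataset's modality configuration; the 6 + 6 + 3 + 7 + 7 + 7 + 7 = 43 split below is an assumption used only to illustrate the bookkeeping.

import torch

# Hypothetical joint-group boundaries; confirm against the GR00T G1 modality config.
SEGMENTS = {
    "left_leg":   slice(0, 6),
    "right_leg":  slice(6, 12),
    "waist":      slice(12, 15),
    "left_arm":   slice(15, 22),
    "left_hand":  slice(22, 29),
    "right_arm":  slice(29, 36),
    "right_hand": slice(36, 43),
}

def segment_metrics(pred: torch.Tensor, gt: torch.Tensor):
    # Per-segment (MSE, MAE) for pred, gt of shape [N, 43]
    out = {}
    for name, sl in SEGMENTS.items():
        err = pred[:, sl] - gt[:, sl]
        out[name] = (err.pow(2).mean().item(), err.abs().mean().item())
    return out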

Per-Task Breakdown

Dataset             Samples   MSE      MAE     Arms MSE   Hands MSE
g1-pick-apple       192       0.0399   0.087   0.0362     0.0850
g1-pick-pear        192       0.0817   0.146   0.0645     0.1808
g1-pick-grapes      192       0.0801   0.136   0.1249     0.1175
g1-pick-starfruit   192       0.0473   0.105   0.0411     0.0981

g1-pick-apple segment error

Segment      MSE      MAE
left_leg     0.0011   0.027
right_leg    0.0016   0.028
waist        0.0002   0.012
left_arm     0.0610   0.177
left_hand    0.1664   0.202
right_arm    0.0113   0.083
right_hand   0.0037   0.020

g1-pick-pear segment error

Segment      MSE      MAE
left_leg     0.0069   0.071
right_leg    0.0061   0.057
waist        0.0001   0.010
left_arm     0.0374   0.153
left_hand    0.1331   0.165
right_arm    0.0915   0.203
right_hand   0.2285   0.262

g1-pick-grapes segment error

Segment      MSE      MAE
left_leg     0.0030   0.045
right_leg    0.0052   0.045
waist        0.0002   0.012
left_arm     0.0251   0.123
left_hand    0.0058   0.022
right_arm    0.2246   0.335
right_hand   0.2292   0.273

g1-pick-starfruit segment error

Segment      MSE      MAE
left_leg     0.0051   0.053
right_leg    0.0092   0.058
waist        0.0004   0.019
left_arm     0.0584   0.177
left_hand    0.1959   0.235
right_arm    0.0238   0.114
right_hand   0.0003   0.014

More evals coming soon.

📚 References

Core

  • Wang, Y. et al. (2025). VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model. arXiv:2509.09372. [paper] · [project]
  • Kim, M. J., Finn, C., Liang, P. (2025). Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success (OpenVLA-OFT). arXiv:2502.19645. [paper] · [site]
  • Kim, M. J. et al. (2024). OpenVLA: An Open-Source Vision-Language-Action Model. arXiv:2406.09246. [paper]

Backbones & Dataset

  • Zhai, X. et al. (2023). Sigmoid Loss for Language-Image Pre-Training (SigLIP). arXiv:2303.15343. [paper]
  • Yang, A. et al. (2024/2025). Qwen2.5 Technical Report. arXiv:2412.15115. [paper]
  • NVIDIA Physical AI (2025). PhysicalAI-Robotics-GR00T-Teleop-G1 (Humanoid teleop dataset). [dataset card]

Related Benchmarks / Corpora

  • Liu, B. et al. (2023). LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning. arXiv:2306.03310. [paper]
  • Walke, H. et al. (2023). BridgeData V2: A Dataset for Robot Learning at Scale. arXiv:2308.12952. [paper]

BibTeX

@article{wang2025vlaadapter,
  title={VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model},
  author={Wang, Yihao and Ding, Pengxiang and Li, Lingxiao and Cui, Can and Ge, Zirui and Tong, Xinyang and Song, Wenxuan and Zhao, Han and Zhao, Wei and Hou, Pengxu and Huang, Siteng and Tang, Yifan and Wang, Wenhui and Zhang, Ru and Liu, Jianyi and Wang, Donglin},
  journal={arXiv preprint arXiv:2509.09372},
  year={2025}
}

@article{kim2025oft,
  title={Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success},
  author={Kim, Moo Jin and Finn, Chelsea and Liang, Percy},
  journal={arXiv preprint arXiv:2502.19645},
  year={2025}
}

@article{kim2024openvla,
  title={OpenVLA: An Open-Source Vision-Language-Action Model},
  author={Kim, Moo Jin and others},
  journal={arXiv preprint arXiv:2406.09246},
  year={2024}
}

@article{zhai2023siglip,
  title={Sigmoid Loss for Language-Image Pre-Training},
  author={Zhai, Xiaohua and others},
  journal={arXiv preprint arXiv:2303.15343},
  year={2023}
}

@article{yang2024qwen25,
  title={Qwen2.5 Technical Report},
  author={Yang, An and others},
  journal={arXiv preprint arXiv:2412.15115},
  year={2024}
}

@dataset{nvidia2025gr00t,
  title={PhysicalAI-Robotics-GR00T-Teleop-G1},
  author={NVIDIA Physical AI},
  year={2025},
  howpublished={Hugging Face dataset card},
  url={https://huggingface.co/datasets/nvidia/PhysicalAI-Robotics-GR00T-Teleop-G1}
}

@article{liu2023libero,
  title={LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning},
  author={Liu, Bo and others},
  journal={arXiv preprint arXiv:2306.03310},
  year={2023}
}

@article{walke2023bridgedatav2,
  title={BridgeData V2: A Dataset for Robot Learning at Scale},
  author={Walke, Homer and others},
  journal={arXiv preprint arXiv:2308.12952},
  year={2023}
}