
RoboBrain-X0: A Unified Cross-Embodiment Vision-Language-Action Model for Token Reasoning and Action Generation.

  β­οΈ Project   |   πŸ€— Hugging Face   |   πŸ€– ModelScope  

  πŸš€ RoboBrain 2.0: See Better. Think Harder. Do Smarter.

  πŸŽ― RoboOS: An Efficient Open-Source Multi-Robot Coordination System for RoboBrain.

  β­οΈ Reason-RFT: Core Post-Training Strategy for Embodied Visual Reasoning in RoboBrain2.0.

  πŸŒ RoboBrain 1.0: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete.

πŸ’¬ If you have any questions, feel free to contact us via WeChat or RedNote.

🔥 Overview

We are thrilled to introduce RoboBrain-X0, a groundbreaking cross-embodiment foundation model designed to overcome the limitations of single-robot systems when skills must transfer across heterogeneous embodiments. By representing actions as end-effector poses in SE(3) task space, coupled with a Unified Action Vocabulary (UAV) and an action tokenizer, RoboBrain-X0 achieves efficient zero-shot generalization and complex task decomposition. Its Grouped Residual Vector Quantizer (GRVQ) maps continuous control sequences from robots with diverse degrees of freedom and mechanical structures into a shared discrete action-primitive space, ensuring semantic consistency and transferability across embodiments such as AgileX and R1-Lite dual-arm wheeled robots and Franka single-arm systems. Through diverse, embodiment-conditioned prompting, the model flexibly decodes multi-view RGB-D inputs into embodiment-specific executions, significantly reducing training and inference overhead. RoboBrain-X0 delivers state-of-the-art performance on embodied reasoning tasks, laying a robust foundation for versatile, real-world robotic agents and advancing embodied-intelligence research.
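To make the tokenization idea concrete, here is a minimal, illustrative sketch of a grouped residual vector quantizer in PyTorch. This is not the OmniSAT implementation: the class name, dimensions, group count, residual depth, and codebook size are all assumptions chosen for readability.

import torch
import torch.nn as nn

class GroupedResidualVQ(nn.Module):
    # Illustrative grouped residual vector quantizer (NOT the actual OmniSAT code).
    # Each continuous action vector is split into `groups` sub-vectors, and each
    # sub-vector is quantized by `depth` residual codebooks, yielding discrete
    # token ids drawn from a vocabulary shared across embodiments.
    def __init__(self, dim=32, groups=4, depth=2, codebook_size=256):
        super().__init__()
        assert dim % groups == 0
        self.groups, self.depth, self.group_dim = groups, depth, dim // groups
        # One codebook per (group, residual level): [G, R, K, dim/G].
        self.codebooks = nn.Parameter(
            torch.randn(groups, depth, codebook_size, self.group_dim))

    def forward(self, x):  # x: [B, dim] continuous actions, e.g. SE(3) pose deltas
        residual = x.view(x.shape[0], self.groups, self.group_dim)
        quantized = torch.zeros_like(residual)
        tokens = []
        for r in range(self.depth):
            cb = self.codebooks[:, r]                          # [G, K, dim/G]
            dists = torch.cdist(residual.transpose(0, 1), cb)  # [G, B, K]
            ids = dists.argmin(dim=-1)                         # [G, B]
            picked = torch.gather(
                cb, 1, ids.unsqueeze(-1).expand(-1, -1, self.group_dim))
            picked = picked.transpose(0, 1)                    # [B, G, dim/G]
            quantized, residual = quantized + picked, residual - picked
            tokens.append(ids.transpose(0, 1))                 # [B, G]
        return torch.stack(tokens, dim=-1), quantized.view(x.shape[0], -1)

# Example: a batch of 7-DoF arm actions padded into a 32-dim unified space.
vq = GroupedResidualVQ()
token_ids, recon = vq(torch.randn(8, 32))
print(token_ids.shape)  # torch.Size([8, 4, 2]): 4 groups x 2 residual levels

Because every embodiment's actions pass through the same codebooks, the resulting token ids form the shared vocabulary that lets one policy head serve robots with different kinematics.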

πŸ—žοΈ News

📆 Todo

  • Release model checkpoint for RoboBrain-X0-Preview based on Fast Tokenizer
  • Release quick inference example for RoboBrain-X0
  • Release training and evaluation codes for RoboBrain-X0
  • Release OmniSAT-based RoboBrain-X0 full version (coming soon)

🚀 Features

RoboBrain-X0 supports unified modeling of heterogeneous embodiments and offers zero-shot generalization and complex task decomposition capabilities. Building on RoboBrain's multimodal foundation, RoboBrain-X0 augments the RoboBrain 2.0 training data with real-world robot motion data. By unifying vision, language, and motion modeling, it achieves cross-embodiment generalization and adaptation, providing integrated capabilities from perception to execution.

⭐️ Architecture

The model combines RoboBrain 2.0 with OmniSAT, our action tokenizer. Starting from RoboBrain 2.0, it is trained on large amounts of real-world robotics data and embodied reasoning data, equipping it with general robotic manipulation capabilities. The action token sequences output by the model are converted into low-level robot control signals by our proprietary action tokenizer. Model details are as follows:

  • Multimodal Input: The model accepts single-image, multi-image, and text inputs (covering pointing, object affordance, trajectory, and subtask-execution scenarios) and produces outputs of varying dimensions depending on the input scenario.
  • Action Generation and Execution: After model processing, OmniSAT converts the predicted action tokens into multi-degree-of-freedom (DoF) action sequences, ultimately driving the robot to complete the operation; see the sketch below.
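To illustrate the decode-and-execute step named in the second bullet, here is a hedged Python sketch. The chunk horizon, DoF count, codebook, and send_command hook are hypothetical placeholders; the real OmniSAT decoder is learned and its interface may differ.

import numpy as np

ACTION_HORIZON = 16   # assumed number of future control steps per decoded chunk
DOF = 8               # assumed layout: 6-DoF end-effector pose + gripper + base

def detokenize(token_ids, codebook):
    # Illustrative inverse tokenizer: look each token id up in a shared
    # codebook, then reshape the concatenated codes into a [horizon, DoF] chunk.
    flat = codebook[np.asarray(token_ids)]     # [n_tokens, code_dim]
    return flat.reshape(ACTION_HORIZON, DOF)   # needs n_tokens * code_dim == H * DoF

def execute(chunk, send_command):
    # Stream each decoded step to the low-level controller at the control rate.
    for step in chunk:
        send_command(step)

# Toy run with a random codebook; in practice the codebook is learned.
codebook = np.random.randn(256, 4)
ids = np.random.randint(0, 256, size=ACTION_HORIZON * DOF // 4)  # 32 token ids
execute(detokenize(ids, codebook), send_command=lambda step: None)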

🤗 Model Zoo

| Models               | Checkpoint                          | Description                        |
|----------------------|-------------------------------------|------------------------------------|
| RoboBrain-X0-Preview | 🤗 BAAI/RoboBrain-X0-Preview        | Preview version of RoboBrain-X0    |
| RoboBrain-X0-FlagOS  | 🤗 FlagRelease/RoboBrain-X0-FlagOS  | Multi-chip version of RoboBrain-X0 |
| RoboBrain-X0-Dataset | 🤗 BAAI/RoboBrain-X0-Dataset (9.30) | Training dataset of RoboBrain-X0   |
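Below is a minimal download-and-load sketch for the preview checkpoint. The auto classes and the trust_remote_code flag are assumptions (the training config later in this card references a Qwen2.5-VL backbone); consult the RoboBrain-X0 GitHub for the officially supported loading path.

import torch
from huggingface_hub import snapshot_download
from transformers import AutoProcessor, AutoModelForVision2Seq

# Fetch the checkpoint once and reuse the local copy afterwards.
local_dir = snapshot_download("BAAI/RoboBrain-X0-Preview")

processor = AutoProcessor.from_pretrained(local_dir, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    local_dir,
    torch_dtype=torch.bfloat16,  # the published weights are BF16
    trust_remote_code=True,
)
print(type(model).__name__)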

πŸ› οΈ Setup

# Pull Docker Image.
docker pull ghcr.io/robobrain-roboos-robotic/robotics_pretrain_flagscale:cuda12.4.1-cudnn9.5.0-python3.12-torch2.6.0-time250928-ssh

# Run Container.
docker run -itd \
  --name robotics_pretrain \
  --privileged \
  --gpus all \
  --net=host \
  --ipc=host \
  --device=/dev/infiniband \
  --shm-size 512g \
  --ulimit memlock=-1 \
  -v /nfs/hcr/models/:/models \
  ghcr.io/robobrain-roboos-robotic/robotics_pretrain_flagscale:cuda12.4.1-cudnn9.5.0-python3.12-torch2.6.0-time250928-ssh

🤖 Training

cd /root/robotics_pretrain/flag-scale
conda activate flagscale-train

python run.py \
  --config-path ./examples/qwen2_5_vl/conf \
  --config-name train_3b_action_S6_subtask_agilex_eval5_demo \
  action=run

πŸ” Evaluation

Note: Please refer to the RoboBrain-X0 GitHub repository for usage instructions.

📑 Citation

If you find this project useful, please consider citing our work:

@article{RoboBrain1.0,
    title={RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete},
    author={Ji, Yuheng and Tan, Huajie and Shi, Jiayu and Hao, Xiaoshuai and Zhang, Yuan and Zhang, Hengyuan and Wang, Pengwei and Zhao, Mengdi and Mu, Yao and An, Pengju and others},
    journal={arXiv preprint arXiv:2502.21257},
    year={2025}
}

@article{RoboBrain2.0TechnicalReport,
    title={RoboBrain 2.0 Technical Report},
    author={BAAI RoboBrain Team},
    journal={arXiv preprint arXiv:2507.02029},
    year={2025}
}

@article{zhou2025roborefer,
    title={RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics},
    author={Zhou, Enshen and An, Jingkun and Chi, Cheng and Han, Yi and Rong, Shanyu and Zhang, Chi and Wang, Pengwei and Wang, Zhongyuan and Huang, Tiejun and Sheng, Lu and others},
    journal={arXiv preprint arXiv:2506.04308},
    year={2025}
}

@article{Reason-RFT,
    title={Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning},
    author={Tan, Huajie and Ji, Yuheng and Hao, Xiaoshuai and Lin, Minglan and Wang, Pengwei and Wang, Zhongyuan and Zhang, Shanghang},
    journal={arXiv preprint arXiv:2503.20752},
    year={2025}
}