BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models
Abstract
BridgeVLA is a 3D vision-language-action model that projects 3D inputs to 2D images and uses 2D heatmaps for action prediction, enabling efficient and effective manipulation learning and outperforming baselines across multiple benchmarks.
Recently, leveraging pre-trained vision-language models (VLMs) for building vision-language-action (VLA) models has emerged as a promising approach to effective robot manipulation learning. However, only a few methods incorporate 3D signals into VLMs for action prediction, and they do not fully leverage the spatial structure inherent in 3D data, leading to low sample efficiency. In this paper, we introduce BridgeVLA, a novel 3D VLA model that (1) projects 3D inputs to multiple 2D images, ensuring input alignment with the VLM backbone, and (2) utilizes 2D heatmaps for action prediction, unifying the input and output spaces within a consistent 2D image space. In addition, we propose a scalable pre-training method that equips the VLM backbone with the capability to predict 2D heatmaps before downstream policy learning. Extensive experiments show that the proposed method learns 3D manipulation efficiently and effectively. BridgeVLA outperforms state-of-the-art baseline methods across three simulation benchmarks. In RLBench, it improves the average success rate from 81.4% to 88.2%. In COLOSSEUM, it demonstrates significantly better performance in challenging generalization settings, boosting the average success rate from 56.7% to 64.0%. In GemBench, it surpasses all competing baseline methods in terms of average success rate. In real-robot experiments, BridgeVLA outperforms a state-of-the-art baseline method by 32% on average. It generalizes robustly in multiple out-of-distribution settings, including visual disturbances and unseen instructions. Remarkably, it achieves a success rate of 96.8% on 10+ tasks with only 3 trajectories per task, highlighting its extraordinary sample efficiency. Project website: https://bridgevla.github.io/
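To make the input-output alignment idea concrete, below is a minimal, illustrative sketch of the pipeline described above: the 3D observation is projected into multiple 2D views, a backbone predicts a 2D heatmap per view, and the heatmap peaks are lifted back into a 3D translation target. Every name and parameter here (render_orthographic_views, predict_heatmaps, the workspace bounds, the image size) is an assumption for illustration, not the released BridgeVLA code or API.

```python
# Illustrative sketch (not the official implementation) of projecting a point
# cloud into 2D views and decoding per-view heatmaps back into a 3D translation.
import numpy as np

WORKSPACE_MIN = np.array([-0.3, -0.3, 0.0])   # assumed workspace bounds (meters)
WORKSPACE_MAX = np.array([0.3, 0.3, 0.6])
IMG_SIZE = 224                                 # assumed resolution of each projected view


def render_orthographic_views(points: np.ndarray) -> dict:
    """Project a point cloud of shape (N, 3) into top/front/side occupancy images."""
    norm = (points - WORKSPACE_MIN) / (WORKSPACE_MAX - WORKSPACE_MIN)  # -> [0, 1]
    pix = np.clip((norm * (IMG_SIZE - 1)).astype(int), 0, IMG_SIZE - 1)
    views = {}
    for name, (u, v) in {"top": (0, 1), "front": (0, 2), "side": (1, 2)}.items():
        img = np.zeros((IMG_SIZE, IMG_SIZE), dtype=np.float32)
        img[pix[:, v], pix[:, u]] = 1.0  # mark occupied pixels in this view
        views[name] = img
    return views


def predict_heatmaps(views: dict, instruction: str) -> dict:
    """Placeholder for the language-conditioned backbone: one 2D heatmap per view."""
    # The real model predicts these with a pre-trained VLM; uniform maps keep the
    # sketch runnable without any trained weights.
    return {name: np.full_like(img, 1.0 / img.size) for name, img in views.items()}


def heatmaps_to_translation(heatmaps: dict) -> np.ndarray:
    """Lift per-view heatmap peaks back to a 3D end-effector translation."""
    coords = {}
    for name, hm in heatmaps.items():
        row, col = np.unravel_index(np.argmax(hm), hm.shape)
        coords[name] = (col / (IMG_SIZE - 1), row / (IMG_SIZE - 1))
    x, y = coords["top"]      # top view encodes (x, y)
    _, z = coords["front"]    # front view's second pixel axis encodes z
    return WORKSPACE_MIN + np.array([x, y, z]) * (WORKSPACE_MAX - WORKSPACE_MIN)


if __name__ == "__main__":
    cloud = np.random.uniform(WORKSPACE_MIN, WORKSPACE_MAX, size=(2048, 3))
    views = render_orthographic_views(cloud)
    heatmaps = predict_heatmaps(views, "put the red block on the plate")
    print("predicted translation:", heatmaps_to_translation(heatmaps))
```

The point of keeping both the input (projected images) and the output (heatmaps) in the same 2D image space, as the abstract describes, is that the VLM backbone's 2D spatial priors transfer directly, which also makes the heatmap pre-training stage a natural fit before downstream policy learning.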
Community
Can we combine 2D VLA generalization with 3D policy efficiency?
Introducing BridgeVLA, a 3D vision-language-action model that bridges pre-trained VLM backbones and 3D VLAs. Reusing VLM weights isn't enough; it needs smarter design.
Results:
· 1st on RLBench, COLOSSEUM, GemBench
· +32% real-world performance over baselines
· 96.8% success with only 3 demo trajectories per task
Code, data, and models are open-source.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- OG-VLA: 3D-Aware Vision Language Action Model via Orthographic Image Generation (2025)
- 3D CAVLA: Leveraging Depth and 3D Context to Generalize Vision Language Action Models for Unseen Tasks (2025)
- Gondola: Grounded Vision Language Planning for Generalizable Robotic Manipulation (2025)
- GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data (2025)
- NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks (2025)
- VTLA: Vision-Tactile-Language-Action Model with Preference Learning for Insertion Manipulation (2025)
- GLOVER++: Unleashing the Potential of Affordance Learning from Human Behaviors for Robotic Manipulation (2025)