BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation

[paper] [model] [code]

Open Source Plan

  • ✅ Paper, Pre-trained VLM and evaluation code.
  • 🧭 Fine-tuned VLA models, pre-training and fine-tuning code.
  • 🧭 Pre-trained VLA.

Evaluation on VQA

We use the lmms-eval toolkit to conduct evaluations on VQA tasks. We provide a modified transformers repo in which modeling_llava.py and modeling_siglip.py are adapted to support W1.58-A8 quantization.
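
The sketch below illustrates the general idea of W1.58-A8 quantization in the BitNet b1.58 style (ternary weights with an absmean scale, 8-bit per-token activations). Function and variable names are hypothetical and are not the exact ones used in the modified modeling_llava.py / modeling_siglip.py:

import torch

def weight_quant_ternary(w: torch.Tensor):
    # Absmean scaling: weights are rounded to {-1, 0, +1} and rescaled.
    scale = w.abs().mean().clamp(min=1e-5)
    w_q = (w / scale).round().clamp(-1, 1)
    return w_q, scale

def activation_quant_int8(x: torch.Tensor):
    # Per-token absmax quantization of activations to 8 bits.
    scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp(min=1e-5)
    x_q = (x * scale).round().clamp(-128, 127)
    return x_q, scale

def bitlinear_forward(x: torch.Tensor, w: torch.Tensor):
    # Online ("fake") quantization: master weights stay in bf16 and the
    # quantized values are recomputed on the fly in the forward pass.
    w_q, w_scale = weight_quant_ternary(w)
    x_q, x_scale = activation_quant_int8(x)
    return (x_q @ w_q.t()) * w_scale / x_scale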

The evaluation should be run inside the NVIDIA PyTorch 24.07 Docker container. Set up the container and install the packages:

docker run --name nvidia_24_07  --privileged --net=host --ipc=host --gpus=all -v /mnt:/mnt -v /tmp:/tmp -d nvcr.io/nvidia/pytorch:24.07-py3 sleep infinity # only use for multimodal evaluation
docker exec -it nvidia_24_07 bash
git clone https://github.com/ustcwhy/BitVLA.git
cd BitVLA/
bash vl_eval_setup.sh # only use for multimodal evaluation

First, download the BitVLA models from Hugging Face:

git clone https://huggingface.co/hongyuw/bitvla-bitsiglipL-224px-bf16 # BitVLA w/ W1.58-A8 SigLIP-L
git clone https://huggingface.co/hongyuw/bitvla-siglipL-224px-bf16 # BitVLA w/ BF16 SigLIP-L
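
As a minimal sketch, the downloaded checkpoints can also be loaded directly, assuming the modified transformers fork installed by vl_eval_setup.sh. The model class here is an assumption based on the modified modeling_llava.py; the evaluation scripts below handle loading for you:

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_path = "./bitvla-bitsiglipL-224px-bf16"  # local clone from above
model = LlavaForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.bfloat16
).to("cuda")
processor = AutoProcessor.from_pretrained(model_path)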

Then run the following scripts to conduct evaluations:

cd lmms-eval/
bash eval-dense-hf.sh /YOUR_PATH_TO_EXP/bitvla-bitsiglipL-224px-bf16
bash eval-dense-hf.sh /YOUR_PATH_TO_EXP/bitvla-siglipL-224px-bf16

Note that we release the master weights of BitVLA and perform online quantization during inference. For actual memory savings, you may quantize the weights offline to 1.58-bit precision. We recommend using the bitnet.cpp inference framework to accurately measure the reduction in inference cost.
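
A hypothetical sketch of such offline quantization: store each linear layer's weights as ternary int8 values plus a single scale instead of bf16 master weights. Realizing the full savings additionally requires bit-packing several ternary values per byte, which bitnet.cpp implements:

import torch

def quantize_offline(w: torch.Tensor):
    scale = w.abs().mean().clamp(min=1e-5)        # absmean scale, as in online quantization
    w_ternary = (w / scale).round().clamp(-1, 1)  # values in {-1, 0, +1}
    return w_ternary.to(torch.int8), scale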

Acknowledgement

This repository is built upon lmms-eval and Hugging Face's transformers.

Citation

If you find this repository useful, please consider citing our work:

@article{bitvla,
  title={BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation}, 
  author={Hongyu Wang and Chuyan Xiong and Ruiping Wang and Xilin Chen},
  year={2025},
  eprint={2506.07530},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
}

License

This project is licensed under the MIT License.

Contact Information

For help or issues using the models, please submit a GitHub issue.
