Introduction

JarvisVLA-Qwen2-VL-7B is a Visual-Language-Action (VLA) model specifically tailored for the open-world game Minecraft. Based on human language instructions, JarvisVLA-Qwen2-VL-7B masters thousands of in-game skills, empowering endless creativity and interaction in Minecraft’s expansive universe!

Paper: https://craftjarvis.github.io/JarvisVLA/files/jarvis_vla_draft.pdf
Github: https://github.com/CraftJarvis/JarvisVLA
Project: https://craftjarvis.github.io/JarvisVLA/

Citation


@article{li2025jarvisvla,
  title   = {JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse},
  author  = {Muyao Li and Zihao Wang and Kaichen He and Xiaojian Ma and Yitao Liang},
  year    = {2025}
}

Downloads last month: 624

Safetensors

Model size

8.29B params

Tensor type

F16

Model tree for CraftJarvis/JarvisVLA-Qwen2-VL-7B

Base model

Qwen/Qwen2-VL-7B

Finetuned

(19)

this model

Quantizations

2 models

Dataset used to train CraftJarvis/JarvisVLA-Qwen2-VL-7B

Collection including CraftJarvis/JarvisVLA-Qwen2-VL-7B

JARVIS-VLA-v1

Collection

Vision-Language-Action Models in Minecraft. • 4 items • Updated Mar 22 • 11