Introduction

JarvisVLA-Qwen2-VL-7B is a Vision-Language-Action (VLA) model tailored for the open-world game Minecraft. Given human language instructions, it can carry out thousands of in-game skills across Minecraft’s open world.

Citation


@article{li2025jarvisvla,
  title   = {JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse},
  author  = {Muyao Li and Zihao Wang and Kaichen He and Xiaojian Ma and Yitao Liang},
  year    = {2025}
}

Model size: 8.29B params (Safetensors)
Tensor type: FP16

Model tree for CraftJarvis/JarvisVLA-Qwen2-VL-7B

Base model: Qwen/Qwen2-VL-7B (this model is a fine-tune)
