Optimus-3: Towards Generalist Multimodal Minecraft Agents with Scalable Task Experts
Abstract
Optimus-3, a multimodal large language model agent, uses knowledge-enhanced data generation, a Mixture-of-Experts architecture, and multimodal reasoning-augmented reinforcement learning to achieve superior performance across various tasks in Minecraft.
Recently, agents based on multimodal large language models (MLLMs) have achieved remarkable progress across various domains. However, building a generalist agent with capabilities such as perception, planning, action, grounding, and reflection in open-world environments like Minecraft remains challenging due to insufficient domain-specific data, interference among heterogeneous tasks, and visual diversity in open-world settings. In this paper, we address these challenges through three key contributions. 1) We propose a knowledge-enhanced data generation pipeline to provide scalable and high-quality training data for agent development. 2) To mitigate interference among heterogeneous tasks, we introduce a Mixture-of-Experts (MoE) architecture with task-level routing. 3) We develop a Multimodal Reasoning-Augmented Reinforcement Learning approach to enhance the agent's reasoning ability under the visual diversity of Minecraft. Built upon these innovations, we present Optimus-3, a general-purpose agent for Minecraft. Extensive experimental results demonstrate that Optimus-3 surpasses both generalist multimodal large language models and existing state-of-the-art agents across a wide range of tasks in the Minecraft environment. Project page: https://cybertronagent.github.io/Optimus-3.github.io/
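The abstract describes task-level routing only at a high level. As a rough illustration (not the paper's implementation), a Mixture-of-Experts layer with task-level routing could dispatch an entire input to one expert selected by a task label, rather than gating per token; all module names and the task set below are assumptions for the sketch.

```python
# Minimal sketch, assuming one feed-forward expert per task and routing by a
# task label (e.g., planning, perception, grounding, reflection, action).
# This is illustrative only and not the authors' code.
import torch
import torch.nn as nn

class TaskRoutedMoE(nn.Module):
    def __init__(self, hidden_dim: int, ffn_dim: int, tasks: list[str]):
        super().__init__()
        self.task_to_idx = {t: i for i, t in enumerate(tasks)}
        # One feed-forward expert per task; a shared expert could also be added.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_dim, ffn_dim),
                nn.GELU(),
                nn.Linear(ffn_dim, hidden_dim),
            )
            for _ in tasks
        ])

    def forward(self, hidden_states: torch.Tensor, task: str) -> torch.Tensor:
        # Task-level routing: the whole sequence goes through the expert chosen
        # from the task label, so heterogeneous tasks do not share (and thus do
        # not interfere in) the same FFN parameters.
        expert = self.experts[self.task_to_idx[task]]
        return expert(hidden_states)

# Usage: route a batch of hidden states through the "planning" expert.
layer = TaskRoutedMoE(
    hidden_dim=1024, ffn_dim=4096,
    tasks=["planning", "perception", "grounding", "reflection", "action"],
)
x = torch.randn(2, 16, 1024)   # (batch, sequence, hidden)
y = layer(x, task="planning")  # same shape as x
```

Compared with token-level gating, routing on the task label keeps expert selection deterministic per task, which is one simple way to reduce cross-task interference during multi-task training.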
Community
A generalist multimodal agent in Minecraft. Project page: https://cybertronagent.github.io/Optimus-3.github.io/
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- WeThink: Toward General-purpose Vision-Language Reasoning via Reinforcement Learning (2025)
- Advancing Multimodal Reasoning Capabilities of Multimodal Large Language Models via Visual Perception Reward (2025)
- LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models (2025)
- Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning (2025)
- VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning (2025)
- UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning (2025)
- SpaCE-10: A Comprehensive Benchmark for Multimodal Large Language Models in Compositional Spatial Intelligence (2025)