Mirage-1: Augmenting and Updating GUI Agent with Hierarchical Multimodal Skills
Abstract
Hierarchical Multimodal Skills and Skill-Augmented Monte Carlo Tree Search improve multimodal GUI agent performance in long-horizon tasks by abstracting knowledge and bridging the offline-online domain gap.
Recent efforts to leverage Multimodal Large Language Models (MLLMs) as GUI agents have yielded promising outcomes. However, these agents still struggle with long-horizon tasks in online environments, primarily due to insufficient knowledge and the inherent gap between offline and online domains. In this paper, inspired by how humans generalize knowledge in open-ended environments, we propose a Hierarchical Multimodal Skills (HMS) module to tackle the issue of insufficient knowledge. It progressively abstracts trajectories into execution skills, core skills, and ultimately meta-skills, providing a hierarchical knowledge structure for long-horizon task planning. To bridge the domain gap, we propose the Skill-Augmented Monte Carlo Tree Search (SA-MCTS) algorithm, which efficiently leverages skills acquired in offline environments to reduce the action search space during online tree exploration. Building on HMS, we propose Mirage-1, a multimodal, cross-platform, plug-and-play GUI agent. To validate the performance of Mirage-1 in real-world long-horizon scenarios, we constructed a new benchmark, AndroidLH. Experimental results show that Mirage-1 outperforms previous agents by 32%, 19%, 15%, and 79% on AndroidWorld, MobileMiniWob++, Mind2Web-Live, and AndroidLH, respectively. Project page: https://cybertronagent.github.io/Mirage-1.github.io/
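The abstract only sketches HMS and SA-MCTS at a high level. The snippet below is a minimal, hypothetical Python illustration (the `Skill`, `Node`, `ucb`, and `expand_with_skills` names are our own, not from the paper) of the general idea: skills abstracted offline at different levels are stored in a hierarchical memory, and during online tree search only the actions proposed by goal-relevant skills are expanded, shrinking the branching factor compared with enumerating every raw UI action.

```python
# Hypothetical sketch, not the authors' implementation: a hierarchical skill
# memory and a skill-augmented MCTS expansion step.
from dataclasses import dataclass, field
import math


@dataclass
class Skill:
    name: str
    level: str            # "execution", "core", or "meta", mirroring the HMS hierarchy
    steps: list[str]      # abstracted action sequence distilled from offline trajectories


@dataclass
class Node:
    state: str
    parent: "Node | None" = None
    children: list["Node"] = field(default_factory=list)
    visits: int = 0
    value: float = 0.0


def ucb(child: Node, parent_visits: int, c: float = 1.4) -> float:
    """Standard UCT score used to pick which child to descend into."""
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(math.log(parent_visits) / child.visits)


def expand_with_skills(node: Node, goal: str, skills: list[Skill]) -> list[Node]:
    """Skill-augmented expansion: only steps proposed by skills that appear
    relevant to the goal (here, a naive keyword match standing in for the
    paper's retrieval) become children, reducing the action search space."""
    relevant = [s for s in skills if s.name.lower() in goal.lower()]
    candidates = relevant or skills  # fall back to all stored skills if nothing matches
    for skill in candidates:
        for step in skill.steps:
            node.children.append(Node(state=f"{node.state} -> {step}", parent=node))
    return node.children


if __name__ == "__main__":
    skills = [
        Skill("send email", "core", ["open Gmail", "tap compose", "type recipient"]),
        Skill("set alarm", "execution", ["open Clock", "tap alarm", "set time"]),
    ]
    root = Node(state="home screen")
    children = expand_with_skills(root, goal="send email to Alice", skills=skills)
    print([c.state for c in children])
```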
Community
A training-free GUI agent with strong performance in online environments. Project page: https://cybertronagent.github.io/Mirage-1.github.io/
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Atomic-to-Compositional Generalization for Mobile Agents with A New Benchmark and Scheduling System (2025)
- A Survey on GUI Agents with Foundation Models Enhanced by Reinforcement Learning (2025)
- Guiding VLM Agents with Process Rewards at Inference Time for GUI Navigation (2025)
- GUI-Reflection: Empowering Multimodal GUI Models with Self-Reflection Behavior (2025)
- ScreenExplorer: Training a Vision-Language Model for Diverse Exploration in Open GUI World (2025)
- InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners (2025)
- Automated Skill Discovery for Language Agents through Exploration and Iterative Feedback (2025)