UI-Venus Technical Report: Building High-performance UI Agents with RFT
Abstract
UI-Venus, a multimodal large language model-based UI agent, achieves state-of-the-art performance in UI grounding and navigation tasks using reinforcement fine-tuning and novel self-evolving frameworks.
We present UI-Venus, a native UI agent that takes only screenshots as input based on a multimodal large language model. UI-Venus achieves SOTA performance on both UI grounding and navigation tasks using only several hundred thousand high-quality training samples through reinforcement finetune (RFT) based on Qwen2.5-VL. Specifically, the 7B and 72B variants of UI-Venus obtain 94.1% / 50.8% and 95.3% / 61.9% on the standard grounding benchmarks, i.e., Screenspot-V2 / Pro, surpassing the previous SOTA baselines including open-source GTA1 and closed-source UI-TARS-1.5.To show UI-Venus's summary and planing ability, we also evaluate it on the AndroidWorld, an online UI navigation arena, on which our 7B and 72B variants achieve 49.1% and 65.9% success rate, also beating existing models.To achieve this, we introduce carefully designed reward functions for both UI grounding and navigation tasks and corresponding efficient data cleaning strategies.To further boost navigation performance, we propose Self-Evolving Trajectory History Alignment \& Sparse Action Enhancement that refine historical reasoning traces and balances the distribution of sparse but critical actions, leading to more coherent planning and better generalization in complex UI tasks. Our contributions include the publish of SOTA open-source UI agents, comprehensive data cleaning protocols and a novel self-evolving framework for improving navigation performance, which encourage further research and development in the community. Code is available at https://github.com/antgroup/UI-Venus.
Community
🔥🔥🔥Code: https://github.com/inclusionAI/UI-Venus
🚀🚀🚀Models: https://huggingface.co/collections/inclusionAI/ui-venus-689f2fb01a4234cbce91c56a
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- MagicGUI: A Foundational Mobile GUI Agent with Scalable Data Pipeline and Reinforcement Fine-tuning (2025)
- NaviMaster: Learning a Unified Policy for GUI and Embodied Navigation Tasks (2025)
- UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding (2025)
- MobileGUI-RL: Advancing Mobile GUI Agent through Reinforcement Learning in Online Environment (2025)
- GTA1: GUI Test-time Scaling Agent (2025)
- InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization (2025)
- MobileUse: A GUI Agent with Hierarchical Reflection for Autonomous Mobile Operation (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Models citing this paper 4
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper