LLM-Powered GUI Agents in Phone Automation: Surveying Progress and Prospects
Abstract
With the rapid rise of large language models (LLMs), phone automation has undergone transformative changes. This paper systematically reviews LLM-driven phone GUI agents, highlighting their evolution from script-based automation to intelligent, adaptive systems. We first contextualize three key challenges, namely (i) limited generality, (ii) high maintenance overhead, and (iii) weak intent comprehension, and show how LLMs address these issues through advanced language understanding, multimodal perception, and robust decision-making. We then propose a taxonomy covering fundamental agent frameworks (single-agent, multi-agent, plan-then-act), modeling approaches (prompt engineering, training-based), and essential datasets and benchmarks. Furthermore, we detail task-specific architectures, supervised fine-tuning, and reinforcement learning strategies that bridge user intent and GUI operations. Finally, we discuss open challenges such as dataset diversity, on-device deployment efficiency, user-centric adaptation, and security concerns, offering forward-looking insights into this rapidly evolving field. By providing a structured overview and identifying pressing research gaps, this paper serves as a definitive reference for researchers and practitioners seeking to harness LLMs in designing scalable, user-friendly phone GUI agents.
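To make the plan-then-act framework from the taxonomy concrete, the sketch below shows how such an agent might first ask an LLM for a high-level plan and then ground each step into a GUI action on the current screen. The callables `call_llm`, `get_ui_tree`, and `perform`, as well as the action format, are illustrative assumptions, not the paper's implementation.

```python
# Minimal plan-then-act sketch for an LLM-driven phone GUI agent.
# call_llm, get_ui_tree, and perform are hypothetical callables supplied by
# the caller (LLM client, UI dump, and device controller); they are not
# APIs defined in the surveyed paper.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # e.g. "tap", "type", "scroll", or "done"
    target: str = ""   # description or resource id of the UI element
    text: str = ""     # text to type, if any

def plan(goal: str, call_llm) -> list[str]:
    """Ask the LLM for a high-level plan before touching the GUI."""
    prompt = f"Break this phone task into short numbered steps:\nTask: {goal}"
    return [line for line in call_llm(prompt).splitlines() if line.strip()]

def act(step: str, screen: str, call_llm) -> Action:
    """Ground one plan step into a concrete action for the current screen."""
    prompt = (
        "You control an Android phone.\n"
        f"Current screen elements:\n{screen}\n"
        f"Current step: {step}\n"
        "Reply exactly as: <kind>|<target>|<text>"
    )
    parts = (call_llm(prompt).split("|") + ["", ""])[:3]
    return Action(parts[0].strip(), parts[1].strip(), parts[2].strip())

def run(goal: str, call_llm, get_ui_tree, perform) -> None:
    """Plan once, then execute each step against the live UI."""
    for step in plan(goal, call_llm):
        action = act(step, get_ui_tree(), call_llm)
        if action.kind == "done":
            break
        perform(action)
```

Separating planning from step-wise grounding lets the agent re-read the screen before every action, which is part of why plan-then-act designs tolerate dynamic UIs better than fixed scripts.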
Community
🔥 Must-read papers for LLM-Powered Phone GUI Agents: github.com/PhoneLLM/Awesome-LLM-Powered-Phone-GUI-Agents
🪧 Milestones
Milestones in the development of LLM-powered phone GUI agents. This figure divides advancements into four primary parts: Prompt Engineering, Training-Based Methods, Datasets, and Benchmarks. Prompt Engineering leverages pre-trained LLMs by strategically crafting input prompts to perform specific tasks without modifying model parameters. In contrast, Training-Based Methods involve adapting LLMs via supervised fine-tuning or reinforcement learning on GUI-specific data, thereby enhancing their ability to understand and interact with mobile UIs.
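As a complement to the prompt-only loop above, the following hedged sketch illustrates the training-based branch: supervised fine-tuning of a small causal LM on (screen, instruction, action) triples. The base model, dataset fields, and hyperparameters are assumptions chosen for illustration, not the configurations used in the surveyed works.

```python
# Hedged sketch: supervised fine-tuning (SFT) of a causal LM on GUI data.
# The toy episode format and the "gpt2" base model are illustrative only.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no dedicated pad token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Toy episodes: serialized screen + user instruction mapped to a GUI action.
episodes = [
    {"screen": "[Button 'Settings'] [Switch 'Wi-Fi' off]",
     "instruction": "Turn on Wi-Fi",
     "action": "tap(Switch 'Wi-Fi')"},
]

def tokenize(example):
    prompt = (f"Screen: {example['screen']}\n"
              f"Instruction: {example['instruction']}\nAction: ")
    full = prompt + example["action"] + tokenizer.eos_token
    enc = tokenizer(full, truncation=True, max_length=256, padding="max_length")
    prompt_len = len(tokenizer(prompt)["input_ids"])
    # Compute the loss only on the action tokens, not the prompt or padding.
    enc["labels"] = [
        tok if mask == 1 and i >= prompt_len else -100
        for i, (tok, mask) in enumerate(zip(enc["input_ids"],
                                            enc["attention_mask"]))
    ]
    return enc

train_data = Dataset.from_list(episodes).map(
    tokenize, remove_columns=["screen", "instruction", "action"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gui-sft", num_train_epochs=1,
                           per_device_train_batch_size=1, report_to="none"),
    train_dataset=train_data,
)
trainer.train()
```

At inference time the fine-tuned model is prompted with the same "Screen / Instruction / Action:" template and its completion is parsed as the predicted GUI action.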
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- AppAgentX: Evolving GUI Agents as Proficient Smartphone Users (2025)
- Towards Agentic Recommender Systems in the Era of Multimodal Large Language Models (2025)
- A Survey of WebAgents: Towards Next-Generation AI Agents for Web Automation with Large Foundation Models (2025)
- LearnAct: Few-Shot Mobile GUI Agent with a Unified Demonstration Benchmark (2025)
- Toward a Human-Centered Evaluation Framework for Trustworthy LLM-Powered GUI Agents (2025)
- UFO2: The Desktop AgentOS (2025)
- API Agents vs. GUI Agents: Divergence and Convergence (2025)