LLM-Powered GUI Agents in Phone Automation: Surveying Progress and Prospects
Abstract
With the rapid rise of large language models (LLMs), phone automation has undergone transformative changes. This paper systematically reviews LLM-driven phone GUI agents, highlighting their evolution from script-based automation to intelligent, adaptive systems. We first contextualize three key challenges, namely (i) limited generality, (ii) high maintenance overhead, and (iii) weak intent comprehension, and show how LLMs address these issues through advanced language understanding, multimodal perception, and robust decision-making. We then propose a taxonomy covering fundamental agent frameworks (single-agent, multi-agent, plan-then-act), modeling approaches (prompt engineering, training-based), and essential datasets and benchmarks. Furthermore, we detail task-specific architectures, supervised fine-tuning, and reinforcement learning strategies that bridge user intent and GUI operations. Finally, we discuss open challenges such as dataset diversity, on-device deployment efficiency, user-centric adaptation, and security concerns, offering forward-looking insights into this rapidly evolving field. By providing a structured overview and identifying pressing research gaps, this paper serves as a definitive reference for researchers and practitioners seeking to harness LLMs in designing scalable, user-friendly phone GUI agents.
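To make the plan-then-act framework from the taxonomy concrete, the sketch below shows how such an agent might first ask an LLM for a high-level plan and then ground each step into a GUI action on the current screen. The callables `call_llm`, `get_ui_tree`, and `perform`, as well as the action format, are illustrative assumptions, not the paper's implementation.

```python
# Minimal plan-then-act sketch for an LLM-driven phone GUI agent.
# call_llm, get_ui_tree, and perform are hypothetical callables supplied by
# the caller (LLM client, UI dump, and device controller); they are not
# APIs defined in the surveyed paper.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # e.g. "tap", "type", "scroll", or "done"
    target: str = ""   # description or resource id of the UI element
    text: str = ""     # text to type, if any

def plan(goal: str, call_llm) -> list[str]:
    """Ask the LLM for a high-level plan before touching the GUI."""
    prompt = f"Break this phone task into short numbered steps:\nTask: {goal}"
    return [line for line in call_llm(prompt).splitlines() if line.strip()]

def act(step: str, screen: str, call_llm) -> Action:
    """Ground one plan step into a concrete action for the current screen."""
    prompt = (
        "You control an Android phone.\n"
        f"Current screen elements:\n{screen}\n"
        f"Current step: {step}\n"
        "Reply exactly as: <kind>|<target>|<text>"
    )
    parts = (call_llm(prompt).split("|") + ["", ""])[:3]
    return Action(parts[0].strip(), parts[1].strip(), parts[2].strip())

def run(goal: str, call_llm, get_ui_tree, perform) -> None:
    """Plan once, then execute each step against the live UI."""
    for step in plan(goal, call_llm):
        action = act(step, get_ui_tree(), call_llm)
        if action.kind == "done":
            break
        perform(action)
```

Separating planning from step-wise grounding lets the agent re-read the screen before every action, which is part of why plan-then-act designs tolerate dynamic UIs better than fixed scripts.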
Community
🔥 Must-read papers for LLM-Powered Phone GUI Agents: github.com/PhoneLLM/Awesome-LLM-Powered-Phone-GUI-Agents
🪧 Milestones
Milestones in the development of LLM-powered phone GUI agents. This figure divides advancements into four primary parts: Prompt Engineering, Training-Based Methods, Datasets, and Benchmarks. Prompt Engineering leverages pre-trained LLMs by strategically crafting input prompts to perform specific tasks without modifying model parameters. In contrast, Training-Based Methods involve adapting LLMs via supervised fine-tuning or reinforcement learning on GUI-specific data, thereby enhancing their ability to understand and interact with mobile UIs.
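As a complement to the prompt-only loop above, the following hedged sketch illustrates the training-based branch: supervised fine-tuning of a small causal LM on (screen, instruction, action) triples. The base model, dataset fields, and hyperparameters are assumptions chosen for illustration, not the configurations used in the surveyed works.

```python
# Hedged sketch: supervised fine-tuning (SFT) of a causal LM on GUI data.
# The toy episode format and the "gpt2" base model are illustrative only.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no dedicated pad token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Toy episodes: serialized screen + user instruction mapped to a GUI action.
episodes = [
    {"screen": "[Button 'Settings'] [Switch 'Wi-Fi' off]",
     "instruction": "Turn on Wi-Fi",
     "action": "tap(Switch 'Wi-Fi')"},
]

def tokenize(example):
    prompt = (f"Screen: {example['screen']}\n"
              f"Instruction: {example['instruction']}\nAction: ")
    full = prompt + example["action"] + tokenizer.eos_token
    enc = tokenizer(full, truncation=True, max_length=256, padding="max_length")
    prompt_len = len(tokenizer(prompt)["input_ids"])
    # Compute the loss only on the action tokens, not the prompt or padding.
    enc["labels"] = [
        tok if mask == 1 and i >= prompt_len else -100
        for i, (tok, mask) in enumerate(zip(enc["input_ids"],
                                            enc["attention_mask"]))
    ]
    return enc

train_data = Dataset.from_list(episodes).map(
    tokenize, remove_columns=["screen", "instruction", "action"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gui-sft", num_train_epochs=1,
                           per_device_train_batch_size=1, report_to="none"),
    train_dataset=train_data,
)
trainer.train()
```

At inference time the fine-tuned model is prompted with the same "Screen / Instruction / Action:" template and its completion is parsed as the predicted GUI action.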
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- AppAgentX: Evolving GUI Agents as Proficient Smartphone Users (2025)
- Towards Agentic Recommender Systems in the Era of Multimodal Large Language Models (2025)
- A Survey of WebAgents: Towards Next-Generation AI Agents for Web Automation with Large Foundation Models (2025)
- LearnAct: Few-Shot Mobile GUI Agent with a Unified Demonstration Benchmark (2025)
- Toward a Human-Centered Evaluation Framework for Trustworthy LLM-Powered GUI Agents (2025)
- UFO2: The Desktop AgentOS (2025)
- API Agents vs. GUI Agents: Divergence and Convergence (2025)