Abstract
People are spending an enormous amount of time on digital devices through graphical user interfaces (GUIs), e.g., computer or smartphone screens. Large language models (LLMs) such as ChatGPT can assist people in tasks like writing emails, but struggle to understand and interact with GUIs, thus limiting their potential to increase automation levels. In this paper, we introduce CogAgent, an 18-billion-parameter visual language model (VLM) specializing in GUI understanding and navigation. By utilizing both low-resolution and high-resolution image encoders, CogAgent supports input at a resolution of 1120×1120, enabling it to recognize tiny page elements and text. As a generalist visual language model, CogAgent achieves the state of the art on five text-rich and four general VQA benchmarks, including VQAv2, OK-VQA, Text-VQA, ST-VQA, ChartQA, InfoVQA, DocVQA, MM-Vet, and POPE. Using only screenshots as input, CogAgent outperforms LLM-based methods that consume extracted HTML text on both PC and Android GUI navigation tasks (Mind2Web and AITW), advancing the state of the art. The model and code are available at https://github.com/THUDM/CogVLM.
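The dual-resolution design mentioned in the abstract can be pictured with a short sketch. The PyTorch snippet below is a minimal, hypothetical illustration of the idea only, not CogAgent's actual implementation: the module names, the stand-in convolutional encoders, the 224×224 low-resolution branch, and the feature dimensions are assumptions made for clarity.

```python
# Hypothetical sketch of a dual-resolution visual encoder: a low-resolution
# branch summarizes the whole screenshot, a high-resolution branch preserves
# tiny UI elements and text, and cross-attention fuses the two streams before
# the visual tokens are handed to the language model. All names and sizes are
# illustrative, not CogAgent's real architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualResolutionEncoder(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        # Stand-ins for the two vision backbones (e.g., ViT-style patch encoders).
        self.low_res_encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)   # 224x224 input
        self.high_res_encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # 1120x1120 input
        # Cross-attention lets the coarse tokens query fine-grained detail.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, screenshot):
        # screenshot: (B, 3, H, W) GUI image at its native resolution.
        low = F.interpolate(screenshot, size=(224, 224), mode="bilinear")
        high = F.interpolate(screenshot, size=(1120, 1120), mode="bilinear")
        low_tokens = self.low_res_encoder(low).flatten(2).transpose(1, 2)     # (B, 196, dim)
        high_tokens = self.high_res_encoder(high).flatten(2).transpose(1, 2)  # (B, 4900, dim)
        fused, _ = self.cross_attn(low_tokens, high_tokens, high_tokens)
        return fused  # visual tokens to prepend to the language model input


if __name__ == "__main__":
    model = DualResolutionEncoder()
    dummy = torch.randn(1, 3, 1080, 1920)  # a typical PC screenshot
    print(model(dummy).shape)  # torch.Size([1, 196, 1024])
```

The point of the sketch is the asymmetry: running a full encoder at 1120×1120 alone would be expensive, so the high-resolution features are consumed only through cross-attention rather than expanding the token sequence itself.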
Community
Librarian Bot (automated): the following similar papers were recommended by the Semantic Scholar API.
- Large Language Models are Visual Reasoning Coordinators (2023)
- PILL: Plug Into LLM with Adapter Expert and Attention Gate (2023)
- VIM: Probing Multimodal Large Language Models for Visual Embedded Instruction Following (2023)
- InfMLLM: A Unified Framework for Visual-Language Tasks (2023)
- ViLaM: A Vision-Language Model with Enhanced Visual Grounding and Generalization Capability (2023)
Models citing this paper: 3
Datasets citing this paper: 0