---
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
license: mit
library_name: transformers
pipeline_tag: image-text-to-text
---

# GUI-Actor-3B with Qwen2.5-VL-3B as backbone VLM

This model was introduced in the paper [**GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents**](https://huggingface.co/papers/2506.03143).
It is built on [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct), augmented with an attention-based action head, and fine-tuned to perform GUI grounding on the dataset [here (coming soon)]().

For more details on model design and evaluation, please check: [🏠 Project Page](https://microsoft.github.io/GUI-Actor/) | [💻 GitHub Repo](https://github.com/microsoft/GUI-Actor) | [📑 Paper](https://www.arxiv.org/pdf/2506.03143).
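
Instead of emitting coordinate tokens as text, the attention-based action head scores every visual patch token against a dedicated pointer token and reads click candidates off the resulting attention map. Below is a minimal, illustrative sketch of that idea; the function name, tensor shapes, and plain dot-product scoring are assumptions for exposition, not the official implementation (see the GitHub repo for the real one):

```python
import torch
import torch.nn.functional as F

def attention_action_head(pointer_state, patch_states, patch_centers, topk=3):
    # Illustrative sketch only (hypothetical names/shapes):
    # pointer_state: (d,)   hidden state of the dedicated pointer/action token
    # patch_states:  (n, d) hidden states of the n visual patch tokens
    # patch_centers: (n, 2) normalized (x, y) center of each image patch
    d = pointer_state.shape[-1]
    scores = patch_states @ pointer_state / d ** 0.5  # (n,) attention logits
    probs = F.softmax(scores, dim=-1)                 # distribution over patches
    top = probs.topk(topk)
    points = [tuple(patch_centers[i].tolist()) for i in top.indices]
    return points, top.values                         # candidate click points + scores

# toy usage: 4 patches on a 2x2 grid, random 8-dim features
centers = torch.tensor([[0.25, 0.25], [0.75, 0.25], [0.25, 0.75], [0.75, 0.75]])
points, confs = attention_action_head(torch.randn(8), torch.randn(4, 8), centers, topk=2)
```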

| Model Name | Hugging Face Link |
|--------------------------------------------|--------------------------------------------|
| **GUI-Actor-7B-Qwen2-VL** | [🤗 Hugging Face](https://huggingface.co/microsoft/GUI-Actor-7B-Qwen2-VL) |
| **GUI-Actor-2B-Qwen2-VL** | [🤗 Hugging Face](https://huggingface.co/microsoft/GUI-Actor-2B-Qwen2-VL) |
| **GUI-Actor-7B-Qwen2.5-VL (coming soon)** | [🤗 Hugging Face](https://huggingface.co/microsoft/GUI-Actor-7B-Qwen2.5-VL) |
| **GUI-Actor-3B-Qwen2.5-VL (coming soon)** | [🤗 Hugging Face](https://huggingface.co/microsoft/GUI-Actor-3B-Qwen2.5-VL) |
| **GUI-Actor-Verifier-2B** | [🤗 Hugging Face](https://huggingface.co/microsoft/GUI-Actor-Verifier-2B) |

## 📊 Performance Comparison on GUI Grounding Benchmarks
Table 1. Main results on ScreenSpot-Pro, ScreenSpot, and ScreenSpot-v2 with **Qwen2-VL** as the backbone. † indicates scores obtained from our own evaluation of the official models on Hugging Face.
| Method | Backbone VLM | ScreenSpot-Pro | ScreenSpot | ScreenSpot-v2 |
|------------------|--------------|----------------|------------|----------------|
| **_72B models:_** | | | | |
| AGUVIS-72B | Qwen2-VL | - | 89.2 | - |
| UGround-V1-72B | Qwen2-VL | 34.5 | **89.4** | - |
| UI-TARS-72B | Qwen2-VL | **38.1** | 88.4 | **90.3** |
| **_7B models:_** | | | | |
| OS-Atlas-7B | Qwen2-VL | 18.9 | 82.5 | 84.1 |
| AGUVIS-7B | Qwen2-VL | 22.9 | 84.4 | 86.0† |
| UGround-V1-7B | Qwen2-VL | 31.1 | 86.3 | 87.6† |
| UI-TARS-7B | Qwen2-VL | 35.7 | **89.5** | **91.6** |
| GUI-Actor-7B | Qwen2-VL | **40.7** | 88.3 | 89.5 |
| GUI-Actor-7B + Verifier | Qwen2-VL | 44.2 | 89.7 | 90.9 |
| **_2B models:_** | | | | |
| UGround-V1-2B | Qwen2-VL | 26.6 | 77.1 | - |
| UI-TARS-2B | Qwen2-VL | 27.7 | 82.3 | 84.7 |
| GUI-Actor-2B | Qwen2-VL | **36.7** | **86.5** | **88.6** |
| GUI-Actor-2B + Verifier | Qwen2-VL | 41.8 | 86.9 | 89.3 |

Table 2. Main results on ScreenSpot-Pro and ScreenSpot-v2 with **Qwen2.5-VL** as the backbone.
| Method | Backbone VLM | ScreenSpot-Pro | ScreenSpot-v2 |
|----------------|---------------|----------------|----------------|
| **_7B models:_** | | | |
| Qwen2.5-VL-7B | Qwen2.5-VL | 27.6 | 88.8 |
| Jedi-7B | Qwen2.5-VL | 39.5 | 91.7 |
| GUI-Actor-7B | Qwen2.5-VL | **44.6** | **92.1** |
| GUI-Actor-7B + Verifier | Qwen2.5-VL | 47.7 | 92.5 |
| **_3B models:_** | | | |
| Qwen2.5-VL-3B | Qwen2.5-VL | 25.9 | 80.9 |
| Jedi-3B | Qwen2.5-VL | 36.1 | 88.6 |
| GUI-Actor-3B | Qwen2.5-VL | **42.2** | **91.0** |
| GUI-Actor-3B + Verifier | Qwen2.5-VL | 45.9 | 92.4 |
## 🚀 Usage
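
The example below assumes the `gui_actor` package is installed from the [GitHub repo](https://github.com/microsoft/GUI-Actor) (e.g. clone it and run `pip install -e .`), along with `qwen-vl-utils`, `datasets`, and `flash-attn` for the `flash_attention_2` implementation used here.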
```python
import torch

from qwen_vl_utils import process_vision_info
from datasets import load_dataset
from transformers import Qwen2VLProcessor
from gui_actor.constants import chat_template
from gui_actor.modeling_qwen25vl import Qwen2_5_VLForConditionalGenerationWithPointer
from gui_actor.inference import inference


# load model
model_name_or_path = "microsoft/GUI-Actor-3B-Qwen2.5-VL"
data_processor = Qwen2VLProcessor.from_pretrained(model_name_or_path)
tokenizer = data_processor.tokenizer
model = Qwen2_5_VLForConditionalGenerationWithPointer.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2"
).eval()

# prepare example
dataset = load_dataset("rootsautomation/ScreenSpot")["test"]
example = dataset[0]
print(f"Instruction: {example['instruction']}")
print(f"ground-truth action region (x1, y1, x2, y2): {[round(i, 2) for i in example['bbox']]}")

conversation = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are a GUI agent. You are given a task and a screenshot of the screen. You need to perform a series of pyautogui actions to complete the task.",
            }
        ]
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": example["image"],  # PIL.Image.Image or str to path
                # "image_url": "https://xxxxx.png" or "https://xxxxx.jpg" or "file://xxxxx.png" or "data:image/png;base64,xxxxxxxx", will be split by "base64,"
            },
            {
                "type": "text",
                "text": example["instruction"]
            },
        ],
    },
]

# inference
pred = inference(conversation, model, tokenizer, data_processor, use_placeholder=True, topk=3)
px, py = pred["topk_points"][0]
print(f"Predicted click point: [{round(px, 4)}, {round(py, 4)}]")

# >> Model Response
# Instruction: close this window
# ground-truth action region (x1, y1, x2, y2): [0.9479, 0.1444, 0.9938, 0.2074]
# Predicted click point: [0.9709, 0.1548]
```
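
The returned point is normalized to `[0, 1]` with respect to the image width and height, matching the ground-truth bbox format above. As an optional sanity check (our addition, not part of the official example; the output filename is made up), you can scale the prediction back to pixels and mark it on the screenshot:

```python
# Unofficial follow-up sketch: visualize the prediction on the screenshot.
from PIL import ImageDraw

img = example["image"].copy()
W, H = img.size
draw = ImageDraw.Draw(img)

# ground-truth region, scaled from normalized (x1, y1, x2, y2) to pixels
x1, y1, x2, y2 = example["bbox"]
draw.rectangle([x1 * W, y1 * H, x2 * W, y2 * H], outline="green", width=3)

# predicted click point, drawn as a small red dot
r = 5
draw.ellipse([px * W - r, py * H - r, px * W + r, py * H + r], fill="red")
img.save("grounding_check.png")
```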

## 📝 Citation
```bibtex
@article{wu2025guiactor,
  title={GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents},
  author={Qianhui Wu and Kanzhi Cheng and Rui Yang and Chaoyun Zhang and Jianwei Yang and Huiqiang Jiang and Jian Mu and Baolin Peng and Bo Qiao and Reuben Tan and Si Qin and Lars Liden and Qingwei Lin and Huan Zhang and Tong Zhang and Jianbing Zhang and Dongmei Zhang and Jianfeng Gao},
  year={2025},
  eprint={2506.03143},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://www.arxiv.org/pdf/2506.03143},
}
```