---
license: mit
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: visual-question-answering
---

## Introduction
This repository contains **UI-R1-E-3B**, an efficient GUI grounding model presented in [UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning](https://huggingface.co/papers/2503.21620).

Project page: https://github.com/lll6gg/UI-R1

Previous version: [UI-R1-3B](https://huggingface.co/LZXzju/Qwen2.5-VL-3B-UI-R1)

## Benchmark 1: ScreenSpotV2

| ScreenSpotV2  | inference mode | Mobile-T | Mobile-I | Desktop-T | Desktop-I | Web-T    | Web-I    | Avg↑ / Len↓        |
| ------------- | -------------- | -------- | -------- | --------- | --------- | -------- | -------- | ----------------- |
| OS-ATLAS-7B   | w/o thinking   | 95.2     | 75.8     | 90.7      | 63.6      | 90.6     | 77.3     | 84.1 / -          |
| UI-TARS-7B    | w/o thinking   | 95.2     | 79.1     | 90.7      | 68.6      | 90.6     | 78.3     | 84.7 / -          |
| UI-R1-3B (v1) | w/ thinking    | 96.2     | **84.3** | 92.3      | 63.6      | 89.2     | 75.4     | 85.4 / 67         |
| GUI-R1-3B     | w/ thinking    | 97.6     | 78.2     | 94.3      | 64.3      | 91.0     | 72.4     | 85.0 / 80         |
| UI-R1-3B (v2) | w/ thinking    | 97.6     | 79.6     | 92.3      | 67.9      | 88.9     | 77.8     | 85.8 / 60         |
| **UI-R1-E-3B**    | w/o thinking   | **98.2** | 83.9     | **94.8**  | **75.0**  | **93.2** | **83.7** | **89.5** / **28** |

## Benchmark 2: ScreenSpot-Pro

| ScreenSpot-Pro | inference mode | Average Length↓ | Average Accuracy↑ |
| -------------- | -------------- | --------------- | ---------------- |
| UGround-7B     | w/o thinking   | -               | 16.5             |
| OS-ATLAS-7B    | w/o thinking   | -               | 18.9             |
| UI-R1-3B (v1)  | w/ thinking    | 102             | 17.8             |
| GUI-R1-3B      | w/ thinking    | 114             | 26.6             |
| UI-R1-3B (v2)  | w/ thinking    | 129             | 29.8             |
| **UI-R1-E-3B**     | w/o thinking   | **28**          | **33.5**         |

## Leaderboard: UI-I2E-Bench
|     Model      | ScreenSpot | UI-I2E-Bench Avg | ScreenSpot-Pro | Avg  |
| :------------: | :--------: | :--------------: | :------------: | :--: |
| UI-TARS-1.5-7B |    88.1    |       73.2       |      42.2      | 67.8 |
| Uground-V1-72B |    89.7    |       76.3       |      34.3      | 66.8 |
|  UI-TARS-72B   |    88.4    |       73.7       |      38.1      | 66.7 |
|   **UI-R1-E-3B**   |    89.2    |       69.1       |      33.5      | 63.9 |
| Uground-V1-7B  |    87.1    |       70.3       |      31.1      | 62.8 |
|   InfiGUI-R1   |    87.5    |       69.7       |      29.6      | 62.3 |
|   UI-TARS-7B   |    89.5    |       61.4       |      35.7      | 62.2 |
| Qwen2.5-VL-72B |    87.1    |       51.4       |      43.6      | 60.7 |
| UI-I2E-VLM-7B  |    82.5    |       69.5       |      23.6      | 58.5 |
|   UI-TARS-2B   |    82.3    |        62        |      27.7      | 57.3 |
| Qwen2.5-VL-7B  |    84.7    |       53.8       |       29       | 55.8 |
| OmniParser-V2  |     72     |       54.8       |      39.6      | 55.5 |
| Uground-V1-2B  |    78.8    |       57.4       |      26.6      | 54.3 |
|  OS-Atlas-7B   |    82.5    |       58.6       |      18.9      | 53.3 |
|     **UI-R1-3B**      |    83.3    |       58.5       |      17.8      | 53.2 |
|   UGround-7B   |    74.1    |       54.2       |      16.5      | 48.3 |
| UI-I2E-VLM-4B  |    70.4    |       53.4       |      12.2      | 45.3 |
|   OmniParser   |    73.9    |       53.1       |      8.3       | 45.1 |
|   ShowUI-2B    |    76.8    |       41.5       |      7.7       |  42  |
| Qwen2.5-VL-3B  |    55.5    |       41.7       |      23.9      | 41.3 |
|   Aguvis-7B    |    84.4    |       53.2       |      22.9      | 40.4 |
|  OS-Atlas-4B   |    70.1    |       44.3       |      3.7       | 39.4 |
|  Qwen2-VL-7B   |    42.6    |       48.7       |      1.6       |  31  |
|    Seeclick    |    55.8    |       26.4       |      1.1       | 27.8 |
|  InternVL2-4B  |    4.2     |       0.9        |      0.3       | 1.8  |

## Evaluation Code for GUI Grounding

1. Generate a prediction with UI-R1-E-3B:

   ```python
   import torch
   from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
   from qwen_vl_utils import process_vision_info

   # args.model_path, ori_processor_path, task_prompt, image_path, rank and
   # extract_coord are assumed to be provided by the surrounding evaluation script.
   model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
       args.model_path,
       torch_dtype=torch.bfloat16,
       attn_implementation="flash_attention_2",
       device_map="cpu",
   )
   model = model.to(torch.device(rank))  # move to the GPU assigned to this worker
   model = model.eval()
   processor = AutoProcessor.from_pretrained(ori_processor_path)

   # Grounding prompt (kept verbatim from the evaluation setup).
   question_template = (
       f"In this UI screenshot, I want to perform the command '{task_prompt}'.\n"
       "Please provide the action to perform (enumerate in ['click'])"
       "and the coordinate where the cursor is moved to(integer) if click is performed.\n"
       "Output the final answer in <answer> </answer> tags directly."
       "The output answer format should be as follows:\n"
       "<answer>[{'action': 'click', 'coordinate': [x, y]}]</answer>\n"
       "Please strictly follow the format."
   )
   query = '<image>\n' + question_template

   messages = [
       {
           "role": "user",
           "content": [
               {"type": "image", "image": image_path},
               {"type": "text", "text": query},
           ],
       }
   ]
   text = processor.apply_chat_template(
       messages, tokenize=False, add_generation_prompt=True
   )
   image_inputs, video_inputs = process_vision_info(messages)
   inputs = processor(
       text=[text],
       images=image_inputs,
       videos=video_inputs,
       padding=True,
       return_tensors="pt",
   )
   inputs = inputs.to(model.device)  # inputs must be on the same device as the model

   generated_ids = model.generate(**inputs, max_new_tokens=1024)
   # Drop the prompt tokens so only the newly generated answer is decoded.
   generated_ids_trimmed = [
       out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
   ]
   response = processor.batch_decode(
       generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
   )[0]
   pred_coord, _ = extract_coord(response)  # parse the predicted click point
   ```
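
   The helper `extract_coord` comes from the evaluation script and is not shown here. A minimal sketch of what it might look like, assuming the model follows the `<answer>[{'action': 'click', 'coordinate': [x, y]}]</answer>` format requested by the prompt:

   ```python
   import ast
   import re

   def extract_coord(response: str):
       """Hypothetical parser for the expected answer format.

       Returns ([x, y], True) on success and ([0, 0], False) otherwise;
       the original helper may handle malformed outputs differently.
       """
       match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
       if match is None:
           return [0, 0], False
       try:
           actions = ast.literal_eval(match.group(1).strip())
           coord = [int(v) for v in actions[0]["coordinate"]]
           return coord, True
       except (ValueError, SyntaxError, KeyError, IndexError, TypeError):
           return [0, 0], False
   ```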

   

2. Rescale the predicted coordinate back to the original image resolution:

   ```python
   from PIL import Image

   # The model sees the image after Qwen's smart_resize, so its predicted
   # coordinates live in the resized space and must be mapped back.
   image = Image.open(image_path)
   origin_width, origin_height = image.size
   resized_height, resized_width = smart_resize(origin_height, origin_width, max_pixels=12845056)
   scale_x = origin_width / resized_width
   scale_y = origin_height / resized_height
   pred_coord[0] = int(pred_coord[0] * scale_x)
   pred_coord[1] = int(pred_coord[1] * scale_y)
   ```

   The `smart_resize` function comes from Qwen2-VL:

   ```python
   import math

   def smart_resize(
       height: int, width: int, factor: int = 28, min_pixels: int = 56 * 56, max_pixels: int = 14 * 14 * 4 * 1280
   ):
       """Rescales the image so that the following conditions are met:

       1. Both dimensions (height and width) are divisible by 'factor'.
       2. The total number of pixels is within the range ['min_pixels', 'max_pixels'].
       3. The aspect ratio of the image is maintained as closely as possible.
       """
       if height < factor or width < factor:
           raise ValueError(f"height:{height} or width:{width} must be larger than factor:{factor}")
       elif max(height, width) / min(height, width) > 200:
           raise ValueError(
               f"absolute aspect ratio must be smaller than 200, got {max(height, width) / min(height, width)}"
           )
       h_bar = round(height / factor) * factor
       w_bar = round(width / factor) * factor
       if h_bar * w_bar > max_pixels:
           # Too many pixels: shrink both sides by beta, rounding down to a multiple of factor.
           beta = math.sqrt((height * width) / max_pixels)
           h_bar = math.floor(height / beta / factor) * factor
           w_bar = math.floor(width / beta / factor) * factor
       elif h_bar * w_bar < min_pixels:
           # Too few pixels: grow both sides by beta, rounding up to a multiple of factor.
           beta = math.sqrt(min_pixels / (height * width))
           h_bar = math.ceil(height * beta / factor) * factor
           w_bar = math.ceil(width * beta / factor) * factor
       return h_bar, w_bar
   ```
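
   As a quick sanity check of the rescaling above (a sketch using the same `max_pixels=12845056` budget): a 1920×1080 screenshot fits within the pixel budget, so both sides are simply rounded to the nearest multiple of 28.

   ```python
   # 1080 -> 39 * 28 = 1092 and 1920 -> 69 * 28 = 1932; the pixel budget is not exceeded.
   h_bar, w_bar = smart_resize(1080, 1920, max_pixels=12845056)
   print(h_bar, w_bar)  # 1092 1932
   # A predicted click at (966, 546) in the resized image maps back to:
   print(int(966 * 1920 / 1932), int(546 * 1080 / 1092))  # 960 540
   ```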