---
license: mit
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: visual-question-answering
---
## Introduction
This repository contains **UI-R1-E-3B**, an efficient GUI grounding model presented in [UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning](https://huggingface.co/papers/2503.21620).
Project page: https://github.com/lll6gg/UI-R1
## Benchmark 1: ScreenSpotV2
| ScreenSpotV2 | inference mode | Mobile-T | Mobile-I | Desktop-T | Desktop-I | Web-T | Web-I | Avg↑ / Len↓ |
| ------------- | -------------- | -------- | -------- | --------- | --------- | -------- | -------- | ----------------- |
| OS-ATLAS-7B | w/o thinking | 95.2 | 75.8 | 90.7 | 63.6 | 90.6 | 77.3 | 84.1 / - |
| UI-TARS-7B | w/o thinking | 95.2 | 79.1 | 90.7 | 68.6 | 90.6 | 78.3 | 84.7 / - |
| UI-R1-3B (v1) | w/ thinking | 96.2 | **84.3** | 92.3 | 63.6 | 89.2 | 75.4 | 85.4 / 67 |
| GUI-R1-3B | w/ thinking | 97.6 | 78.2 | 94.3 | 64.3 | 91.0 | 72.4 | 85.0 / 80 |
| UI-R1-3B (v2) | w/ thinking | 97.6 | 79.6 | 92.3 | 67.9 | 88.9 | 77.8 | 85.8 / 60 |
| **UI-R1-E-3B** | w/o thinking | **98.2** | 83.9 | **94.8** | **75.0** | **93.2** | **83.7** | **89.5** / **28** |
## Benchmark 2: ScreenSpot-Pro
| ScreenSpot-Pro | inference mode | Average Length↓ | Average Accuracy↑ |
| -------------- | -------------- | --------------- | ---------------- |
| UGround-7B | w/o thinking | - | 16.5 |
| OS-ATLAS-7B | w/o thinking | - | 18.9 |
| UI-R1-3B (v1) | w/ thinking | 102 | 17.8 |
| GUI-R1-3B | w/ thinking | 114 | 26.6 |
| UI-R1-3B (v2) | w/ thinking | 129 | 29.8 |
| **UI-R1-E-3B** | w/o thinking | **28** | **33.5** |
## Leaderboard: UI-I2E-Bench
| Model | ScreenSpot | UI-I2E-Bench Avg | ScreenSpot-Pro | Avg |
| :------------: | :--------: | :--------------: | :------------: | :--: |
| UI-TARS-1.5-7B | 88.1 | 73.2 | 42.2 | 67.8 |
| Uground-V1-72B | 89.7 | 76.3 | 34.3 | 66.8 |
| UI-TARS-72B | 88.4 | 73.7 | 38.1 | 66.7 |
| **UI-R1-E-3B** | 89.2 | 69.1 | 33.5 | 63.9 |
| Uground-V1-7B | 87.1 | 70.3 | 31.1 | 62.8 |
| InfiGUI-R1 | 87.5 | 69.7 | 29.6 | 62.3 |
| UI-TARS-7B | 89.5 | 61.4 | 35.7 | 62.2 |
| Qwen2.5-VL-72B | 87.1 | 51.4 | 43.6 | 60.7 |
| UI-I2E-VLM-7B | 82.5 | 69.5 | 23.6 | 58.5 |
| UI-TARS-2B | 82.3 | 62 | 27.7 | 57.3 |
| Qwen2.5-VL-7B | 84.7 | 53.8 | 29 | 55.8 |
| OmniParser-V2 | 72 | 54.8 | 39.6 | 55.5 |
| Uground-V1-2B | 78.8 | 57.4 | 26.6 | 54.3 |
| OS-Atlas-7B | 82.5 | 58.6 | 18.9 | 53.3 |
| **UI-R1-3B** | 83.3 | 58.5 | 17.8 | 53.2 |
| UGround-7B | 74.1 | 54.2 | 16.5 | 48.3 |
| UI-I2E-VLM-4B | 70.4 | 53.4 | 12.2 | 45.3 |
| OmniParser | 73.9 | 53.1 | 8.3 | 45.1 |
| ShowUI-2B | 76.8 | 41.5 | 7.7 | 42 |
| Qwen2.5-VL-3B | 55.5 | 41.7 | 23.9 | 41.3 |
| Aguvis-7B | 84.4 | 53.2 | 22.9 | 40.4 |
| OS-Atlas-4B | 70.1 | 44.3 | 3.7 | 39.4 |
| Qwen2-VL-7B | 42.6 | 48.7 | 1.6 | 31 |
| Seeclick | 55.8 | 26.4 | 1.1 | 27.8 |
| InternVL2-4B | 4.2 | 0.9 | 0.3 | 1.8 |
## Evaluation Code for GUI Grounding
1. Generate a prediction with UI-R1-E-3B:
```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

# Load model and processor. `args.model_path`, `ori_processor_path`, `rank`,
# `task_prompt`, `image_path`, and `extract_coord` are defined by the
# surrounding evaluation script.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    args.model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="cpu",
)
model = model.to(torch.device(rank))  # move to the GPU assigned to this rank
model = model.eval()
processor = AutoProcessor.from_pretrained(ori_processor_path)

# Prompt template: the model answers directly, without a thinking trace.
question_template = (
    f"In this UI screenshot, I want to perform the command '{task_prompt}'.\n"
    "Please provide the action to perform (enumerate in ['click'])"
    "and the coordinate where the cursor is moved to(integer) if click is performed.\n"
    "Output the final answer in <answer> </answer> tags directly."
    "The output answer format should be as follows:\n"
    "<answer>[{'action': 'click', 'coordinate': [x, y]}]</answer>\n"
    "Please strictly follow the format."
)

query = '<image>\n' + question_template
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": query},
        ],
    }
]

# Build the chat prompt and preprocess the screenshot.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)  # keep inputs on the same device as the model

# Generate, then strip the prompt tokens from the output before decoding.
generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

# Parse the predicted [x, y] click point from the <answer> ... </answer> block.
pred_coord, _ = extract_coord(response)
```
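The helper `extract_coord` comes from the UI-R1 evaluation scripts. A minimal sketch of what it is assumed to do (pull the first `[x, y]` pair out of the `<answer>` block) could look like this; the exact return convention is an assumption based on the call above:
```python
import ast
import re

def extract_coord(response: str):
    """Sketch of a parser for '<answer>[{'action': 'click', 'coordinate': [x, y]}]</answer>'.

    Returns (coordinate, success_flag). Not the official helper from the repo.
    """
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return None, False
    try:
        actions = ast.literal_eval(match.group(1).strip())
        x, y = actions[0]["coordinate"]
        return [int(x), int(y)], True
    except (ValueError, SyntaxError, KeyError, IndexError, TypeError):
        return None, False
```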
2. Rescale the predicted coordinate back to the original image resolution:
```python
from PIL import Image

# The model predicts coordinates in the resized (smart_resize) pixel space,
# so map them back to the original resolution.
image = Image.open(image_path)
origin_width, origin_height = image.size
resized_height, resized_width = smart_resize(origin_height, origin_width, max_pixels=12845056)
scale_x = origin_width / resized_width
scale_y = origin_height / resized_height
pred_coord[0] = int(pred_coord[0] * scale_x)
pred_coord[1] = int(pred_coord[1] * scale_y)
```
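For grounding benchmarks such as ScreenSpotV2, a prediction is typically scored as correct when the rescaled click point falls inside the target element's bounding box. A minimal sketch of that check follows; the `[x1, y1, x2, y2]` box format is an assumption, not something specified in this card:
```python
def is_hit(pred_coord, bbox):
    """Return True if the predicted point lies inside the ground-truth box.

    bbox is assumed to be [x1, y1, x2, y2] in original-image pixels.
    """
    x, y = pred_coord
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2
```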
The `smart_resize` function is taken from Qwen2-VL:
```python
import math
def smart_resize(
height: int, width: int, factor: int = 28, min_pixels: int = 56 * 56, max_pixels: int = 14 * 14 * 4 * 1280
):
"""Rescales the image so that the following conditions are met:
1. Both dimensions (height and width) are divisible by 'factor'.
2. The total number of pixels is within the range ['min_pixels', 'max_pixels'].
3. The aspect ratio of the image is maintained as closely as possible.
"""
if height < factor or width < factor:
raise ValueError(f"height:{height} or width:{width} must be larger than factor:{factor}")
elif max(height, width) / min(height, width) > 200:
raise ValueError(
f"absolute aspect ratio must be smaller than 200, got {max(height, width) / min(height, width)}"
)
h_bar = round(height / factor) * factor
w_bar = round(width / factor) * factor
if h_bar * w_bar > max_pixels:
beta = math.sqrt((height * width) / max_pixels)
h_bar = math.floor(height / beta / factor) * factor
w_bar = math.floor(width / beta / factor) * factor
elif h_bar * w_bar < min_pixels:
beta = math.sqrt(min_pixels / (height * width))
h_bar = math.ceil(height * beta / factor) * factor
w_bar = math.ceil(width * beta / factor) * factor
return h_bar, w_bar
```
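As a quick sanity check of the rescaling step (values follow directly from the function above), a 1920x1080 screenshot is snapped to the nearest multiples of 28, and a point at the centre of the resized image maps back to the centre of the original:
```python
# Example: a 1920x1080 screenshot under the settings used above.
resized_height, resized_width = smart_resize(1080, 1920, max_pixels=12845056)
print(resized_height, resized_width)  # 1092 1932 (both divisible by 28)

scale_x, scale_y = 1920 / resized_width, 1080 / resized_height
print(int(966 * scale_x), int(546 * scale_y))  # 960 540
```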