This repository contains the model presented in the paper *GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents*.
Project page: https://github.com/Yuqi-Zhou/GUI-G1
Training Details:
- Training Dataset: We trained Qwen2.5-VL-3B-Instruct on the UI-R1-3B-Train dataset, which contains 101 samples with grounding annotations.
- Other: During training, the vision encoder was frozen. We used a learning rate of 1e-6, a sampling temperature of 0.9, and generated 8 outputs per prompt. Training was conducted on 4 L20 (48 GB) GPUs with 4 samples per GPU, the KL coefficient beta set to 0, and 1 gradient accumulation step, as sketched below.
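The hyperparameters above translate into roughly the following setup. This is an illustrative sketch only: the configuration keys follow common RL fine-tuning conventions and are not taken from the official training script, and the vision-encoder freeze is expressed as a generic parameter filter.

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration

# Illustrative summary of the hyperparameters reported above (not the official config).
train_config = {
    "learning_rate": 1e-6,              # optimizer step size
    "temperature": 0.9,                 # sampling temperature for rollouts
    "num_generations": 8,               # outputs generated per prompt
    "per_device_train_batch_size": 4,   # 4 samples per GPU, on 4 L20 (48 GB) GPUs
    "gradient_accumulation_steps": 1,
    "beta": 0.0,                        # KL coefficient reported as 0
}

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype=torch.bfloat16
)

# Freeze the vision encoder, as described above: in Qwen2.5-VL the vision
# tower's parameters carry "visual" in their names.
for name, param in model.named_parameters():
    if "visual" in name:
        param.requires_grad = False
```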
Benchmark 1: ScreenSpotV2
Model | Inference mode | Mobile-T | Mobile-I | Desktop-T | Desktop-I | Web-T | Web-I | Avg |
---|---|---|---|---|---|---|---|---|
OS-ATLAS-7B | w/o thinking | 95.2 | 75.8 | 90.7 | 63.6 | 90.6 | 77.3 | 84.1 |
UI-TARS-7B | w/o thinking | 95.2 | 79.1 | 90.7 | 68.6 | 90.6 | 78.3 | 84.7 |
UI-R1-3B (v1) | w/ thinking | 96.2 | 84.3 | 92.3 | 63.6 | 89.2 | 75.4 | 85.4 |
GUI-R1-3B | w/ thinking | 97.6 | 78.2 | 94.3 | 64.3 | 91.0 | 72.4 | 85.0 |
UI-R1-3B (v2) | w/ thinking | 97.6 | 79.6 | 92.3 | 67.9 | 88.9 | 77.8 | 85.8 |
UI-R1-E-3B | w/o thinking | 98.2 | 83.9 | 94.8 | 75.0 | 93.2 | 83.7 | 89.5 |
GUI-G1-3B-0.1K | w/o thinking | 98.3 | 93.36 | 92.8 | 80.0 | 88.5 | 79.3 | 89.8 |
Benchmark 2: ScreenSpot-Pro
Model | Inference mode | Average Accuracy↑ |
---|---|---|
UGround-7B | w/o thinking | 16.5 |
OS-ATLAS-7B | w/o thinking | 18.9 |
UI-R1-3B (v1) | w/ thinking | 17.8 |
GUI-R1-3B | w/ thinking | 26.6 |
UI-R1-3B (v2) | w/ thinking | 29.8 |
UI-R1-E-3B | w/o thinking | 33.5 |
GUI-G1-3B-0.1K | w/o thinking | 43.9 |
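Both tables report grounding accuracy. As a point of reference, ScreenSpot-style evaluation typically counts a prediction as correct when the predicted click point falls inside the ground-truth bounding box of the target element. A minimal sketch of that check (helper names are illustrative, not from the official evaluation code):

```python
def point_in_bbox(point, bbox):
    """Return True if (x, y) lies inside bbox = (x1, y1, x2, y2)."""
    x, y = point
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2

def grounding_accuracy(predicted_points, gt_bboxes):
    """Fraction of examples whose predicted click point hits the target box."""
    hits = sum(point_in_bbox(p, b) for p, b in zip(predicted_points, gt_bboxes))
    return hits / len(gt_bboxes)
```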
Evaluation Code for GUI Grounding
Below is a code snippet showing how to use the model for grounding with `transformers` and `qwen_vl_utils`:
```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Default: load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Yuqi-Zhou/GUI-G1-3B-0.1K", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving,
# especially in multi-image and video scenarios.
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "Yuqi-Zhou/GUI-G1-3B-0.1K",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# Default processor
processor = AutoProcessor.from_pretrained("Yuqi-Zhou/GUI-G1-3B-0.1K")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Grounding instruction is:{Question}. Help to locate and output its bbox coordinates using JSON format."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128, use_cache=True)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
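The prompt asks the model to return bbox coordinates as JSON. A minimal post-processing sketch is shown below; it assumes the reply contains a JSON object with a `bbox_2d` field holding `[x1, y1, x2, y2]` pixel coordinates (the exact key and format may differ depending on the prompt and checkpoint) and derives a click point as the box center:

```python
import json
import re

def parse_bbox(reply: str):
    """Extract the first JSON object from the model reply and return its bbox.

    Assumes a field named "bbox_2d" with [x1, y1, x2, y2]; adjust the key if
    your outputs use a different schema.
    """
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        return None
    obj = json.loads(match.group(0))
    if isinstance(obj, dict) and "bbox_2d" in obj:
        return obj["bbox_2d"]
    return None

bbox = parse_bbox(output_text[0])
if bbox is not None:
    x1, y1, x2, y2 = bbox
    click_x, click_y = (x1 + x2) / 2, (y1 + y2) / 2  # click at the box center
    print(f"predicted bbox: {bbox}, click point: ({click_x:.1f}, {click_y:.1f})")
```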
Citation
If you find our work helpful, please consider citing it:
```bibtex
@article{zhou2025gui,
  title={GUI-G1: Understanding r1-zero-like training for visual grounding in gui agents},
  author={Zhou, Yuqi and Dai, Sunhao and Wang, Shuai and Zhou, Kaiwen and Jia, Qinglin and Xu, Jun},
  journal={arXiv preprint arXiv:2505.15810},
  year={2025}
}
```