nielsr (HF Staff) committed · verified
Commit dff9ba0 · 1 Parent(s): 43d71c3

Improve model card: Add metadata, links, and detailed introduction


This PR improves the model card by:
- Adding the `pipeline_tag: image-text-to-text`, ensuring the model can be found via the correct pipeline filter on the Hub.
- Adding relevant `tags: gui-agent, visual-grounding, reinforcement-learning` for better discoverability.
- Specifying the `license: apache-2.0`.
- Adding a prominent top-level title `# GTA1: GUI Test-time Scaling Agent`.
- Including direct links to the paper and code repository at the top of the card.
- Expanding the "Introduction" section with a more detailed overview of the model from the paper's abstract and GitHub README.
- Removing the boilerplate "File information" section.

Files changed (1)
  1. README.md +41 -6
README.md CHANGED
@@ -1,13 +1,24 @@
  ---
  library_name: transformers
- tags: []
  ---

- # Introduction

- Reinforcement learning (RL) (e.g., GRPO) helps with grounding because of its inherent objective alignment—rewarding successful clicks—rather than encouraging long textual Chain-of-Thought (CoT) reasoning. Unlike approaches that rely heavily on verbose CoT reasoning, GRPO directly incentivizes actionable and grounded responses. Based on findings from our [blog](https://huggingface.co/blog/HelloKKMe/grounding-r1), we share state-of-the-art GUI grounding models trained using GRPO.

- # Performance

  We follow the standard evaluation protocol and benchmark our model on three challenging datasets. Our method consistently achieves the best results among all open-source model families. Below are the comparative results:

@@ -36,7 +47,7 @@ We follow the standard evaluation protocol and benchmark our model on three chal
  > - UI-TARS-1.5 7B, Qwen2.5-VL-32B-Instruct, and Qwen2.5-VL-72B-Instruct are applied as our baseline models.
  > - ∆ indicates the performance improvement (∆) of our model compared to its baseline.

- # Inference
  Below is a code snippet demonstrating how to run inference using a trained model.

  ```python
@@ -125,4 +136,28 @@ pred_y*=scale_y
  print(pred_x,pred_y)
  ```

- Refer to our [code](https://github.com/Yan98/GTA1) for more details.
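
The removed Introduction above attributes GRPO's grounding gains to its objective alignment: the policy is rewarded for successful clicks rather than for long chain-of-thought text. As an illustrative sketch only (nothing here is taken from the GTA1 codebase; the bounding-box convention and function names are assumptions), such a click reward can be as simple as checking whether the predicted point falls inside the target element:

```python
def click_reward(pred_x: float, pred_y: float, target_bbox: tuple) -> float:
    """Binary grounding reward: 1.0 if the predicted click lands inside the
    target element's bounding box (x1, y1, x2, y2), else 0.0."""
    x1, y1, x2, y2 = target_bbox
    return 1.0 if (x1 <= pred_x <= x2 and y1 <= pred_y <= y2) else 0.0

# A GRPO-style trainer would score each sampled response with a reward of
# this shape instead of rewarding verbose textual reasoning.
print(click_reward(105.0, 42.0, (100, 30, 180, 60)))  # -> 1.0
```
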
  ---
  library_name: transformers
+ tags:
+ - gui-agent
+ - visual-grounding
+ - reinforcement-learning
+ pipeline_tag: image-text-to-text
+ license: apache-2.0
  ---

+ # GTA1: GUI Test-time Scaling Agent

+ 📚 [Paper](https://huggingface.co/papers/2507.05791) | 💻 [Code](https://github.com/Yan98/GTA1) | 📝 [Blog](https://huggingface.co/blog/HelloKKMe/grounding-r1)

+ ## Introduction
+
+ Graphical user interface (GUI) agents autonomously operate across platforms (e.g., Linux) to complete tasks by interacting with visual elements. Specifically, a user instruction is decomposed into a sequence of action proposals, each corresponding to an interaction with the GUI. After each action, the agent observes the updated GUI environment to plan the next step. However, two main challenges arise: i) resolving ambiguity in task planning (i.e., the action proposal sequence), where selecting an appropriate plan is non-trivial, as many valid ones may exist; ii) accurately grounding actions in complex and high-resolution interfaces, i.e., precisely interacting with visual targets.
+
+ This paper investigates the two aforementioned challenges with our GUI Test-time Scaling Agent, namely GTA1. First, to select the most appropriate action proposal, we introduce a test-time scaling method. At each step, we sample multiple candidate action proposals and leverage a judge model to evaluate and select the most suitable one. It trades off computation for better decision quality by concurrent sampling, shortening task execution steps, and improving overall performance. Second, we propose a model that achieves improved accuracy when grounding the selected action proposal to its corresponding visual elements. Our key insight is that reinforcement learning (RL) facilitates visual grounding through inherent objective alignments, rewarding successful clicks on interface elements. Experimentally, our method establishes state-of-the-art performance across diverse benchmarks. For example, GTA1-7B achieves 50.1%, 92.4%, and 67.7% accuracies on Screenspot-Pro, Screenspot-V2, and OSWorld-G, respectively. When paired with a planner applying our test-time scaling strategy, it exhibits state-of-the-art agentic performance (e.g., 45.2% task success rate on OSWorld).
+
+ ## Performance

  We follow the standard evaluation protocol and benchmark our model on three challenging datasets. Our method consistently achieves the best results among all open-source model families. Below are the comparative results:

  > - UI-TARS-1.5 7B, Qwen2.5-VL-32B-Instruct, and Qwen2.5-VL-72B-Instruct are applied as our baseline models.
  > - ∆ indicates the performance improvement (∆) of our model compared to its baseline.

+ ## Inference
  Below is a code snippet demonstrating how to run inference using a trained model.

  ```python

  print(pred_x,pred_y)
  ```
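
The diff shows only the tail of the README's inference snippet: a `pred_y*=scale_y` rescaling step (visible in the hunk header earlier) and the final coordinate print. Purely as a hedged sketch of that last step, assuming the model predicts click coordinates on a resized screenshot that must be mapped back to the original resolution (all sizes and values below are made up, not from the repository):

```python
# Hypothetical sketch: rescale a click predicted on a resized screenshot
# back to the original screen resolution.
orig_w, orig_h = 2560, 1440          # original screenshot size (assumed)
resized_w, resized_h = 1288, 724     # size fed to the model (assumed)

scale_x = orig_w / resized_w
scale_y = orig_h / resized_h

pred_x, pred_y = 644.0, 362.0        # model output in resized-image coordinates
pred_x *= scale_x
pred_y *= scale_y
print(pred_x, pred_y)                # coordinates on the original screenshot
```
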

+ ## Agent Performance
+
+ Refer to an inference example [here](https://github.com/xlang-ai/OSWorld/pull/246/files#diff-2b758e4fafd9a52ee08bd6072f64297e4d880193fcf3f0e480da954a6711afa7).
+
+ ## Contact
+
+ Please contact `[email protected]` for any queries.
+
+ ## Acknowledgement
+
+ We thank the open-source projects: [VLM-R1](https://github.com/om-ai-lab/VLM-R1), [Jedi](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/jedi_7b_agent.py), and [Agent-S2](https://github.com/simular-ai/Agent-S).
+
+ ## Citation
+ If you use this repository or find it helpful in your research, please cite it as follows:
+ ```bibtex
+ @misc{yang2025gta1guitesttimescaling,
+   title={GTA1: GUI Test-time Scaling Agent},
+   author={Yan Yang and Dongxu Li and Yutong Dai and Yuhao Yang and Ziyang Luo and Zirui Zhao and Zhiyuan Hu and Junzhe Huang and Amrita Saha and Zeyuan Chen and Ran Xu and Liyuan Pan and Caiming Xiong and Junnan Li},
+   year={2025},
+   eprint={2507.05791},
+   archivePrefix={arXiv},
+   primaryClass={cs.AI},
+   url={https://arxiv.org/abs/2507.05791},
+ }
+ ```
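
The Introduction added in this commit describes GTA1's test-time scaling: at each step, several candidate action proposals are sampled concurrently and a judge model selects the most suitable one. Below is a minimal sketch of that selection loop, not code from the GTA1 repository; `planner`, `judge`, and their `propose`/`score` methods are hypothetical placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

def propose_actions(planner, observation, instruction, n_samples=8):
    """Sample n candidate action proposals concurrently (test-time scaling)."""
    with ThreadPoolExecutor(max_workers=n_samples) as pool:
        futures = [
            pool.submit(planner.propose, observation, instruction)
            for _ in range(n_samples)
        ]
        return [f.result() for f in futures]

def select_action(judge, observation, instruction, candidates):
    """Let a judge model score each candidate and return the highest-rated one."""
    scores = [judge.score(observation, instruction, c) for c in candidates]
    return candidates[scores.index(max(scores))]

# One step of the agent loop described in the Introduction:
# candidates = propose_actions(planner, screenshot, user_instruction)
# action = select_action(judge, screenshot, user_instruction, candidates)
# environment.execute(action)  # then observe the updated GUI and repeat
```

Sampling candidates in parallel trades extra computation for better per-step decisions, which is the trade-off the Introduction describes.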