---
tags:
- trl
- sft
license: mit
datasets:
- GUIrilla/GUIrilla-Task
---

# GUIrilla-See-3B

*Vision–language grounding for graphical user interfaces*

---

## Summary

GUIrilla-See-3B is a 3-billion-parameter **Qwen 2.5-VL** model fine-tuned to locate on-screen elements in macOS GUIs. Given a screenshot and a natural-language task, the model returns a single point **(x, y)** that lies at (or very near) the centre of the referenced region.

---

## Quick-start

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

# Load the model and processor from the Hugging Face Hub.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "GUIrilla/GUIrilla-See-3B",
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",  # requires flash-attn; omit if unavailable
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "GUIrilla/GUIrilla-See-3B",
    trust_remote_code=True,
    use_fast=True,
)

image = Image.open("screenshot.png")
task = "the search field in the top-right corner"

conversation = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text",
         "text": (
             "Your task is to help the user identify the precise coordinates "
             "(x, y) of a specific area/element/object on the screen based on "
             "a description.\n"
             "- Your response should aim to point to the centre or a representative "
             "point within the described area/element/object as accurately as possible.\n"
             "- If the description is unclear or ambiguous, infer the most relevant area "
             "or element based on its likely context or purpose.\n"
             "- Your answer should be a single string (x, y) corresponding to the point "
             "of interest.\n"
             f"\nDescription: {task}"
             "\nAnswer:"
         )},
    ],
}]

# Render the chat template and batch the image alongside the prompt.
text = processor.apply_chat_template(conversation, tokenize=False,
                                     add_generation_prompt=True)
inputs = processor(text=[text], images=[image],
                   return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=16, num_beams=3)

# Decode only the newly generated tokens (drop the prompt).
generated_ids = output_ids[:, inputs.input_ids.shape[1]:]
answer = processor.batch_decode(generated_ids,
                                skip_special_tokens=True)[0]
print("Predicted click:", answer)  # e.g. "(812, 115)"
```
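
The model returns the point as a plain string. A minimal parsing sketch (the `parse_point` helper below is an illustrative addition, not part of the repository):

```python
import re

def parse_point(answer: str) -> tuple[int, int]:
    """Extract the first "(x, y)" pair from the generated text."""
    match = re.search(r"\((\d+)\s*,\s*(\d+)\)", answer)
    if match is None:
        raise ValueError(f"no coordinate pair found in {answer!r}")
    return int(match.group(1)), int(match.group(2))

x, y = parse_point("(812, 115)")
print(x, y)  # 812 115
```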

---

## Training Data

Trained on [GUIrilla-Task](https://huggingface.co/datasets/GUIrilla/GUIrilla-Task); a loading snippet follows the split summary below.

* **Train data:** 25,606 tasks across 881 macOS applications (5% of the applications held out for validation)
* **Test data:** 1,565 tasks across 227 macOS applications
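
For a quick look at the data, the dataset can be loaded with the `datasets` library; the split name below is an assumption, so check the dataset card for the exact layout:

```python
from datasets import load_dataset

# Assumed split name; see the dataset card for the actual configuration.
ds = load_dataset("GUIrilla/GUIrilla-Task", split="train")
print(ds)      # features and row count
print(ds[0])   # one task example
```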

---

## Training Procedure

* 2 epochs of LoRA fine-tuning on 2 × H100 80 GB GPUs.
* Optimiser: AdamW (β₁ = 0.9, β₂ = 0.95), LR = 2e-5 with cosine decay and a warmup ratio of 0.05 (see the configuration sketch below).
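
A minimal sketch of how such a run might be configured with TRL and PEFT. Only the epoch count, optimiser betas, learning rate, scheduler, and warmup ratio come from the bullets above; the LoRA rank, alpha, and target modules are illustrative assumptions.

```python
from peft import LoraConfig
from trl import SFTConfig

# LoRA settings below are assumptions, not the released configuration.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Hyperparameters taken from the Training Procedure bullets above.
training_args = SFTConfig(
    num_train_epochs=2,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    adam_beta1=0.9,
    adam_beta2=0.95,
    output_dir="guirilla-see-3b-sft",
)
```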

---

## Evaluation

| Split | Success rate (%) |
| ----- | ---------------- |
| Test  | **73.48**        |
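
Success rate is presumably the usual GUI-grounding criterion, where a prediction counts as a hit when the predicted point falls inside the target element's bounding box; a sketch under that assumption:

```python
def success_rate(preds, boxes):
    """Fraction of predicted points that land inside their target boxes.

    preds: list of (x, y) points; boxes: list of (left, top, right, bottom).
    Point-in-box is an assumed convention; the released evaluation may
    define success differently.
    """
    hits = sum(
        left <= x <= right and top <= y <= bottom
        for (x, y), (left, top, right, bottom) in zip(preds, boxes)
    )
    return hits / len(preds)

print(success_rate([(812, 115)], [(790, 100, 900, 130)]))  # 1.0
```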

---

## Ethical & Safety Notes

* Always sandbox the model, or require operator confirmation, when connecting it to real GUIs (a minimal confirmation sketch follows below).
* Screenshots may reveal sensitive data; ensure compliance with applicable privacy regulations.
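
As a hypothetical illustration of the confirmation-step advice, with `send_click` standing in for whatever automation backend you use:

```python
def confirmed_click(x: int, y: int, send_click) -> None:
    # Ask the operator before dispatching a model-predicted click.
    if input(f"Click at ({x}, {y})? [y/N] ").strip().lower() == "y":
        send_click(x, y)

# Harmless stub in place of a real GUI backend:
confirmed_click(812, 115, lambda x, y: print(f"clicked ({x}, {y})"))
```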

---

## License

MIT (see `LICENSE`).