GUIrilla committed
Commit 8001523 · verified · 1 Parent(s): c579765

Update README.md

Files changed (1)
  1. README.md +109 -3
README.md CHANGED
@@ -7,9 +7,115 @@ tags:
  - trl
  - sft
  license: mit
+ datasets:
+ - GUIrilla/GUIrilla-Task
  ---

- # Model Card for GUIrilla-See-3B

- This model is a fine-tuned version of [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct).
- It has been trained using [TRL](https://github.com/huggingface/trl).
+ # GUIrilla-See-3B
+
+ *Vision–language grounding for graphical user interfaces*
+
+ ---
+
+ ## Summary
+
+ GUIrilla-See-3B is a 3-billion-parameter **Qwen 2.5-VL** model fine-tuned to locate on-screen elements in macOS GUIs.
+ Given a screenshot and a natural-language task, the model returns a single point **(x, y)** that lies at (or very near) the centre of the referenced region.
+
+ ---
+
+ ## Quick-start
+
+ ```python
+ import torch
+ from PIL import Image
+ from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
+
+ # Load the fine-tuned model and its processor from the Hugging Face Hub.
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+     "GUIrilla/GUIrilla-See-3B",
+     torch_dtype="auto",
+     device_map="auto",
+     attn_implementation="flash_attention_2",  # requires the flash-attn package
+     trust_remote_code=True,
+ )
+ processor = AutoProcessor.from_pretrained(
+     "GUIrilla/GUIrilla-See-3B",
+     trust_remote_code=True,
+     use_fast=True,
+ )
+
+ image = Image.open("screenshot.png")
+ task = "the search field in the top-right corner"
+
+ # Single-turn conversation: the screenshot plus the grounding instruction.
+ conversation = [{
+     "role": "user",
+     "content": [
+         {"type": "image", "image": image},
+         {"type": "text",
+          "text": (
+              "Your task is to help the user identify the precise coordinates "
+              "(x, y) of a specific area/element/object on the screen based on "
+              "a description.\n"
+              "- Your response should aim to point to the centre or a representative "
+              "point within the described area/element/object as accurately as possible.\n"
+              "- If the description is unclear or ambiguous, infer the most relevant area "
+              "or element based on its likely context or purpose.\n"
+              "- Your answer should be a single string (x, y) corresponding to the point "
+              "of interest.\n"
+              f"\nDescription: {task}"
+              "\nAnswer:"
+          )},
+     ],
+ }]
+
+ # Render the chat template to a prompt string, then tokenise it with the image.
+ texts = processor.apply_chat_template(conversation, tokenize=False,
+                                       add_generation_prompt=True)
+ image_inputs = [image]
+ inputs = processor(text=texts, images=image_inputs,
+                    return_tensors="pt", padding=True).to(model.device)
+
+ with torch.no_grad():
+     output_ids = model.generate(**inputs, max_new_tokens=16, num_beams=3)
+
+ # Drop the prompt tokens so that only the generated answer is decoded.
+ generated_ids = output_ids[:, inputs.input_ids.shape[1]:]
+ answer = processor.batch_decode(generated_ids,
+                                 skip_special_tokens=True)[0]
+ print("Predicted click:", answer)  # e.g. "(812, 115)"
+ ```
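Reviewer note (not part of the committed README): the quick-start prints the answer as a string such as `"(812, 115)"`, so a caller that wants to click needs numeric coordinates. The sketch below shows one way to parse that string. `parse_point` is a hypothetical helper, and the regular expression assumes the model emits two comma-separated numbers, optionally wrapped in parentheses. The card does not say whether the coordinates refer to the original screenshot or to the processor's resized image, so any rescaling is left to the caller.

```python
import re
from typing import Optional, Tuple

# Hypothetical helper, not part of the model card or of transformers.
# Assumes the answer contains two comma-separated numbers, e.g. "(812, 115)".
_POINT_RE = re.compile(r"\(?\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)?")


def parse_point(answer: str) -> Optional[Tuple[int, int]]:
    """Return (x, y) as integers, or None if no point is found."""
    match = _POINT_RE.search(answer)
    if match is None:
        return None
    x, y = (float(group) for group in match.groups())
    return round(x), round(y)
```

Returning `None` instead of raising lets the caller decide how to treat a malformed answer, for example by counting it as a miss or by retrying generation.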
+
+ ---
+
+ ## Training Data
+
+ Trained on [GUIrilla-Task](https://huggingface.co/datasets/GUIrilla/GUIrilla-Task).
+
+ * **Train data:** 25,606 tasks across 881 macOS applications (5% of the applications are held out for validation)
+ * **Test data:** 1,565 tasks across 227 macOS applications
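Reviewer note (not part of the committed README): holding out 5% of the applications, rather than 5% of the tasks, keeps every task from a given app on one side of the split. The sketch below illustrates that idea only. The `"app"` field name, the seed, and the rounding rule are assumptions, not the authors' actual procedure. If the 5% is taken over all 881 listed applications, it corresponds to roughly 44 apps.

```python
import random
from typing import Dict, List, Tuple


def split_by_app(
    tasks: List[Dict],
    val_fraction: float = 0.05,
    seed: int = 0,
) -> Tuple[List[Dict], List[Dict]]:
    """Hold out a fraction of applications (not tasks) for validation.

    Each task is assumed to be a dict with an "app" key; this field name
    is hypothetical and may differ in the real dataset.
    """
    apps = sorted({task["app"] for task in tasks})
    rng = random.Random(seed)
    rng.shuffle(apps)
    n_val = max(1, round(len(apps) * val_fraction))
    val_apps = set(apps[:n_val])
    train = [task for task in tasks if task["app"] not in val_apps]
    val = [task for task in tasks if task["app"] in val_apps]
    return train, val
```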
+
+ ---
+
+ ## Training Procedure
+
+ * LoRA fine-tuning for 2 epochs on 2 × H100 80 GB GPUs.
+ * Optimiser: AdamW (β₁ = 0.9, β₂ = 0.95), LR = 2e-5 with cosine decay and a warm-up ratio of 0.05.
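Reviewer note (not part of the committed README): a peak learning rate of 2e-5 with cosine decay and a warm-up ratio of 0.05 can be pictured as the per-step schedule sketched below. The sketch assumes linear warm-up from zero and decay to zero at the final step, the shape used by the cosine schedule in `transformers`; the card states neither detail. `lr_at_step` is a hypothetical name.

```python
import math


def lr_at_step(
    step: int,
    total_steps: int,
    peak_lr: float = 2e-5,
    warmup_ratio: float = 0.05,
) -> float:
    """Learning rate at a given optimiser step.

    Assumes linear warm-up over the first warmup_ratio of training,
    then cosine decay from peak_lr down to zero.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With 1,000 total steps, for instance, the first 50 steps ramp the rate up linearly, and the remaining 950 follow the cosine curve downwards.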
+
+ ---
+
+ ## Evaluation
+
+ | Split | Success Rate (%) |
+ | ----- | ---------------- |
+ | Test  | **73.48**        |
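Reviewer note (not part of the committed README): the card does not define "Success Rate". A common convention for GUI grounding is to count a prediction as correct when the predicted point falls inside the target element's bounding box. The sketch below assumes that convention, along with a `(left, top, right, bottom)` box format; both are assumptions, not the authors' evaluation code.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom), assumed format


def point_in_box(point: Point, box: Box) -> bool:
    """True if the point lies inside the box, edges included."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def success_rate(predictions: Sequence[Optional[Point]], boxes: Sequence[Box]) -> float:
    """Percentage of predictions that land inside their target box.

    A prediction of None (e.g. an unparseable answer) counts as a miss.
    """
    if len(predictions) != len(boxes):
        raise ValueError("predictions and boxes must have the same length")
    if not boxes:
        return 0.0
    hits = sum(
        pred is not None and point_in_box(pred, box)
        for pred, box in zip(predictions, boxes)
    )
    return 100.0 * hits / len(boxes)
```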
+
+ ---
+
+ ## Ethical & Safety Notes
+
+ * Always sandbox or use confirmation steps when connecting the model to real GUIs.
+ * Screenshots may reveal sensitive data – ensure compliance with privacy regulations.
+
+ ---
+
+ ## License
+
+ MIT (see `LICENSE`).