---
tags:
- trl
- sft
license: mit
datasets:
- GUIrilla/GUIrilla-Task
---

# GUIrilla-See-3B

*Vision–language grounding for graphical user interfaces*

---

## Summary

GUIrilla-See-3B is a 3-billion-parameter **Qwen 2.5-VL** model fine-tuned to locate on-screen elements in macOS GUIs. Given a screenshot and a natural-language task, the model returns a single point **(x, y)** that lies at (or very near) the centre of the referenced region.

---

## Quick-start

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

# Load the model and processor from the Hugging Face Hub.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "GUIrilla/GUIrilla-See-3B",
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",  # requires flash-attn; omit if unavailable
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "GUIrilla/GUIrilla-See-3B",
    trust_remote_code=True,
    use_fast=True,
)

image = Image.open("screenshot.png")
task = "the search field in the top-right corner"

conversation = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text",
         "text": (
             "Your task is to help the user identify the precise coordinates "
             "(x, y) of a specific area/element/object on the screen based on "
             "a description.\n"
             "- Your response should aim to point to the centre or a representative "
             "point within the described area/element/object as accurately as possible.\n"
             "- If the description is unclear or ambiguous, infer the most relevant area "
             "or element based on its likely context or purpose.\n"
             "- Your answer should be a single string (x, y) corresponding to the point "
             "of interest.\n"
             f"\nDescription: {task}"
             "\nAnswer:"
         )},
    ],
}]

# Render the chat template and batch the image alongside the prompt.
text = processor.apply_chat_template(conversation, tokenize=False,
                                     add_generation_prompt=True)
inputs = processor(text=[text], images=[image],
                   return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=16, num_beams=3)

# Decode only the newly generated tokens (drop the prompt).
generated_ids = output_ids[:, inputs.input_ids.shape[1]:]
answer = processor.batch_decode(generated_ids,
                                skip_special_tokens=True)[0]
print("Predicted click:", answer)  # e.g. "(812, 115)"
```
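
The model returns the point as a plain string. A minimal parsing sketch (the `parse_point` helper below is an illustrative addition, not part of the repository):

```python
import re

def parse_point(answer: str) -> tuple[int, int]:
    """Extract the first "(x, y)" pair from the generated text."""
    match = re.search(r"\((\d+)\s*,\s*(\d+)\)", answer)
    if match is None:
        raise ValueError(f"no coordinate pair found in {answer!r}")
    return int(match.group(1)), int(match.group(2))

x, y = parse_point("(812, 115)")
print(x, y)  # 812 115
```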

---

## Training Data

Trained on [GUIrilla-Task](https://huggingface.co/datasets/GUIrilla/GUIrilla-Task); a loading snippet follows the split summary below.

* **Train data:** 25,606 tasks across 881 macOS applications (5% of the applications held out for validation)
* **Test data:** 1,565 tasks across 227 macOS applications
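
For a quick look at the data, the dataset can be loaded with the `datasets` library; the split name below is an assumption, so check the dataset card for the exact layout:

```python
from datasets import load_dataset

# Assumed split name; see the dataset card for the actual configuration.
ds = load_dataset("GUIrilla/GUIrilla-Task", split="train")
print(ds)      # features and row count
print(ds[0])   # one task example
```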

---

## Training Procedure

* 2 epochs of LoRA fine-tuning on 2 × H100 80 GB GPUs.
* Optimiser: AdamW (β₁ = 0.9, β₂ = 0.95), LR = 2e-5 with cosine decay and a warmup ratio of 0.05 (see the configuration sketch below).
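
A minimal sketch of how such a run might be configured with TRL and PEFT. Only the epoch count, optimiser betas, learning rate, scheduler, and warmup ratio come from the bullets above; the LoRA rank, alpha, and target modules are illustrative assumptions.

```python
from peft import LoraConfig
from trl import SFTConfig

# LoRA settings below are assumptions, not the released configuration.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Hyperparameters taken from the Training Procedure bullets above.
training_args = SFTConfig(
    num_train_epochs=2,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    adam_beta1=0.9,
    adam_beta2=0.95,
    output_dir="guirilla-see-3b-sft",
)
```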

---

## Evaluation

| Split | Success rate (%) |
| ----- | ---------------- |
| Test  | **73.48**        |
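
Success rate is presumably the usual GUI-grounding criterion, where a prediction counts as a hit when the predicted point falls inside the target element's bounding box; a sketch under that assumption:

```python
def success_rate(preds, boxes):
    """Fraction of predicted points that land inside their target boxes.

    preds: list of (x, y) points; boxes: list of (left, top, right, bottom).
    Point-in-box is an assumed convention; the released evaluation may
    define success differently.
    """
    hits = sum(
        left <= x <= right and top <= y <= bottom
        for (x, y), (left, top, right, bottom) in zip(preds, boxes)
    )
    return hits / len(preds)

print(success_rate([(812, 115)], [(790, 100, 900, 130)]))  # 1.0
```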

---

## Ethical & Safety Notes

* Always sandbox the model, or require operator confirmation, when connecting it to real GUIs (a minimal confirmation sketch follows below).
* Screenshots may reveal sensitive data; ensure compliance with applicable privacy regulations.
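
As a hypothetical illustration of the confirmation-step advice, with `send_click` standing in for whatever automation backend you use:

```python
def confirmed_click(x: int, y: int, send_click) -> None:
    # Ask the operator before dispatching a model-predicted click.
    if input(f"Click at ({x}, {y})? [y/N] ").strip().lower() == "y":
        send_click(x, y)

# Harmless stub in place of a real GUI backend:
confirmed_click(812, 115, lambda x, y: print(f"clicked ({x}, {y})"))
```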

---

## License

MIT (see `LICENSE`).