This model was converted to MLX format with mlx_vlm from HelloKKMe/GTA1-7B.
## Model Description
GTA1-7B is a state-of-the-art GUI grounding model for GUI agent tasks, trained on top of UI-TARS-1.5-7B. It achieves 92.4%, 50.1%, and 67.7% accuracy on the ScreenSpot-V2, ScreenSpot-Pro, and OSWorld-G benchmarks, respectively.
| Model | Size | Open Source | ScreenSpot-V2 | ScreenSpot-Pro | OSWorld-G |
|---|---|---|---|---|---|
| OpenAI CUA | — | ❌ | 87.9 | 23.4 | — |
| Claude 3.7 | — | ❌ | 87.6 | 27.7 | — |
| JEDI-7B | 7B | ✅ | 91.7 | 39.5 | 54.1 |
| SE-GUI | 7B | ✅ | 90.3 | 47.0 | — |
| UI-TARS | 7B | ✅ | 91.6 | 35.7 | 47.5 |
| UI-TARS-1.5* | 7B | ✅ | 89.7* | 42.0* | 64.2* |
| UGround-v1-7B | 7B | ✅ | — | 31.1 | 36.4 |
| Qwen2.5-VL-32B-Instruct | 32B | ✅ | 91.9* | 48.0 | 59.6* |
| UGround-v1-72B | 72B | ✅ | — | 34.5 | — |
| Qwen2.5-VL-72B-Instruct | 72B | ✅ | 94.0* | 53.3 | 62.2* |
| UI-TARS | 72B | ✅ | 90.3 | 38.1 | — |
| GTA1 (Ours) | 7B | ✅ | 92.4 (↑ +2.7) | 50.1 (↑ +8.1) | 67.7 (↑ +3.5) |
| GTA1 (Ours) | 32B | ✅ | 93.2 (↑ +1.3) | 53.6 (↑ +5.6) | 61.9 (↑ +2.3) |
| GTA1 (Ours) | 72B | ✅ | 94.8 (↑ +0.8) | 58.4 (↑ +5.1) | 66.7 (↑ +4.5) |
Note:
- The base models of GTA1-32B/72B are Qwen2.5-VL-32B/72B-Instruct.
## Quick Start
```shell
mlx_vlm.generate --model mlx-community/GTA1-7B-4bit \
  --max-tokens 1024 \
  --temperature 0.0 \
  --prompt "List all contacts' names and their corresponding grounding boxes ([x1, y1, x2, y2]) from the left sidebar of the IM chat interface, return the results in JSON format." \
  --image https://wechat.qpic.cn/uploads/2016/05/WeChat-Windows-2.11.jpg
```
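The prompt above asks for grounding boxes in `[x1, y1, x2, y2]` form. A minimal sketch of post-processing such a response into click targets (the JSON shape and the `parse_grounding`/`click_point` helpers are illustrative assumptions, not a guaranteed output schema of the model):

```python
import json

def parse_grounding(response: str) -> list[dict]:
    """Parse a JSON response listing names and [x1, y1, x2, y2] boxes."""
    return json.loads(response)

def click_point(box: list[int]) -> tuple[int, int]:
    """Center of a [x1, y1, x2, y2] box -- a natural click target for a GUI agent."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) // 2, (y1 + y2) // 2)

# Hypothetical response in the format the prompt requests.
response = '[{"name": "Alice", "box": [12, 88, 180, 132]}]'
for contact in parse_grounding(response):
    print(contact["name"], click_point(contact["box"]))  # Alice (96, 110)
```

In practice the model may wrap the JSON in extra text, so a robust agent would extract the first JSON array from the output before parsing.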