This model was converted with mlx_vlm from HelloKKMe/GTA1-7B.

Model Description

GTA1-7B is a state-of-the-art GUI grounding model, trained from UI-TARS-1.5-7B for GUI agent tasks. It establishes state-of-the-art performance across diverse benchmarks, achieving 50.1%, 92.4%, and 67.7% accuracy on the ScreenSpot-Pro, ScreenSpot-V2, and OSWorld-G benchmarks, respectively.

| Model | Size | Open Source | ScreenSpot-V2 | ScreenSpot-Pro | OSWorld-G |
|---|---|---|---|---|---|
| OpenAI CUA | — | ❌ | 87.9 | 23.4 | — |
| Claude 3.7 | — | ❌ | 87.6 | 27.7 | — |
| JEDI-7B | 7B | ✅ | 91.7 | 39.5 | 54.1 |
| SE-GUI | 7B | ✅ | 90.3 | 47.0 | — |
| UI-TARS | 7B | ✅ | 91.6 | 35.7 | 47.5 |
| UI-TARS-1.5* | 7B | ✅ | 89.7* | 42.0* | 64.2* |
| UGround-v1-7B | 7B | ✅ | — | 31.1 | 36.4 |
| Qwen2.5-VL-32B-Instruct | 32B | ✅ | 91.9* | 48.0 | 59.6* |
| UGround-v1-72B | 72B | ✅ | — | 34.5 | — |
| Qwen2.5-VL-72B-Instruct | 72B | ✅ | 94.00* | 53.3 | 62.2* |
| UI-TARS | 72B | ✅ | 90.3 | 38.1 | — |
| GTA1 (Ours) | 7B | ✅ | 92.4 (Δ +2.7) | 50.1 (Δ +8.1) | 67.7 (Δ +3.5) |
| GTA1 (Ours) | 32B | ✅ | 93.2 (Δ +1.3) | 53.6 (Δ +5.6) | 61.9 (Δ +2.3) |
| GTA1 (Ours) | 72B | ✅ | 94.8 (Δ +0.8) | 58.4 (Δ +5.1) | 66.7 (Δ +4.5) |

Note:

  • The base models of GTA1-32B/72B are Qwen2.5-VL-32B-Instruct and Qwen2.5-VL-72B-Instruct.

Quick Start

```shell
pip install -U mlx-vlm

python -m mlx_vlm.generate --model mlx-community/GTA1-7B-4bit \
  --max-tokens 1024 \
  --temperature 0.0 \
  --prompt "List all contacts' names and their corresponding grounding boxes ([x1, y1, x2, y2]) from the left sidebar of the IM chat interface, and return the results in JSON format." \
  --image https://wechat.qpic.cn/uploads/2016/05/WeChat-Windows-2.11.jpg
```
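The command above asks the model to return contact names with their grounding boxes as JSON. A minimal sketch of post-processing that reply into click targets is shown below; the exact JSON schema (a name-to-box mapping, or a list of objects with hypothetical `name`/`box` keys) is an assumption based on the prompt, not a documented output format, so adapt the parsing to what the model actually emits.

```python
import json


def parse_grounding_boxes(raw_output: str) -> dict:
    """Extract {name: [x1, y1, x2, y2]} pairs from the model's JSON reply.

    The model may wrap its JSON in a markdown code fence, so strip that first.
    """
    text = raw_output.strip()
    if text.startswith("```"):
        # Drop the opening fence line (possibly "```json") and the closing fence.
        text = text.split("\n", 1)[1].rsplit("```", 1)[0]
    data = json.loads(text)
    if isinstance(data, dict):
        # Shape A (assumed): {"Alice": [x1, y1, x2, y2], ...}
        return {name: [int(v) for v in box] for name, box in data.items()}
    # Shape B (assumed): [{"name": "Alice", "box": [x1, y1, x2, y2]}, ...]
    return {item["name"]: [int(v) for v in item["box"]] for item in data}


def box_center(box: list) -> tuple:
    """A natural click point for a grounded element: the box center."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) // 2, (y1 + y2) // 2)
```

For example, `parse_grounding_boxes('{"Alice": [10, 20, 110, 60]}')` yields `{"Alice": [10, 20, 110, 60]}`, whose center `(60, 40)` can be fed to an automation tool as a click coordinate.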