This model was converted with mlx_vlm from HelloKKMe/GTA1-7B.

Model Description

GTA1-7B is a state-of-the-art GUI grounding model, trained from UI-TARS-1.5-7B for GUI agent tasks. It establishes state-of-the-art performance across diverse benchmarks, achieving 50.1%, 92.4%, and 67.7% accuracy on the ScreenSpot-Pro, ScreenSpot-V2, and OSWorld-G benchmarks, respectively.

| Model | Size | Open Source | ScreenSpot-V2 | ScreenSpot-Pro | OSWorld-G |
|---|---|---|---|---|---|
| OpenAI CUA | — | ❌ | 87.9 | 23.4 | — |
| Claude 3.7 | — | ❌ | 87.6 | 27.7 | — |
| JEDI-7B | 7B | ✅ | 91.7 | 39.5 | 54.1 |
| SE-GUI | 7B | ✅ | 90.3 | 47.0 | — |
| UI-TARS | 7B | ✅ | 91.6 | 35.7 | 47.5 |
| UI-TARS-1.5* | 7B | ✅ | 89.7* | 42.0* | 64.2* |
| UGround-v1-7B | 7B | ✅ | — | 31.1 | 36.4 |
| Qwen2.5-VL-32B-Instruct | 32B | ✅ | 91.9* | 48.0 | 59.6* |
| UGround-v1-72B | 72B | ✅ | — | 34.5 | — |
| Qwen2.5-VL-72B-Instruct | 72B | ✅ | 94.00* | 53.3 | 62.2* |
| UI-TARS | 72B | ✅ | 90.3 | 38.1 | — |
| GTA1 (Ours) | 7B | ✅ | 92.4 (Δ +2.7) | 50.1 (Δ +8.1) | 67.7 (Δ +3.5) |
| GTA1 (Ours) | 32B | ✅ | 93.2 (Δ +1.3) | 53.6 (Δ +5.6) | 61.9 (Δ +2.3) |
| GTA1 (Ours) | 72B | ✅ | 94.8 (Δ +0.8) | 58.4 (Δ +5.1) | 66.7 (Δ +4.5) |

Note:

  • The base models of GTA1-32B/72B are Qwen2.5-VL-32B-Instruct and Qwen2.5-VL-72B-Instruct.

Quick Start

```shell
pip install -U mlx-vlm

python -m mlx_vlm.generate --model mlx-community/GTA1-7B-4bit \
  --max-tokens 1024 \
  --temperature 0.0 \
  --prompt "List all contacts' names and their corresponding grounding boxes ([x1, y1, x2, y2]) from the left sidebar of the IM chat interface, and return the results in JSON format." \
  --image https://wechat.qpic.cn/uploads/2016/05/WeChat-Windows-2.11.jpg
```
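The command above asks the model to return contact names with their grounding boxes as JSON. A minimal sketch of post-processing that reply into click targets is shown below; the exact JSON schema (a name-to-box mapping, or a list of objects with hypothetical `name`/`box` keys) is an assumption based on the prompt, not a documented output format, so adapt the parsing to what the model actually emits.

```python
import json


def parse_grounding_boxes(raw_output: str) -> dict:
    """Extract {name: [x1, y1, x2, y2]} pairs from the model's JSON reply.

    The model may wrap its JSON in a markdown code fence, so strip that first.
    """
    text = raw_output.strip()
    if text.startswith("```"):
        # Drop the opening fence line (possibly "```json") and the closing fence.
        text = text.split("\n", 1)[1].rsplit("```", 1)[0]
    data = json.loads(text)
    if isinstance(data, dict):
        # Shape A (assumed): {"Alice": [x1, y1, x2, y2], ...}
        return {name: [int(v) for v in box] for name, box in data.items()}
    # Shape B (assumed): [{"name": "Alice", "box": [x1, y1, x2, y2]}, ...]
    return {item["name"]: [int(v) for v in item["box"]] for item in data}


def box_center(box: list) -> tuple:
    """A natural click point for a grounded element: the box center."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) // 2, (y1 + y2) // 2)
```

For example, `parse_grounding_boxes('{"Alice": [10, 20, 110, 60]}')` yields `{"Alice": [10, 20, 110, 60]}`, whose center `(60, 40)` can be fed to an automation tool as a click coordinate.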