
UGround (The Initial LLaVA-based Version)

Update: We have trained stronger models based on Qwen2-VL with the same data. We suggest using them instead for better performance and more convenient training, inference and deployment.

UGround is a strong GUI visual grounding model trained with a simple recipe. Check our homepage and paper for more details. This work is a collaboration between OSUNLP and Orby AI.

Models

Release Plan

  • Model Weights
    • Initial V1 (the one used in the paper)
    • Qwen2-VL-based V1
      • 2B
      • 7B
      • 72B
    • V1.1
  • Code
    • Inference Code of UGround
    • Offline Experiments
      • ScreenSpot (along with referring expressions generated by GPT-4/4o)
      • Multimodal-Mind2Web
      • OmniAct
      • Android Control
    • Online Experiments
      • Mind2Web-Live-SeeAct-V
      • AndroidWorld-SeeAct-V
  • Data-V1
    • Data Examples
    • Data Construction Scripts
    • Guidance of Open-source Data
  • Data-V1.1
  • Online Demo (HF Spaces)

Main Results

GUI Visual Grounding: ScreenSpot (Standard Setting)

| Grounding Model | Arch | SFT data | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | | | 22.6 | 24.5 | 20.2 | 11.8 | 9.2 | 8.8 | 16.2 |
| GPT-4o | | | 20.2 | 24.9 | 21.1 | 23.6 | 12.2 | 7.8 | 18.3 |
| MiniGPT-v2 | MiniGPT-v2 | | 8.4 | 6.6 | 6.2 | 2.9 | 6.5 | 3.4 | 5.7 |
| Groma | Groma | | 10.3 | 2.6 | 4.6 | 4.3 | 5.7 | 3.4 | 5.2 |
| Fuyu | Fuyu | | 41.0 | 1.3 | 33.0 | 3.6 | 33.9 | 4.4 | 19.5 |
| Qwen-VL | Qwen-VL | | 9.5 | 4.8 | 5.7 | 5.0 | 3.5 | 2.4 | 5.2 |
| SeeClick | Qwen-VL | SeeClick | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | 53.4 |
| Qwen-GUI | Qwen-VL | GUICourse | 52.4 | 10.9 | 45.9 | 5.7 | 43.0 | 13.6 | 28.6 |
| UGround-V1 | LLaVA-UGround-V1 | UGround-V1 | 82.8 | 60.3 | 82.5 | 63.6 | 80.4 | 70.4 | 73.3 |
| Qwen2-VL | Qwen2-VL | | 61.3 | 39.3 | 52.0 | 45.0 | 33.0 | 21.8 | 42.1 |
| Aguvis-G-7B | Qwen2-VL | Aguvis-Stage-1 | 88.3 | 78.2 | 88.1 | 70.7 | 85.7 | 74.8 | 81.0 |
| Aguvis-7B | Qwen2-VL | Aguvis-Stage-1&2 | 95.6 | 77.7 | 93.8 | 67.1 | 88.3 | 75.2 | 83.0 |
| OS-Atlas-Base-4B | InternVL | OS-Atlas | 85.7 | 58.5 | 72.2 | 45.7 | 82.6 | 63.1 | 68.0 |
| OS-Atlas-Base-7B | Qwen2-VL | OS-Atlas | 93.0 | 72.9 | 91.8 | 62.9 | 90.9 | 74.3 | 81.0 |
| ShowUI-G | ShowUI | ShowUI | 91.6 | 69.0 | 81.8 | 59.0 | 83.0 | 65.5 | 75.0 |
| ShowUI | ShowUI | ShowUI | 92.3 | 75.5 | 76.3 | 61.1 | 81.7 | 63.6 | 75.1 |
| Iris | Iris | SeeClick | 85.3 | 64.2 | 86.7 | 57.5 | 82.6 | 71.2 | 74.6 |
| Aria-UI | Aria | Aria-UI | 92.3 | 73.8 | 93.3 | 64.3 | 86.5 | 76.2 | 81.1 |
| UGround-V1-2B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 89.4 | 72.0 | 88.7 | 65.7 | 81.3 | 68.9 | 77.7 |
| UGround-V1-7B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 93.0 | 79.9 | 93.8 | 76.4 | 90.9 | 84.0 | 86.3 |
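
The Avg column appears to be the unweighted mean of the six split scores; the rows above are consistent with that reading. The sketch below re-derives it for three rows, with scores copied from the table. The function name `macro_average` and the one-decimal rounding are assumptions made for illustration, not part of any UGround code.

```python
from statistics import mean

# Split scores copied from the ScreenSpot (Standard Setting) table, in the order:
# Mobile-Text, Mobile-Icon, Desktop-Text, Desktop-Icon, Web-Text, Web-Icon.
ROWS = {
    "GPT-4": [22.6, 24.5, 20.2, 11.8, 9.2, 8.8],
    "UGround-V1": [82.8, 60.3, 82.5, 63.6, 80.4, 70.4],
    "UGround-V1-7B (Qwen2-VL)": [93.0, 79.9, 93.8, 76.4, 90.9, 84.0],
}


def macro_average(scores):
    """Unweighted mean over the splits, rounded to one decimal (assumed convention)."""
    return round(mean(scores), 1)


if __name__ == "__main__":
    for name, scores in ROWS.items():
        print(f"{name}: {macro_average(scores)}")
```

An unweighted mean gives each split equal weight regardless of how many samples it contains, so it can differ from accuracy computed over all samples pooled together.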

GUI Visual Grounding: ScreenSpot (Agent Setting)

| Planner | Grounding Model | Arch | SFT data | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4o | Qwen-VL | Qwen-VL | | 21.3 | 21.4 | 18.6 | 10.7 | 9.1 | 5.8 | 14.5 |
| GPT-4o | SeeClick | Qwen-VL | SeeClick | 81.0 | 59.8 | 69.6 | 33.6 | 43.9 | 26.2 | 52.4 |
| GPT-4o | Qwen-GUI | Qwen-VL | GUICourse | 67.8 | 24.5 | 53.1 | 16.4 | 50.4 | 18.5 | 38.5 |
| GPT-4o | UGround-V1 | LLaVA-UGround-V1 | UGround-V1 | 93.4 | 76.9 | 92.8 | 67.9 | 88.7 | 68.9 | 81.4 |
| GPT-4o | OS-Atlas-Base-4B | InternVL | OS-Atlas | 94.1 | 73.8 | 77.8 | 47.1 | 86.5 | 65.3 | 74.1 |
| GPT-4o | OS-Atlas-Base-7B | Qwen2-VL | OS-Atlas | 93.8 | 79.9 | 90.2 | 66.4 | 92.6 | 79.1 | 83.7 |
| GPT-4o | UGround-V1-2B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 94.1 | 77.7 | 92.8 | 63.6 | 90.0 | 70.9 | 81.5 |
| GPT-4o | UGround-V1-7B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 94.1 | 79.9 | 93.3 | 73.6 | 89.6 | 73.3 | 84.0 |
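
Several grounding models appear in both ScreenSpot tables, so the two settings can be compared directly. The sketch below computes the change in Avg when moving from the standard setting to the agent setting with a GPT-4o planner, using only numbers copied from the tables. The helper name `setting_delta` is hypothetical.

```python
# Avg scores copied from the two ScreenSpot tables for grounding models that
# appear in both: (standard setting, agent setting with a GPT-4o planner).
AVG = {
    "Qwen-VL": (5.2, 14.5),
    "SeeClick": (53.4, 52.4),
    "Qwen-GUI": (28.6, 38.5),
    "UGround-V1": (73.3, 81.4),
    "OS-Atlas-Base-4B": (68.0, 74.1),
    "OS-Atlas-Base-7B": (81.0, 83.7),
    "UGround-V1-2B (Qwen2-VL)": (77.7, 81.5),
    "UGround-V1-7B (Qwen2-VL)": (86.3, 84.0),
}


def setting_delta(standard, agent):
    """Agent-setting Avg minus standard-setting Avg, in percentage points."""
    return round(agent - standard, 1)


if __name__ == "__main__":
    for model, (standard, agent) in AVG.items():
        print(f"{model}: {setting_delta(standard, agent):+.1f}")
```

On these numbers, the initial LLaVA-based UGround-V1 gains 8.1 points in the agent setting, while SeeClick and UGround-V1-7B (Qwen2-VL) score slightly lower than in the standard setting.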

Citation Information

If you find this work useful, please consider citing our papers:

@article{gou2024uground,
        title={Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents},
        author={Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su},
        journal={arXiv preprint arXiv:2410.05243},
        year={2024},
        url={https://arxiv.org/abs/2410.05243},
      }

@article{zheng2023seeact,
        title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
        author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
        journal={arXiv preprint arXiv:2401.01614},
        year={2024},
      }

Model size: 7.06B params (Safetensors, tensor type FP16)