base_model:
- Qwen/Qwen2-VL-2B
---

# UGround-V1-2B (Qwen2-VL-Based)

UGround is a strong GUI visual grounding model trained with a simple recipe. Check our homepage and paper for more details.

![radar](https://osu-nlp-group.github.io/UGround/static/images/radar.png)

- **Homepage:** https://osu-nlp-group.github.io/UGround/
- **Repository:** https://github.com/OSU-NLP-Group/UGround
- **Paper:** https://arxiv.org/abs/2410.05243
- **Demo:** https://huggingface.co/spaces/orby-osu/UGround
- **Point of Contact:** [Boyu Gou](mailto:[email protected])

Release progress:

- [x] Model Weights
- [ ] Code
  - [ ] Inference Code of UGround
  - [x] Offline Experiments
    - [x] Screenspot (along with referring expressions generated by GPT-4/4o)
    - [x] Multimodal-Mind2Web
    - [x] OmniAct
  - [ ] Online Experiments
    - [ ] Mind2Web-Live
    - [ ] AndroidWorld
- [ ] Data
  - [ ] Data Examples
  - [ ] Data Construction Scripts
  - [ ] Guidance of Open-source Data
- [x] Online Demo (HF Spaces)

## Inference

### vLLM server

```bash
vllm serve osunlp/UGround-V1-2B --api-key token-abc123 --dtype float16
```

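The grounding snippet below sends requests through an async OpenAI-compatible `client`. A minimal sketch of constructing one, assuming vLLM's default local port (8000) and the placeholder API key from the command above:

```python
from openai import AsyncOpenAI

# OpenAI-compatible client for the local vLLM server started above
# (vLLM listens on port 8000 by default; the key must match --api-key).
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")
```
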

### Visual Grounding Prompt

```python
def format_openai_template(description: str, base64_image):
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                },
                {
                    "type": "text",
                    "text": f"""
Your task is to help the user identify the precise coordinates (x, y) of a specific area/element/object on the screen based on a description.

- Your response should aim to point to the center or a representative point within the described area/element/object as accurately as possible.
- If the description is unclear or ambiguous, infer the most relevant area or element based on its likely context or purpose.
- Your answer should be a single string (x, y) corresponding to the point of interest.

Description: {description}

Answer:""",
                },
            ],
        },
    ]


messages = format_openai_template(description, base64_image)

# `client` is the AsyncOpenAI client created above; run this call inside an
# async function (e.g., driven by asyncio.run).
completion = await client.chat.completions.create(
    model="osunlp/UGround-V1-2B",  # must match the model passed to `vllm serve`
    messages=messages,
    temperature=0,
)
```
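
The template expects `base64_image`, the screenshot encoded as a base64 string. A minimal way to produce it from a saved screenshot (the file name here is only illustrative):

```python
import base64

# Encode a saved screenshot for the data URL used in the image_url field above.
with open("screenshot.jpeg", "rb") as f:
    base64_image = base64.b64encode(f.read()).decode("utf-8")
```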
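
The model replies with a single `(x, y)` string. A small sketch of pulling the two numbers out of the response; how to map them back onto the original screenshot (e.g., whether any rescaling is needed) follows the convention described in the repository:

```python
import re

# Extract the predicted point from the "(x, y)" reply.
reply = completion.choices[0].message.content
match = re.search(r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)", reply)
if match:
    x, y = float(match.group(1)), float(match.group(2))
    print(f"Predicted point: ({x}, {y})")
```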

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6500870f1e14749e84f8f887/u5bXFxxAWCXthyXWyZkM4.png)

## Citation Information

If you find this work useful, please consider citing our papers:

```bibtex
@article{gou2024uground,
  title={Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents},
  author={Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su},
  journal={arXiv preprint arXiv:2410.05243},
  year={2024},
  url={https://arxiv.org/abs/2410.05243},
}

@article{zheng2023seeact,
  title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
  author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
  journal={arXiv preprint arXiv:2401.01614},
  year={2024},
}
```

# Qwen2-VL-2B-Instruct

## Introduction