base_model:
- Qwen/Qwen2-VL-2B
---

# UGround-V1-2B (Qwen2-VL-Based)

UGround is a strong GUI visual grounding model trained with a simple recipe. Check our homepage and paper for more details.

![radar](https://osu-nlp-group.github.io/UGround/static/images/radar.png)

- **Homepage:** https://osu-nlp-group.github.io/UGround/
- **Repository:** https://github.com/OSU-NLP-Group/UGround
- **Paper:** https://arxiv.org/abs/2410.05243
- **Demo:** https://huggingface.co/spaces/orby-osu/UGround
- **Point of Contact:** [Boyu Gou](mailto:[email protected])

Release progress:

- [x] Model Weights
- [ ] Code
  - [ ] Inference Code of UGround
  - [x] Offline Experiments
    - [x] Screenspot (along with referring expressions generated by GPT-4/4o)
    - [x] Multimodal-Mind2Web
    - [x] OmniAct
  - [ ] Online Experiments
    - [ ] Mind2Web-Live
    - [ ] AndroidWorld
- [ ] Data
  - [ ] Data Examples
  - [ ] Data Construction Scripts
  - [ ] Guidance of Open-source Data
- [x] Online Demo (HF Spaces)

## Inference

### vLLM server

```bash
vllm serve osunlp/UGround-V1-2B --api-key token-abc123 --dtype float16
```

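The grounding snippet below sends requests through an async OpenAI-compatible `client`. A minimal sketch of constructing one, assuming vLLM's default local port (8000) and the placeholder API key from the command above:

```python
from openai import AsyncOpenAI

# OpenAI-compatible client for the local vLLM server started above
# (vLLM listens on port 8000 by default; the key must match --api-key).
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")
```
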

### Visual Grounding Prompt

```python
def format_openai_template(description: str, base64_image):
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                },
                {
                    "type": "text",
                    "text": f"""
Your task is to help the user identify the precise coordinates (x, y) of a specific area/element/object on the screen based on a description.

- Your response should aim to point to the center or a representative point within the described area/element/object as accurately as possible.
- If the description is unclear or ambiguous, infer the most relevant area or element based on its likely context or purpose.
- Your answer should be a single string (x, y) corresponding to the point of interest.

Description: {description}

Answer:""",
                },
            ],
        },
    ]


messages = format_openai_template(description, base64_image)

# `client` is the AsyncOpenAI client created above; run this call inside an
# async function (e.g., driven by asyncio.run).
completion = await client.chat.completions.create(
    model="osunlp/UGround-V1-2B",  # must match the model passed to `vllm serve`
    messages=messages,
    temperature=0,
)
```
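
The template expects `base64_image`, the screenshot encoded as a base64 string. A minimal way to produce it from a saved screenshot (the file name here is only illustrative):

```python
import base64

# Encode a saved screenshot for the data URL used in the image_url field above.
with open("screenshot.jpeg", "rb") as f:
    base64_image = base64.b64encode(f.read()).decode("utf-8")
```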
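
The model replies with a single `(x, y)` string. A small sketch of pulling the two numbers out of the response; how to map them back onto the original screenshot (e.g., whether any rescaling is needed) follows the convention described in the repository:

```python
import re

# Extract the predicted point from the "(x, y)" reply.
reply = completion.choices[0].message.content
match = re.search(r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)", reply)
if match:
    x, y = float(match.group(1)), float(match.group(2))
    print(f"Predicted point: ({x}, {y})")
```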

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6500870f1e14749e84f8f887/u5bXFxxAWCXthyXWyZkM4.png)

## Citation Information

If you find this work useful, please consider citing our papers:

```bibtex
@article{gou2024uground,
  title={Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents},
  author={Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su},
  journal={arXiv preprint arXiv:2410.05243},
  year={2024},
  url={https://arxiv.org/abs/2410.05243},
}

@article{zheng2023seeact,
  title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
  author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
  journal={arXiv preprint arXiv:2401.01614},
  year={2024},
}
```

# Qwen2-VL-2B-Instruct

## Introduction