---
license: other
language:
  - zh
  - en
base_model:
  - THUDM/glm-4v-9b
pipeline_tag: image-text-to-text
library_name: transformers
---

# CogAgent

[中文阅读](README_zh.md)

## About the Model

The `CogAgent-9B-20241220` model is based on [GLM-4V-9B](https://huggingface.co/THUDM/glm-4v-9b), a bilingual
open-source VLM base model. Through data collection and optimization, multi-stage training, and strategy improvements,
`CogAgent-9B-20241220` achieves significant advancements in GUI perception, inference prediction accuracy, action space
completeness, and task generalizability. The model supports bilingual (Chinese and English) interaction with both
screenshots and language input.

This version of the CogAgent model has already been applied in
ZhipuAI's [GLM-PC product](https://cogagent.aminer.cn/home). We hope this release will assist researchers and developers
in advancing the research and applications of GUI agents based on vision-language models.

## Running the Model

Please refer to our [GitHub](https://github.com/THUDM/CogAgent) for specific examples of running the model.
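
For quick orientation, here is a minimal loading-and-inference sketch. It assumes that this repository (`THUDM/cogagent-9b-20241220`) exposes the same `trust_remote_code` chat-template interface as its GLM-4V-9B base model; the screenshot path, prompt text, and generation settings are placeholders, and the GitHub examples remain the authoritative reference.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch only: assumes the GLM-4V-style chat template exposed via
# trust_remote_code. Paths, prompt text, and generation settings are placeholders.
MODEL_ID = "THUDM/cogagent-9b-20241220"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda").eval()

image = Image.open("screenshot.png").convert("RGB")  # placeholder screenshot path
query = (
    "Task: Mark all emails as read\n"
    "(Platform: Mac)\n"
    "(Answer in Action-Operation-Sensitive format.)"
)

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "image": image, "content": query}],
    add_generation_prompt=True, tokenize=True, return_tensors="pt", return_dict=True
).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```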

## Input and Output

`cogagent-9b-20241220` is an agent execution model rather than a conversational model. It does not support continuous
conversations but does support continuous execution history. Below are guidelines on how users should format their input
for the model and interpret the formatted output.

### User Input

1. **`task` field**  
   A task description provided by the user, similar to a textual prompt. The description should be concise and clear so
   that it guides `CogAgent-9B-20241220` to complete the task.

2. **`platform` field**  
   `CogAgent-9B-20241220` supports operation on several platforms with GUI interfaces:
    - **Windows**: Use the `WIN` field for Windows 10 or 11.
    - **Mac**: Use the `MAC` field for macOS 14 or 15.
    - **Mobile**: Use the `Mobile` field for Android 13, 14, 15, or similar Android-based UI versions.

   If you are using another system, results may be suboptimal; choose the closest of the three fields above (`Mobile` for
   mobile devices, `WIN` for Windows, `MAC` for Mac).

3. **`format` field**  
   Specifies the desired format of the returned data. Options include:
    - `Answer in Action-Operation-Sensitive format.`: The default format in our demo, returning actions, corresponding
      operations, and sensitivity levels.
    - `Answer in Status-Plan-Action-Operation format.`: Returns status, plans, actions, and corresponding operations.
    - `Answer in Status-Action-Operation-Sensitive format.`: Returns status, actions, corresponding operations, and
      sensitivity levels.
    - `Answer in Status-Action-Operation format.`: Returns status, actions, and corresponding operations.
    - `Answer in Action-Operation format.`: Returns actions and corresponding operations.

4. **`history` field**  
   The input should be concatenated in the following order:
   ```
   query = f'{task}{history}{platform}{format}'
   ```
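
For concreteness, here is a hypothetical helper that assembles the query in that order. The exact separators and the layout of the history string are assumptions made for illustration; match them to the examples in the GitHub repository.

```python
# Hypothetical helper assembling the model query from the user-input fields.
# The separators and history layout below are assumptions for illustration;
# follow the GitHub examples for the exact concatenation used in the demo.
def build_query(task: str, history_steps: list[str], platform: str, answer_format: str) -> str:
    history = ""
    if history_steps:
        history = "\nHistory steps:" + "".join(
            f"\n{i}. {step}" for i, step in enumerate(history_steps)
        )
    return (
        f"Task: {task}"
        f"{history}"
        f"\n(Platform: {platform})"
        f"\n({answer_format})"
    )

# Reproduces the example prompt shown further below (empty execution history).
print(build_query(
    task="Mark all emails as read",
    history_steps=[],
    platform="Mac",
    answer_format="Answer in Action-Operation-Sensitive format.",
))
```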

### Model Output

1. **Sensitive Operations**: The response is tagged `<<Sensitive Operation>>` or `<<General Operation>>`; this tag is
   returned only when a `Sensitive` format is requested.
2. **`Plan`, `Agent`, `Status`, `Action` fields**: Describe the model's behavior and operations, returned based on the
   requested format.
3. **General Responses**: A plain-language summary of the response, given before the formatted output.
4. **`Grounded Operation` field**: Describes the model's specific actions, such as coordinates, element types, and
   descriptions. Actions include:
    - `CLICK`: Simulates mouse clicks or touch gestures.
    - `LONGPRESS`: Simulates long presses (supported only in `Mobile` mode).
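
As an illustration of how the formatted output might be consumed, here is a minimal parsing sketch. It assumes the `Grounded Operation` line layout shown in the examples below and is not part of the official tooling; real responses may contain operations or fields it does not handle.

```python
import re

# Assumes the "Grounded Operation: OP(box=[[x1,y1,x2,y2]], ...)" layout shown in
# the examples below; operations or fields outside this pattern are not handled.
GROUNDED_RE = re.compile(
    r"Grounded Operation:\s*(?P<op>[A-Z_]+)\(box=\[\[(?P<box>[\d,\s]+)\]\](?P<args>[^)]*)\)"
)

def parse_response(response: str) -> dict | None:
    match = GROUNDED_RE.search(response)
    if match is None:
        return None
    return {
        "operation": match.group("op"),                          # e.g. CLICK, RIGHT_CLICK
        "box": [int(v) for v in match.group("box").split(",")],  # [x1, y1, x2, y2] as emitted by the model
        "args": match.group("args").lstrip(", "),                 # element_type / element_info, if present
        "sensitive": "<<Sensitive Operation>>" in response,
    }

example = (
    'Action: Click the "Mark All as Read" button at the top toolbar to mark all emails as read.\n'
    "Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', element_info='Mark All as Read')\n"
    "<<General Operation>>"
)
print(parse_response(example))
```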

### Example

If the user wants to mark all emails as read on a Mac system and requests an `Action-Operation-Sensitive` format, the
prompt should be:

```
Task: Mark all emails as read
(Platform: Mac)
(Answer in Action-Operation-Sensitive format.)
```

Below are examples of model responses based on different requested formats:

<details>
<summary>Answer in Action-Operation-Sensitive format</summary>

```
Action: Click the "Mark All as Read" button at the top toolbar to mark all emails as read.
Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', element_info='Mark All as Read')
<<General Operation>>
```

</details>

<details>
<summary>Answer in Status-Plan-Action-Operation format</summary>

```
Status: None
Plan: None
Action: Click the "Mark All as Read" button at the top center of the inbox page to mark all emails as read.
Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', element_info='Mark All as Read')
```

</details>

<details>
<summary>Answer in Status-Action-Operation-Sensitive format</summary>

```
Status: Currently on the email interface [[0, 2, 998, 905]], with email categories on the left [[1, 216, 144, 570]] and the inbox in the center [[144, 216, 998, 903]]. The "Mark All as Read" button [[223, 178, 311, 210]] has been clicked.
Action: Click the "Mark All as Read" button at the top toolbar to mark all emails as read.
Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', element_info='Mark All as Read')
<<General Operation>>
```

</details>

<details>
<summary>Answer in Status-Action-Operation format</summary>

```
Status: None
Action: On the inbox page, click the "Mark All as Read" button to mark all emails as read.
Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', element_info='Mark All as Read')
```

</details>

<details>
<summary>Answer in Action-Operation format</summary>

```
Action: Right-click on the first email in the left-side list to open the action menu.
Grounded Operation: RIGHT_CLICK(box=[[154,275,343,341]], element_info='[AXCell]')
```

</details>

### Notes

1. This model is not a conversational model and does not support continuous dialogue. Please send specific instructions
   and use the suggested concatenation method.
2. Images must be provided as input; the model cannot perform GUI agent tasks from a textual prompt alone.
3. The model returns its output as a strictly formatted string; JSON output is not supported.

## Previous Work

In November 2023, we released the first generation of CogAgent. You can find related code and weights in
the [CogVLM & CogAgent Official Repository](https://github.com/THUDM/CogVLM).

<div align="center">
    <img src="https://raw.githubusercontent.com/THUDM/CogAgent/refs/heads/main/assets/cogagent_function.jpg" width="70%" />
</div>

<table>
  <tr>
    <td>
      <h2> CogVLM </h2>
      <p> 📖  Paper: <a href="https://arxiv.org/abs/2311.03079">CogVLM: Visual Expert for Pretrained Language Models</a></p>
      <p><b>CogVLM</b> is a powerful open-source vision-language model (VLM). CogVLM-17B has 10 billion vision parameters and 7 billion language parameters, supporting 490x490 resolution image understanding and multi-turn conversations.</p>
      <p><b>CogVLM-17B achieved state-of-the-art performance on 10 classic cross-modal benchmarks,</b> including NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA, and TDIUC.</p>
    </td>
    <td>
      <h2> CogAgent </h2>
      <p> 📖  Paper: <a href="https://arxiv.org/abs/2312.08914">CogAgent: A Visual Language Model for GUI Agents </a></p>
      <p><b>CogAgent</b> is an improved open-source vision-language model based on CogVLM. CogAgent-18B has 11 billion vision parameters and 7 billion language parameters, <b>supporting image understanding at 1120x1120 resolution. Beyond CogVLM's capabilities, it also incorporates GUI agent capabilities.</b></p>
      <p><b>CogAgent-18B achieved state-of-the-art performance on 9 classic cross-modal benchmarks,</b> including VQAv2, OK-VQA, TextVQA, ST-VQA, ChartQA, InfoVQA, DocVQA, MM-Vet, and POPE. It significantly outperformed existing models on GUI operation datasets such as AITW and Mind2Web.</p>
    </td>
  </tr>
</table>

## License

Please follow the [Model License](LICENSE) for using the model weights.