---
license: other
language:
- zh
- en
base_model:
- THUDM/glm-4v-9b
pipeline_tag: image-text-to-text
library_name: transformers
---
# CogAgent
[中文阅读 (Read in Chinese)](README_zh.md)
## About the Model
The `CogAgent-9B-20241220` model is based on [GLM-4V-9B](https://huggingface.co/THUDM/glm-4v-9b), a bilingual
open-source VLM base model. Through data collection and optimization, multi-stage training, and strategy improvements,
`CogAgent-9B-20241220` achieves significant advancements in GUI perception, inference prediction accuracy, action space
completeness, and task generalizability. The model supports bilingual (Chinese and English) interaction with both
screenshots and language input.
This version of the CogAgent model has already been applied in
ZhipuAI's [GLM-PC product](https://cogagent.aminer.cn/home). We hope this release will assist researchers and developers
in advancing the research and applications of GUI agents based on vision-language models.
## Running the Model
Please refer to our [GitHub](https://github.com/THUDM/CogAgent) for specific examples of running the model.
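For orientation, below is a minimal loading-and-generation sketch in Python. It assumes the standard `transformers` path with `trust_remote_code=True` used for the GLM-4V family; in particular, passing the screenshot through an `image` key in the chat template is an assumption here, so treat the scripts in the GitHub repository as authoritative.
```python
# Minimal sketch (not the official script): load CogAgent-9B-20241220 and run
# one prediction on a screenshot. The chat-template message layout below
# (an "image" key plus a text "content" key) is assumed from the GLM-4V-9B
# interface; consult the GitHub repo for the exact, supported calls.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "THUDM/cogagent-9b-20241220"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
).eval()

# A screenshot is required; text alone cannot drive GUI agent tasks.
image = Image.open("screenshot.png").convert("RGB")

# Query assembled as described in "Input and Output" below.
query = (
    "Task: Mark all emails as read\n"
    "(Platform: Mac)\n"
    "(Answer in Action-Operation-Sensitive format.)"
)

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "image": image, "content": query}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)

print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```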
## Input and Output
`CogAgent-9B-20241220` is an agent execution model rather than a conversational model. It does not support continuous
conversations, but it does support continuous execution history. Below are guidelines on how to format the input
for the model and how to interpret its formatted output.
### User Input
1. **`task` field**
A task description provided by the user, similar to a textual prompt. The input should be concise and clear to guide
the `CogAgent-9B-20241220` model to complete the task.
2. **`platform` field**
`CogAgent-9B-20241220` supports operation on several platforms with GUI interfaces:
- **Windows**: Use the value `WIN` for Windows 10 or 11.
- **Mac**: Use the value `MAC` for Mac 14 or 15.
- **Mobile**: Use the value `Mobile` for Android 13, 14, 15, or similar Android-based UI versions.
In short, pass `Mobile` for mobile devices, `WIN` for Windows, and `MAC` for Mac; results on other systems may vary.
3. **`format` field**
Specifies the desired format of the returned data. Options include:
- `Answer in Action-Operation-Sensitive format.`: The default format in our demo, returning actions, corresponding
operations, and sensitivity levels.
- `Answer in Status-Plan-Action-Operation format.`: Returns status, plans, actions, and corresponding operations.
- `Answer in Status-Action-Operation-Sensitive format.`: Returns status, actions, corresponding operations, and
sensitivity levels.
- `Answer in Status-Action-Operation format.`: Returns status, actions, and corresponding operations.
- `Answer in Action-Operation format.`: Returns actions and corresponding operations.
4. **`history` field**
The execution history from earlier steps (empty on the first step). The final model input should be concatenated in the following order (a full assembly sketch follows this list):
```
query = f'{task}{history}{platform}{format}'
```
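As a concrete illustration, the snippet below assembles the text part of the query in exactly that order. The `Task:` / `(Platform: ...)` wording follows the example later in this card; the precise template is defined in the GitHub repository, so treat this as a sketch.
```python
# Sketch: build the text query in the order task -> history -> platform -> format.
def build_query(task: str, history: str, platform: str, answer_format: str) -> str:
    # platform follows the platform field above (the example in this card uses "Mac");
    # answer_format is one of the five "Answer in ... format." strings listed above.
    return f"Task: {task}\n{history}(Platform: {platform})\n({answer_format})"

query = build_query(
    task="Mark all emails as read",
    history="",  # empty on the first step; accumulated execution history on later steps
    platform="Mac",
    answer_format="Answer in Action-Operation-Sensitive format.",
)
print(query)
```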
### Model Output
1. **Sensitive Operations**: The output is tagged `<<Sensitive Operation>>` or `<<General Operation>>`; this tag is returned only
when a `Sensitive` format is requested.
2. **`Plan`, `Agent`, `Status`, `Action` fields**: Describe the model's behavior and operations, returned based on the
requested format.
3. **General Responses**: Summarizes the output before formatting.
4. **`Grounded Operation` field**: Describes the model's specific action, including coordinates, element types, and element
descriptions (a parsing sketch follows this list). Actions include:
- `CLICK`: Simulates mouse clicks or touch gestures.
- `LONGPRESS`: Simulates long presses (supported only in `Mobile` mode).
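Since the model returns a strictly formatted string rather than JSON (see the Notes below), downstream code usually parses the `Grounded Operation` line itself. The regular expression below is a minimal sketch written against the example responses in the next section, not against a formal grammar, so adjust it to the formats you actually request.
```python
# Sketch: pull the operation name and bounding box out of a response such as
#   Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', ...)
import re

def parse_grounded_operation(response: str):
    line = re.search(r"Grounded Operation:\s*(.+)", response)
    if line is None:
        return None
    op = re.match(r"(\w+)\(box=\[\[(\d+),(\d+),(\d+),(\d+)\]\]", line.group(1))
    if op is None:
        return None
    return {
        "operation": op.group(1),                       # e.g. CLICK, LONGPRESS
        "box": tuple(int(v) for v in op.groups()[1:]),  # (x1, y1, x2, y2)
    }

example = (
    'Action: Click the "Mark All as Read" button at the top toolbar.\n'
    "Grounded Operation: CLICK(box=[[219,186,311,207]], "
    "element_type='Clickable Text', element_info='Mark All as Read')\n"
    "<<General Operation>>"
)
print(parse_grounded_operation(example))
# {'operation': 'CLICK', 'box': (219, 186, 311, 207)}
```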
### Example
If the user wants to mark all emails as read on a Mac system and requests an `Action-Operation-Sensitive` format, the
prompt should be:
```
Task: Mark all emails as read
(Platform: Mac)
(Answer in Action-Operation-Sensitive format.)
```
Below are examples of model responses based on different requested formats:
<details>
<summary>Answer in Action-Operation-Sensitive format</summary>
```
Action: Click the "Mark All as Read" button at the top toolbar to mark all emails as read.
Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', element_info='Mark All as Read')
<<General Operation>>
```
</details>
<details>
<summary>Answer in Status-Plan-Action-Operation format</summary>
```
Status: None
Plan: None
Action: Click the "Mark All as Read" button at the top center of the inbox page to mark all emails as read.
Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', element_info='Mark All as Read')
```
</details>
<details>
<summary>Answer in Status-Action-Operation-Sensitive format</summary>
```
Status: Currently on the email interface [[0, 2, 998, 905]], with email categories on the left [[1, 216, 144, 570]] and the inbox in the center [[144, 216, 998, 903]]. The "Mark All as Read" button [[223, 178, 311, 210]] has been clicked.
Action: Click the "Mark All as Read" button at the top toolbar to mark all emails as read.
Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', element_info='Mark All as Read')
<<General Operation>>
```
</details>
<details>
<summary>Answer in Status-Action-Operation format</summary>
```
Status: None
Action: On the inbox page, click the "Mark All as Read" button to mark all emails as read.
Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', element_info='Mark All as Read')
```
</details>
<details>
<summary>Answer in Action-Operation format</summary>
```
Action: Right-click on the first email in the left-side list to open the action menu.
Grounded Operation: RIGHT_CLICK(box=[[154,275,343,341]], element_info='[AXCell]')
```
</details>
### Notes
1. This model is not a conversational model and does not support continuous dialogue. Send a single, specific instruction
per step and use the suggested concatenation method (an agent-loop sketch follows these notes).
2. Images must be provided as input; textual prompts alone cannot execute GUI agent tasks.
3. The model outputs strictly formatted string (STR) data; JSON output is not supported.
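To make the "continuous execution history" workflow concrete, here is a sketch of the outer agent loop. The three helpers are hypothetical stubs standing in for real screen capture, the generation call sketched under "Running the Model", and an OS-level executor; the history bookkeeping is illustrative only, and the exact history format is defined in the GitHub repository.
```python
# Sketch of an agent loop around CogAgent-9B-20241220. take_screenshot(),
# run_model() and execute() are hypothetical stubs, not part of this model card.
def take_screenshot():
    return "screenshot.png"               # hypothetical: capture the current screen

def run_model(screenshot, query):
    # hypothetical: call the loaded model as in the loading sketch above
    return ("Action: Click the \"Mark All as Read\" button.\n"
            "Grounded Operation: CLICK(box=[[219,186,311,207]], "
            "element_type='Clickable Text', element_info='Mark All as Read')")

def execute(operation):
    print("executing", operation)         # hypothetical: dispatch CLICK, LONGPRESS, ...

def agent_loop(task: str, platform: str, answer_format: str, max_steps: int = 3) -> None:
    history = ""
    for step in range(1, max_steps + 1):
        screenshot = take_screenshot()
        query = f"Task: {task}\n{history}(Platform: {platform})\n({answer_format})"
        response = run_model(screenshot, query)
        operation = parse_grounded_operation(response)   # see the parsing sketch above
        if operation is None:
            break                                        # nothing actionable returned
        execute(operation)
        history += f"{step}. {response}\n"               # carry the step forward as history

agent_loop("Mark all emails as read", "Mac", "Answer in Action-Operation-Sensitive format.")
```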
## Previous Work
In November 2023, we released the first generation of CogAgent. You can find related code and weights in
the [CogVLM & CogAgent Official Repository](https://github.com/THUDM/CogVLM).
<div align="center">
<img src="https://raw.githubusercontent.com/THUDM/CogAgent/refs/heads/main/assets/cogagent_function.jpg" width="70%" />
</div>
<table>
<tr>
<td>
<h2> CogVLM </h2>
<p> 📖 Paper: <a href="https://arxiv.org/abs/2311.03079">CogVLM: Visual Expert for Pretrained Language Models</a></p>
<p><b>CogVLM</b> is a powerful open-source vision-language model (VLM). CogVLM-17B has 10 billion vision parameters and 7 billion language parameters, supporting 490x490 resolution image understanding and multi-turn conversations.</p>
<p><b>CogVLM-17B achieved state-of-the-art performance on 10 classic cross-modal benchmarks</b> including NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA, and TDIUC benchmarks.</p>
</td>
<td>
<h2> CogAgent </h2>
<p> 📖 Paper: <a href="https://arxiv.org/abs/2312.08914">CogAgent: A Visual Language Model for GUI Agents </a></p>
<p><b>CogAgent</b> is an improved open-source vision-language model based on CogVLM. CogAgent-18B has 11 billion vision parameters and 7 billion language parameters, <b>supporting image understanding at 1120x1120 resolution. Beyond CogVLM's capabilities, it also incorporates GUI agent capabilities.</b></p>
<p><b>CogAgent-18B achieved state-of-the-art performance on 9 classic cross-modal benchmarks,</b> including VQAv2, OK-VQA, TextVQA, ST-VQA, ChartQA, InfoVQA, DocVQA, MM-Vet, and POPE benchmarks. It significantly outperformed existing models on GUI operation datasets like AITW and Mind2Web.</p>
</td>
</tr>
</table>
## License
Please follow the [Model License](LICENSE) for using the model weights.