---
license: other
language:
- zh
- en
base_model:
- THUDM/glm-4v-9b
pipeline_tag: image-text-to-text
library_name: transformers
---

# CogAgent

[Read this in Chinese](README_zh.md)

## About the Model

The `CogAgent-9B-20241220` model is based on [GLM-4V-9B](https://huggingface.co/THUDM/glm-4v-9b), a bilingual open-source VLM base model. Through data collection and optimization, multi-stage training, and strategy improvements, `CogAgent-9B-20241220` achieves significant advancements in GUI perception, inference prediction accuracy, action space completeness, and task generalizability. The model supports bilingual (Chinese and English) interaction with both screenshots and language input.

This version of the CogAgent model has already been applied in ZhipuAI's [GLM-PC product](https://cogagent.aminer.cn/home). We hope this release will assist researchers and developers in advancing the research and applications of GUI agents based on vision-language models.

## Running the Model

Please refer to our [GitHub](https://github.com/THUDM/CogAgent) for specific examples of running the model.

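For orientation, the sketch below shows what a call through `transformers` might look like. It is untested and makes two assumptions: that the checkpoint is published as `THUDM/cogagent-9b-20241220`, and that it inherits the `trust_remote_code` chat-template interface of its GLM-4V-9B base. Treat the scripts in the GitHub repository as authoritative.

```python
# Minimal inference sketch (untested). Assumes the "THUDM/cogagent-9b-20241220" repo id
# and the same trust_remote_code chat-template interface as GLM-4V-9B.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "THUDM/cogagent-9b-20241220"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
).eval()

# One screenshot plus the concatenated query string described in "Input and Output" below.
image = Image.open("screenshot.png").convert("RGB")
query = (
    "Task: Mark all emails as read\n"
    "(Platform: Mac)\n"
    "(Answer in Action-Operation-Sensitive format.)"
)

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "image": image, "content": query}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)
    outputs = outputs[:, inputs["input_ids"].shape[1]:]
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
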
## Input and Output

`CogAgent-9B-20241220` is an agent execution model rather than a conversational model. It does not support continuous conversations, but it does support a continuous execution history. Below are guidelines on how users should format their input for the model and how to interpret the formatted output.

### User Input

1. **`task` field**
   A task description provided by the user, similar to a textual prompt. The input should be concise and clear to guide the `CogAgent-9B-20241220` model to complete the task.

2. **`platform` field**
   `CogAgent-9B-20241220` supports operation on several platforms with GUI interfaces:
    - **Windows**: Use the value `WIN` for Windows 10 or 11.
    - **Mac**: Use the value `MAC` for macOS 14 or 15.
    - **Mobile**: Use the value `Mobile` for Android 13, 14, 15, or similar Android-based UI versions.

   If you use other systems, results may vary. In short, use `Mobile` for mobile devices, `WIN` for Windows, and `MAC` for Mac.

3. **`format` field**
   Specifies the desired format of the returned data. Options include:
    - `Answer in Action-Operation-Sensitive format.`: The default format in our demo, returning actions, corresponding operations, and sensitivity levels.
    - `Answer in Status-Plan-Action-Operation format.`: Returns status, plans, actions, and corresponding operations.
    - `Answer in Status-Action-Operation-Sensitive format.`: Returns status, actions, corresponding operations, and sensitivity levels.
    - `Answer in Status-Action-Operation format.`: Returns status, actions, and corresponding operations.
    - `Answer in Action-Operation format.`: Returns actions and corresponding operations.

4. **`history` field**
   The execution history of previous steps. The full input should be concatenated in the following order (a sketch of building this string appears after the list):

   ```
   query = f'{task}{history}{platform}{format}'
   ```

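As a concrete illustration, here is a small, hypothetical helper that assembles the query string in the order above. The function name and the newline separators are assumptions for readability, not part of the released code; the resulting string matches the example prompt shown in the Example section below.

```python
def build_query(task: str, platform: str, response_format: str, history: str = "") -> str:
    """Assemble the query in the documented order: task, history, platform, format.

    Illustrative only: `build_query` and the newline separators are assumptions,
    not part of the official CogAgent code.
    """
    parts = [f"Task: {task}"]
    if history:
        parts.append(history)  # execution history from previous steps, as plain text
    parts.append(f"(Platform: {platform})")
    parts.append(f"(Answer in {response_format} format.)")
    return "\n".join(parts)


query = build_query(
    task="Mark all emails as read",
    platform="Mac",
    response_format="Action-Operation-Sensitive",
)
print(query)
# Task: Mark all emails as read
# (Platform: Mac)
# (Answer in Action-Operation-Sensitive format.)
```
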
### Model Output

1. **Sensitive operations**: A tag such as `<<Sensitive Operation>>` or `<<General Operation>>`, returned only when a `Sensitive` format is requested.
2. **`Plan`, `Agent`, `Status`, `Action` fields**: Describe the model's behavior and operations, returned according to the requested format.
3. **General response**: A summarized answer that precedes the formatted fields.
4. **`Grounded Operation` field**: Describes the model's specific action, including coordinates, the element type, and an element description (see the parsing sketch after this list). Actions include, for example:
    - `CLICK`: Simulates mouse clicks or touch gestures.
    - `LONGPRESS`: Simulates long presses (supported only in `Mobile` mode).

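Because the output is a formatted string rather than JSON (see the Notes section), callers usually extract the fields themselves. The snippet below is a rough parsing sketch based only on the example responses in this document; the exact output grammar is not specified here, so adapt the patterns to the responses you actually observe.

```python
import re


def parse_response(text: str) -> dict:
    """Pull the documented fields out of a formatted response (illustrative only)."""
    result = {}
    # Lines of the form "Status: ...", "Plan: ...", "Action: ...", "Grounded Operation: ..."
    for field in ("Status", "Plan", "Action", "Grounded Operation"):
        match = re.search(rf"^{field}:\s*(.+)$", text, flags=re.MULTILINE)
        if match:
            result[field] = match.group(1).strip()
    # Sensitivity tag, present only for the *-Sensitive formats.
    tag = re.search(r"<<(.+?)>>", text)
    if tag:
        result["Sensitivity"] = tag.group(1)
    # Bounding box inside the grounded operation, e.g. CLICK(box=[[219,186,311,207]], ...).
    box = re.search(r"box=\[\[(\d+),(\d+),(\d+),(\d+)\]\]", result.get("Grounded Operation", ""))
    if box:
        result["box"] = [int(v) for v in box.groups()]
    return result


example = (
    "Action: Click the \"Mark All as Read\" button at the top toolbar to mark all emails as read.\n"
    "Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', "
    "element_info='Mark All as Read')\n"
    "<<General Operation>>"
)
print(parse_response(example)["box"])  # [219, 186, 311, 207]
```
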
### Example

If the user wants to mark all emails as read on a Mac system and requests an `Action-Operation-Sensitive` format, the prompt should be:

```
Task: Mark all emails as read
(Platform: Mac)
(Answer in Action-Operation-Sensitive format.)
```

Below are examples of model responses based on different requested formats:

<details>
<summary>Answer in Action-Operation-Sensitive format</summary>

```
Action: Click the "Mark All as Read" button at the top toolbar to mark all emails as read.
Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', element_info='Mark All as Read')
<<General Operation>>
```

</details>

<details>
<summary>Answer in Status-Plan-Action-Operation format</summary>

```
Status: None
Plan: None
Action: Click the "Mark All as Read" button at the top center of the inbox page to mark all emails as read.
Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', element_info='Mark All as Read')
```

</details>

<details>
<summary>Answer in Status-Action-Operation-Sensitive format</summary>

```
Status: Currently on the email interface [[0, 2, 998, 905]], with email categories on the left [[1, 216, 144, 570]] and the inbox in the center [[144, 216, 998, 903]]. The "Mark All as Read" button [[223, 178, 311, 210]] has been clicked.
Action: Click the "Mark All as Read" button at the top toolbar to mark all emails as read.
Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', element_info='Mark All as Read')
<<General Operation>>
```

</details>

<details>
<summary>Answer in Status-Action-Operation format</summary>

```
Status: None
Action: On the inbox page, click the "Mark All as Read" button to mark all emails as read.
Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', element_info='Mark All as Read')
```

</details>

<details>
<summary>Answer in Action-Operation format</summary>

```
Action: Right-click on the first email in the left-side list to open the action menu.
Grounded Operation: RIGHT_CLICK(box=[[154,275,343,341]], element_info='[AXCell]')
```

</details>

### Notes

1. This model is not a conversational model and does not support continuous dialogue. Send a specific instruction and build the query with the suggested concatenation method; a sketch of carrying execution history across steps follows these notes.
2. A screenshot image must be provided as input; a textual prompt alone cannot execute GUI agent tasks.
3. The model returns strictly formatted string data and does not support JSON output.

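Because the model keeps no state between calls, the continuous execution history has to be maintained by the caller and folded back into each new query. The loop below is a purely hypothetical sketch of that bookkeeping: `run_model` is a placeholder for your own inference call (for example, the `transformers` sketch above), the numbered history format is an assumption, and a real agent would also execute each returned operation against the GUI between rounds.

```python
# Hypothetical agent loop: accumulate executed steps and feed them back as history.
from typing import List


def run_model(screenshot_path: str, query: str) -> str:
    """Placeholder returning a canned response in the documented format."""
    return (
        'Action: Click the "Mark All as Read" button at the top toolbar.\n'
        "Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', "
        "element_info='Mark All as Read')"
    )


task = "Mark all emails as read"
history_steps: List[str] = []

for _ in range(3):  # cap the number of agent rounds
    history = "\n".join(f"{i}. {op}" for i, op in enumerate(history_steps, start=1))
    query = "\n".join(
        part
        for part in (
            f"Task: {task}",
            history,  # empty on the first round and therefore skipped
            "(Platform: Mac)",
            "(Answer in Action-Operation-Sensitive format.)",
        )
        if part
    )
    response = run_model("screenshot.png", query)
    # Carry this round's grounded operation forward as execution history for the next round.
    grounded = next(line for line in response.splitlines() if line.startswith("Grounded Operation:"))
    history_steps.append(grounded[len("Grounded Operation: "):])
```
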
## Previous Work

In November 2023, we released the first generation of CogAgent. You can find related code and weights in the [CogVLM & CogAgent Official Repository](https://github.com/THUDM/CogVLM).

<div align="center">
    <img src="https://raw.githubusercontent.com/THUDM/CogAgent/refs/heads/main/assets/cogagent_function_cn.jpg" width="70%" />
</div>

<table>
  <tr>
    <td>
      <h2> CogVLM </h2>
      <p> 📖 Paper: <a href="https://arxiv.org/abs/2311.03079">CogVLM: Visual Expert for Pretrained Language Models</a></p>
      <p><b>CogVLM</b> is a powerful open-source vision-language model (VLM). CogVLM-17B has 10 billion vision parameters and 7 billion language parameters, supporting 490x490 resolution image understanding and multi-turn conversations.</p>
      <p><b>CogVLM-17B achieved state-of-the-art performance on 10 classic cross-modal benchmarks</b>, including NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA, and TDIUC.</p>
    </td>
    <td>
      <h2> CogAgent </h2>
      <p> 📖 Paper: <a href="https://arxiv.org/abs/2312.08914">CogAgent: A Visual Language Model for GUI Agents</a></p>
      <p><b>CogAgent</b> is an improved open-source vision-language model based on CogVLM. CogAgent-18B has 11 billion vision parameters and 7 billion language parameters, <b>supporting image understanding at 1120x1120 resolution. Beyond CogVLM's capabilities, it also incorporates GUI agent capabilities.</b></p>
      <p><b>CogAgent-18B achieved state-of-the-art performance on 9 classic cross-modal benchmarks</b>, including VQAv2, OK-VQA, TextVQA, ST-VQA, ChartQA, InfoVQA, DocVQA, MM-Vet, and POPE. It significantly outperformed existing models on GUI operation datasets such as AITW and Mind2Web.</p>
    </td>
  </tr>
</table>

## License

Please follow the [Model License](LICENSE) when using the model weights.