---
license: other
language:
- zh
- en
base_model:
- THUDM/glm-4v-9b
pipeline_tag: image-text-to-text
library_name: transformers
---

# CogAgent

[Read this in Chinese](README_zh.md)

## About the Model

The `CogAgent-9B-20241220` model is based on [GLM-4V-9B](https://huggingface.co/THUDM/glm-4v-9b), a bilingual open-source VLM base model. Through data collection and optimization, multi-stage training, and strategy improvements, `CogAgent-9B-20241220` achieves significant advancements in GUI perception, inference prediction accuracy, action space completeness, and task generalizability. The model supports bilingual (Chinese and English) interaction with both screenshots and language input.

This version of the CogAgent model has already been applied in ZhipuAI's [GLM-PC product](https://cogagent.aminer.cn/home). We hope this release will assist researchers and developers in advancing the research and applications of GUI agents based on vision-language models.

## Running the Model

Please refer to our [GitHub](https://github.com/THUDM/CogAgent) for specific examples of running the model.

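For orientation, the sketch below shows what a call through `transformers` might look like. It is untested and makes two assumptions: that the checkpoint is published as `THUDM/cogagent-9b-20241220`, and that it inherits the `trust_remote_code` chat-template interface of its GLM-4V-9B base. Treat the scripts in the GitHub repository as authoritative.

```python
# Minimal inference sketch (untested). Assumes the "THUDM/cogagent-9b-20241220" repo id
# and the same trust_remote_code chat-template interface as GLM-4V-9B.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "THUDM/cogagent-9b-20241220"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
).eval()

# One screenshot plus the concatenated query string described in "Input and Output" below.
image = Image.open("screenshot.png").convert("RGB")
query = (
    "Task: Mark all emails as read\n"
    "(Platform: Mac)\n"
    "(Answer in Action-Operation-Sensitive format.)"
)

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "image": image, "content": query}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)
    outputs = outputs[:, inputs["input_ids"].shape[1]:]
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
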
## Input and Output

`CogAgent-9B-20241220` is an agent execution model rather than a conversational model. It does not support continuous conversations, but it does support a continuous execution history. Below are guidelines on how users should format their input for the model and how to interpret the formatted output.

### User Input

1. **`task` field**
   A task description provided by the user, similar to a textual prompt. The input should be concise and clear to guide the `CogAgent-9B-20241220` model to complete the task.

2. **`platform` field**
   `CogAgent-9B-20241220` supports operation on several platforms with GUI interfaces:
    - **Windows**: Use the value `WIN` for Windows 10 or 11.
    - **Mac**: Use the value `MAC` for macOS 14 or 15.
    - **Mobile**: Use the value `Mobile` for Android 13, 14, 15, or similar Android-based UI versions.

   If you use other systems, results may vary. In short, use `Mobile` for mobile devices, `WIN` for Windows, and `MAC` for Mac.

3. **`format` field**
   Specifies the desired format of the returned data. Options include:
    - `Answer in Action-Operation-Sensitive format.`: The default format in our demo, returning actions, corresponding operations, and sensitivity levels.
    - `Answer in Status-Plan-Action-Operation format.`: Returns status, plans, actions, and corresponding operations.
    - `Answer in Status-Action-Operation-Sensitive format.`: Returns status, actions, corresponding operations, and sensitivity levels.
    - `Answer in Status-Action-Operation format.`: Returns status, actions, and corresponding operations.
    - `Answer in Action-Operation format.`: Returns actions and corresponding operations.

4. **`history` field**
   The execution history of previous steps. The full input should be concatenated in the following order (a sketch of building this string appears after the list):

   ```
   query = f'{task}{history}{platform}{format}'
   ```

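As a concrete illustration, here is a small, hypothetical helper that assembles the query string in the order above. The function name and the newline separators are assumptions for readability, not part of the released code; the resulting string matches the example prompt shown in the Example section below.

```python
def build_query(task: str, platform: str, response_format: str, history: str = "") -> str:
    """Assemble the query in the documented order: task, history, platform, format.

    Illustrative only: `build_query` and the newline separators are assumptions,
    not part of the official CogAgent code.
    """
    parts = [f"Task: {task}"]
    if history:
        parts.append(history)  # execution history from previous steps, as plain text
    parts.append(f"(Platform: {platform})")
    parts.append(f"(Answer in {response_format} format.)")
    return "\n".join(parts)


query = build_query(
    task="Mark all emails as read",
    platform="Mac",
    response_format="Action-Operation-Sensitive",
)
print(query)
# Task: Mark all emails as read
# (Platform: Mac)
# (Answer in Action-Operation-Sensitive format.)
```
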
### Model Output

1. **Sensitive operations**: A tag such as `<<Sensitive Operation>>` or `<<General Operation>>`, returned only when a `Sensitive` format is requested.
2. **`Plan`, `Agent`, `Status`, `Action` fields**: Describe the model's behavior and operations, returned according to the requested format.
3. **General response**: A summarized answer that precedes the formatted fields.
4. **`Grounded Operation` field**: Describes the model's specific action, including coordinates, the element type, and an element description (see the parsing sketch after this list). Actions include, for example:
    - `CLICK`: Simulates mouse clicks or touch gestures.
    - `LONGPRESS`: Simulates long presses (supported only in `Mobile` mode).

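Because the output is a formatted string rather than JSON (see the Notes section), callers usually extract the fields themselves. The snippet below is a rough parsing sketch based only on the example responses in this document; the exact output grammar is not specified here, so adapt the patterns to the responses you actually observe.

```python
import re


def parse_response(text: str) -> dict:
    """Pull the documented fields out of a formatted response (illustrative only)."""
    result = {}
    # Lines of the form "Status: ...", "Plan: ...", "Action: ...", "Grounded Operation: ..."
    for field in ("Status", "Plan", "Action", "Grounded Operation"):
        match = re.search(rf"^{field}:\s*(.+)$", text, flags=re.MULTILINE)
        if match:
            result[field] = match.group(1).strip()
    # Sensitivity tag, present only for the *-Sensitive formats.
    tag = re.search(r"<<(.+?)>>", text)
    if tag:
        result["Sensitivity"] = tag.group(1)
    # Bounding box inside the grounded operation, e.g. CLICK(box=[[219,186,311,207]], ...).
    box = re.search(r"box=\[\[(\d+),(\d+),(\d+),(\d+)\]\]", result.get("Grounded Operation", ""))
    if box:
        result["box"] = [int(v) for v in box.groups()]
    return result


example = (
    "Action: Click the \"Mark All as Read\" button at the top toolbar to mark all emails as read.\n"
    "Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', "
    "element_info='Mark All as Read')\n"
    "<<General Operation>>"
)
print(parse_response(example)["box"])  # [219, 186, 311, 207]
```
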
### Example

If the user wants to mark all emails as read on a Mac system and requests an `Action-Operation-Sensitive` format, the prompt should be:

```
Task: Mark all emails as read
(Platform: Mac)
(Answer in Action-Operation-Sensitive format.)
```

Below are examples of model responses based on different requested formats:

<details>
<summary>Answer in Action-Operation-Sensitive format</summary>

```
Action: Click the "Mark All as Read" button at the top toolbar to mark all emails as read.
Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', element_info='Mark All as Read')
<<General Operation>>
```

</details>

<details>
<summary>Answer in Status-Plan-Action-Operation format</summary>

```
Status: None
Plan: None
Action: Click the "Mark All as Read" button at the top center of the inbox page to mark all emails as read.
Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', element_info='Mark All as Read')
```

</details>

<details>
<summary>Answer in Status-Action-Operation-Sensitive format</summary>

```
Status: Currently on the email interface [[0, 2, 998, 905]], with email categories on the left [[1, 216, 144, 570]] and the inbox in the center [[144, 216, 998, 903]]. The "Mark All as Read" button [[223, 178, 311, 210]] has been clicked.
Action: Click the "Mark All as Read" button at the top toolbar to mark all emails as read.
Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', element_info='Mark All as Read')
<<General Operation>>
```

</details>

<details>
<summary>Answer in Status-Action-Operation format</summary>

```
Status: None
Action: On the inbox page, click the "Mark All as Read" button to mark all emails as read.
Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', element_info='Mark All as Read')
```

</details>

<details>
<summary>Answer in Action-Operation format</summary>

```
Action: Right-click on the first email in the left-side list to open the action menu.
Grounded Operation: RIGHT_CLICK(box=[[154,275,343,341]], element_info='[AXCell]')
```

</details>

### Notes

1. This model is not a conversational model and does not support continuous dialogue. Send a specific instruction and build the query with the suggested concatenation method; a sketch of carrying execution history across steps follows these notes.
2. A screenshot image must be provided as input; a textual prompt alone cannot execute GUI agent tasks.
3. The model returns strictly formatted string data and does not support JSON output.

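Because the model keeps no state between calls, the continuous execution history has to be maintained by the caller and folded back into each new query. The loop below is a purely hypothetical sketch of that bookkeeping: `run_model` is a placeholder for your own inference call (for example, the `transformers` sketch above), the numbered history format is an assumption, and a real agent would also execute each returned operation against the GUI between rounds.

```python
# Hypothetical agent loop: accumulate executed steps and feed them back as history.
from typing import List


def run_model(screenshot_path: str, query: str) -> str:
    """Placeholder returning a canned response in the documented format."""
    return (
        'Action: Click the "Mark All as Read" button at the top toolbar.\n'
        "Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable Text', "
        "element_info='Mark All as Read')"
    )


task = "Mark all emails as read"
history_steps: List[str] = []

for _ in range(3):  # cap the number of agent rounds
    history = "\n".join(f"{i}. {op}" for i, op in enumerate(history_steps, start=1))
    query = "\n".join(
        part
        for part in (
            f"Task: {task}",
            history,  # empty on the first round and therefore skipped
            "(Platform: Mac)",
            "(Answer in Action-Operation-Sensitive format.)",
        )
        if part
    )
    response = run_model("screenshot.png", query)
    # Carry this round's grounded operation forward as execution history for the next round.
    grounded = next(line for line in response.splitlines() if line.startswith("Grounded Operation:"))
    history_steps.append(grounded[len("Grounded Operation: "):])
```
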
## Previous Work

In November 2023, we released the first generation of CogAgent. You can find related code and weights in the [CogVLM & CogAgent Official Repository](https://github.com/THUDM/CogVLM).

<div align="center">
    <img src="https://raw.githubusercontent.com/THUDM/CogAgent/refs/heads/main/assets/cogagent_function_cn.jpg" width="70%" />
</div>

<table>
  <tr>
    <td>
      <h2> CogVLM </h2>
      <p> 📖 Paper: <a href="https://arxiv.org/abs/2311.03079">CogVLM: Visual Expert for Pretrained Language Models</a></p>
      <p><b>CogVLM</b> is a powerful open-source vision-language model (VLM). CogVLM-17B has 10 billion vision parameters and 7 billion language parameters, supporting 490x490 resolution image understanding and multi-turn conversations.</p>
      <p><b>CogVLM-17B achieved state-of-the-art performance on 10 classic cross-modal benchmarks</b>, including NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA, and TDIUC.</p>
    </td>
    <td>
      <h2> CogAgent </h2>
      <p> 📖 Paper: <a href="https://arxiv.org/abs/2312.08914">CogAgent: A Visual Language Model for GUI Agents</a></p>
      <p><b>CogAgent</b> is an improved open-source vision-language model based on CogVLM. CogAgent-18B has 11 billion vision parameters and 7 billion language parameters, <b>supporting image understanding at 1120x1120 resolution. Beyond CogVLM's capabilities, it also incorporates GUI agent capabilities.</b></p>
      <p><b>CogAgent-18B achieved state-of-the-art performance on 9 classic cross-modal benchmarks</b>, including VQAv2, OK-VQA, TextVQA, ST-VQA, ChartQA, InfoVQA, DocVQA, MM-Vet, and POPE. It significantly outperformed existing models on GUI operation datasets such as AITW and Mind2Web.</p>
    </td>
  </tr>
</table>

## License

Please follow the [Model License](LICENSE) when using the model weights.