Update README.md
README.md CHANGED
@@ -10,31 +10,12 @@ base_model:
 ---
 # GLM-4.5-Air-AWQ
 
+
 ## Method
-
-```
-config_groups = {
-    "group_0": {
-        "targets": ["Linear"],
-        "input_activations": None,
-        "output_activations": None,
-        "weights": {
-            "num_bits": 4,
-            "type": "int",
-            "symmetric": True,
-            "strategy": "group",
-            "group_size": 32,
-        }
-    }
-}
-recipe = [
-    AWQModifier(
-        ignore=["lm_head", "re:.*mlp.gate$"],
-        config_groups=config_groups,
-    ),
-]
-```
+[vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor.git) and [nvidia/Llama-Nemotron-Post-Training-Dataset](https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset) were used to quantize the original model. For further information on the quantization arguments and configuration, please see [config.json](https://huggingface.co/cpatonn/GLM-4.5-Air-AWQ-4bit/blob/main/config.json) and [recipe.yaml](https://huggingface.co/cpatonn/GLM-4.5-Air-AWQ-4bit/blob/main/recipe.yaml).
+
 Note: the last layer, i.e., the MTP layer at index 46, is ignored because transformers has no MTP implementation.
+
 ## Inference
 
 ### Prerequisite
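For orientation, the sketch below wires the recipe removed above into llm-compressor's one-shot entry point. It is a minimal sketch, not the exact command used for this checkpoint: the `oneshot` import path, the base-model id, the calibration-dataset handling, and the output path are assumptions, and the linked recipe.yaml and config.json remain authoritative.

```python
# Hypothetical one-shot AWQ run with llm-compressor, mirroring the recipe
# previously inlined in this README (4-bit symmetric int weights, group
# size 32). Model, dataset, and output ids are illustrative assumptions.
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

config_groups = {
    "group_0": {
        "targets": ["Linear"],      # quantize Linear layers...
        "input_activations": None,  # ...weights only, no activation quant
        "output_activations": None,
        "weights": {"num_bits": 4, "type": "int", "symmetric": True,
                    "strategy": "group", "group_size": 32},
    }
}

recipe = [
    AWQModifier(
        # keep the output head and the MoE router gates in full precision
        ignore=["lm_head", "re:.*mlp.gate$"],
        config_groups=config_groups,
    ),
]

oneshot(
    model="zai-org/GLM-4.5-Air",  # assumed base checkpoint
    dataset="nvidia/Llama-Nemotron-Post-Training-Dataset",  # calibration data named above
    recipe=recipe,
    output_dir="GLM-4.5-Air-AWQ-4bit",
)
```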
@@ -48,8 +29,15 @@ pip install -U vllm \
 ### vllm
 Please load the model into vllm or sglang as the float16 data type for AWQ support, and use `tensor_parallel_size <= 2`, e.g.,
 ```
-vllm serve cpatonn/GLM-4.5-Air-AWQ-4bit
+vllm serve cpatonn/GLM-4.5-Air-AWQ-4bit \
+    --dtype float16 \
+    --tensor-parallel-size 2 \
+    --pipeline-parallel-size 2 \
+    --tool-call-parser glm45 \
+    --reasoning-parser glm45 \
+    --enable-auto-tool-choice
 ```
+
 # GLM-4.5-Air
 
 <div align="center">
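Once the server is up it speaks vLLM's OpenAI-compatible API, by default on port 8000. A minimal smoke test, assuming that default endpoint and the `openai` Python client:

```python
# Query the served AWQ model through the OpenAI-compatible endpoint.
# Assumes `vllm serve` is running locally on its default port.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="cpatonn/GLM-4.5-Air-AWQ-4bit",
    messages=[{"role": "user",
               "content": "Summarize what AWQ 4-bit quantization changes at inference time."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```

With `--reasoning-parser glm45` and `--enable-auto-tool-choice` set as above, reasoning text and tool calls should come back in the response's structured `reasoning_content` and `tool_calls` fields rather than inline in the message text.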
|