cpatonn committed
Commit ab26adb · verified · 1 Parent(s): 3b92c19

Update README.md

Files changed (1):
  1. README.md +12 -24
README.md CHANGED
@@ -10,31 +10,12 @@ base_model:
 ---
 # GLM-4.5-Air-AWQ
 
+
 ## Method
-Quantised using [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor.git), [nvidia/Llama-Nemotron-Post-Training-Dataset](https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset) and the following configs:
-```
-config_groups = {
-    "group_0": {
-        "targets": ["Linear"],
-        "input_activations": None,
-        "output_activations": None,
-        "weights": {
-            "num_bits": 4,
-            "type": "int",
-            "symmetric": True,
-            "strategy": "group",
-            "group_size": 32,
-        }
-    }
-}
-recipe = [
-    AWQModifier(
-        ignore=["lm_head", "re:.*mlp.gate$"],
-        config_groups=config_groups,
-    ),
-]
-```
+[vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor.git) and [nvidia/Llama-Nemotron-Post-Training-Dataset](https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset) were used to quantize the original model. For details on the quantization arguments and configuration, please see [config.json](https://huggingface.co/cpatonn/GLM-4.5-Air-AWQ-4bit/blob/main/config.json) and [recipe.yaml](https://huggingface.co/cpatonn/GLM-4.5-Air-AWQ-4bit/blob/main/recipe.yaml).
+
 Note: the last layer (the MTP layer, index 46) is ignored because transformers does not implement MTP.
+
 ## Inference
 
 ### Prerequisite
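
For context on the new Method section: below is a minimal sketch of how this AWQ run might be reproduced with llm-compressor. The `config_groups` and recipe values mirror the block removed in this commit; the base-model identifier, dataset handling, and calibration settings (`max_seq_length`, `num_calibration_samples`) are assumptions, and the authoritative values live in the linked recipe.yaml.

```python
# Minimal sketch (not the authoritative run): reproducing the AWQ
# quantization with llm-compressor. config_groups and the recipe mirror
# the block removed in this commit; everything else is an assumption.
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

config_groups = {
    "group_0": {
        "targets": ["Linear"],
        "input_activations": None,
        "output_activations": None,
        "weights": {
            "num_bits": 4,          # INT4 weights
            "type": "int",
            "symmetric": True,
            "strategy": "group",    # group-wise scales
            "group_size": 32,
        },
    },
}

recipe = [
    AWQModifier(
        ignore=["lm_head", "re:.*mlp.gate$"],  # skip head and MoE router gates
        config_groups=config_groups,
    ),
]

oneshot(
    model="zai-org/GLM-4.5-Air",  # assumed base-model id
    dataset="nvidia/Llama-Nemotron-Post-Training-Dataset",  # calibration data
    recipe=recipe,
    max_seq_length=2048,           # assumed calibration settings
    num_calibration_samples=512,
    output_dir="GLM-4.5-Air-AWQ-4bit",
)
```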
@@ -48,8 +29,15 @@ pip install -U vllm \
 ### vllm
 Please load the model in vllm or sglang as float16 for AWQ support, and use `tensor_parallel_size <= 2`, e.g.:
 ```
-vllm serve cpatonn/GLM-4.5-Air-AWQ-4bit --dtype float16 --tensor-parallel-size 2 --pipeline-parallel-size 2 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
+vllm serve cpatonn/GLM-4.5-Air-AWQ-4bit \
+  --dtype float16 \
+  --tensor-parallel-size 2 \
+  --pipeline-parallel-size 2 \
+  --tool-call-parser glm45 \
+  --reasoning-parser glm45 \
+  --enable-auto-tool-choice
 ```
+
 # GLM-4.5-Air
 
  <div align="center">
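
Once the server is up, it exposes vllm's OpenAI-compatible API (default port 8000), so a quick smoke test could look like the sketch below; the prompt and the `EMPTY` api key placeholder are illustrative.

```python
# Hedged sketch: querying the vllm server started with the command above.
# Assumes the default endpoint http://localhost:8000/v1; the api_key is a
# placeholder since vllm does not require one unless --api-key is set.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="cpatonn/GLM-4.5-Air-AWQ-4bit",
    messages=[{"role": "user", "content": "Give me a one-line summary of AWQ."}],
)
print(response.choices[0].message.content)
```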