Update README.md
README.md CHANGED
@@ -10,31 +10,12 @@ base_model:
 ---
 # GLM-4.5-Air-AWQ
 
+
 ## Method
-
-```
-config_groups = {
-    "group_0": {
-        "targets": ["Linear"],
-        "input_activations": None,
-        "output_activations": None,
-        "weights": {
-            "num_bits": 4,
-            "type": "int",
-            "symmetric": True,
-            "strategy": "group",
-            "group_size": 32,
-        }
-    }
-}
-recipe = [
-    AWQModifier(
-        ignore=["lm_head", "re:.*mlp.gate$"],
-        config_groups=config_groups,
-    ),
-]
-```
+[vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor.git) and [nvidia/Llama-Nemotron-Post-Training-Dataset](https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset) were used to quantize the original model. For further information on the quantization arguments and configuration, please see [config.json](https://huggingface.co/cpatonn/GLM-4.5-Air-AWQ-4bit/blob/main/config.json) and [recipe.yaml](https://huggingface.co/cpatonn/GLM-4.5-Air-AWQ-4bit/blob/main/recipe.yaml).
+
 Note: the last layer, i.e., the MTP layer at index 46, is ignored because transformers has no MTP implementation.
+
 ## Inference
 
 ### Prerequisite
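For orientation, the sketch below wires the recipe removed above into llm-compressor's one-shot entry point. It is a minimal sketch, not the exact command used for this checkpoint: the `oneshot` import path, the base-model id, the calibration-dataset handling, and the output path are assumptions, and the linked recipe.yaml and config.json remain authoritative.

```python
# Hypothetical one-shot AWQ run with llm-compressor, mirroring the recipe
# previously inlined in this README (4-bit symmetric int weights, group
# size 32). Model, dataset, and output ids are illustrative assumptions.
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

config_groups = {
    "group_0": {
        "targets": ["Linear"],      # quantize Linear layers...
        "input_activations": None,  # ...weights only, no activation quant
        "output_activations": None,
        "weights": {"num_bits": 4, "type": "int", "symmetric": True,
                    "strategy": "group", "group_size": 32},
    }
}

recipe = [
    AWQModifier(
        # keep the output head and the MoE router gates in full precision
        ignore=["lm_head", "re:.*mlp.gate$"],
        config_groups=config_groups,
    ),
]

oneshot(
    model="zai-org/GLM-4.5-Air",  # assumed base checkpoint
    dataset="nvidia/Llama-Nemotron-Post-Training-Dataset",  # calibration data named above
    recipe=recipe,
    output_dir="GLM-4.5-Air-AWQ-4bit",
)
```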
@@ -48,8 +29,15 @@ pip install -U vllm \
 ### vllm
 Please load the model into vllm or sglang as the float16 data type for AWQ support, and use `tensor_parallel_size <= 2`, e.g.,
 ```
-vllm serve cpatonn/GLM-4.5-Air-AWQ-4bit
+vllm serve cpatonn/GLM-4.5-Air-AWQ-4bit \
+    --dtype float16 \
+    --tensor-parallel-size 2 \
+    --pipeline-parallel-size 2 \
+    --tool-call-parser glm45 \
+    --reasoning-parser glm45 \
+    --enable-auto-tool-choice
 ```
+
 # GLM-4.5-Air
 
 <div align="center">
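Once the server is up it speaks vLLM's OpenAI-compatible API, by default on port 8000. A minimal smoke test, assuming that default endpoint and the `openai` Python client:

```python
# Query the served AWQ model through the OpenAI-compatible endpoint.
# Assumes `vllm serve` is running locally on its default port.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="cpatonn/GLM-4.5-Air-AWQ-4bit",
    messages=[{"role": "user",
               "content": "Summarize what AWQ 4-bit quantization changes at inference time."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```

With `--reasoning-parser glm45` and `--enable-auto-tool-choice` set as above, reasoning text and tool calls should come back in the response's structured `reasoning_content` and `tool_calls` fields rather than inline in the message text.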
|