cpatonn committed · verified
Commit 4fcd75a · Parent(s): fa8b829

Update README.md

Files changed (1): README.md (+15 −9)
README.md CHANGED
````diff
@@ -6,18 +6,24 @@ pipeline_tag: text-generation
 base_model:
 - Qwen/Qwen3-30B-A3B-Thinking-2507
 ---
-# Qwen3-30B-A3B-Thinking-2507-AWQ
+# Qwen3-30B-A3B-Thinking-2507-AWQ-4bit
 
 ## Method
-Quantised using [vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor.git), [nvidia/Llama-Nemotron-Post-Training-Dataset](https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset) and the following configs:
+[vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor.git) and [nvidia/Llama-Nemotron-Post-Training-Dataset](https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset) were used to quantize the original model. For further details on the quantization arguments and configuration, please see [config.json](https://huggingface.co/cpatonn/Qwen3-30B-A3B-Thinking-2507-AWQ-4bit/blob/main/config.json) and [recipe.yaml](https://huggingface.co/cpatonn/Qwen3-30B-A3B-Thinking-2507-AWQ-4bit/blob/main/recipe.yaml).
+
+## Inference
+Please install the latest vLLM release for better support:
 ```
-recipe = [
-    AWQModifier(
-        ignore=["lm_head", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"],
-        scheme="W4A16",
-        targets=["Linear"],
-    ),
-]
+pip install -U vllm
+```
+
+Qwen3-30B-A3B-Thinking-2507-AWQ-4bit example usage:
+```
+vllm serve cpatonn/Qwen3-30B-A3B-Thinking-2507-AWQ-4bit \
+    --dtype float16 \
+    --tensor-parallel-size 4 \
+    --enable-auto-tool-choice \
+    --tool-call-parser hermes
 ```
 
 # Qwen3-30B-A3B-Thinking-2507
````
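The Python recipe removed by this commit can also be expressed in the YAML recipe format that llm-compressor accepts. The `AWQModifier` arguments below are taken from the removed lines of the diff; the stage and group key names are an assumption about how the linked recipe.yaml is laid out, so treat that file as authoritative:

```yaml
# Sketch of an llm-compressor YAML recipe equivalent to the removed
# Python recipe. Stage/group names (quant_stage, quant_modifiers) are
# assumed; the modifier arguments come from the diff above.
quant_stage:
  quant_modifiers:
    AWQModifier:
      ignore: ["lm_head", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"]
      scheme: W4A16
      targets: ["Linear"]
```

The `ignore` patterns keep the MoE router gates and the output head in full precision, which is the usual practice when quantizing mixture-of-experts models.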
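Once the server started by the `vllm serve` command above is up, it exposes an OpenAI-compatible REST API. A minimal stdlib-only client sketch; the host/port and `max_tokens` value are assumptions (vLLM's defaults), not part of the original README:

```python
import json
from urllib import request

# `vllm serve` listens on localhost:8000 by default and exposes
# OpenAI-compatible routes such as /v1/chat/completions.
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL_ID = "cpatonn/Qwen3-30B-A3B-Thinking-2507-AWQ-4bit"


def build_payload(prompt: str) -> dict:
    """Build an OpenAI-style chat-completion request body."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,  # illustrative cap, adjust as needed
    }


def ask(prompt: str) -> str:
    """POST the prompt to the running vLLM server and return the reply text."""
    req = request.Request(
        BASE_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# With the server running:
#   ask("Briefly explain AWQ quantization.")
```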