---
license: apache-2.0
datasets:
  - mit-han-lab/pile-val-backup
base_model:
  - Qwen/Qwen3-32B
pipeline_tag: text-generation
library_name: transformers
---

Qwen3-32B-AWQ-Pile

Qwen3-AWQ Highlights

  • Open-source. The calibration data, evaluation tools, and quantization algorithms are fully open-source.
  • Precision. Achieves accuracy comparable to the BF16 model.
  • Process. Detailed quantization and evaluation workflows are provided for easy reproducibility.
  • Performance. The AutoQuant kernel released in vLLM delivers better performance than the Marlin kernel.

Model Overview

Qwen3-32B has the following features:

  • Type: Causal Language Models
  • Training Stage: Pretraining & Post-training
  • Number of Parameters: 32.8B
  • Number of Parameters (Non-Embedding): 31.2B
  • Number of Layers: 64
  • Number of Attention Heads (GQA): 64 for Q and 8 for KV
  • Context Length: 32,768 tokens natively, extendable to 131,072 tokens with YaRN.
  • Quantization: AWQ 4-bit

For more details, including benchmark evaluation and inference performance, please refer to our GitHub.

Quantization

  • Calibration data

The model quantization process uses the Pile dataset for calibration. You can obtain the data from https://huggingface.co/datasets/mit-han-lab/pile-val-backup/.
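
For reference, the calibration split can be loaded with the Hugging Face datasets library. This is a minimal sketch; the split name and the sample count of 512 are illustrative assumptions, not part of this repo's pipeline:

```python
from datasets import load_dataset

# Pile validation set used for AWQ calibration (assumes the "validation" split name).
calib_ds = load_dataset("mit-han-lab/pile-val-backup", split="validation")

# Keep only the raw text field; a few hundred samples are typically enough for calibration.
calib_texts = [sample["text"] for sample in calib_ds.select(range(512))]
```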

  • Quantization algorithm

Two quantization algorithms are supported: AWQ and GPTQ. We modified the AutoAWQ and AutoGPTQ frameworks for this purpose, and the modified versions can be used directly; a sketch of the AWQ path is shown below.
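
The following is a minimal sketch using the upstream AutoAWQ API rather than our modified framework; the model and output paths are placeholders, and the quant_config values mirror the quantization_config shipped with this repo:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen3-32B"       # base model (placeholder: local path or HF repo id)
quant_path = "Qwen3-32B-AWQ-Pile"   # output directory for the quantized checkpoint

# 4-bit weights, group size 128, zero-point, GEMM kernels: matches this repo's quantization_config.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# "pileval" is AutoAWQ's built-in alias for mit-han-lab/pile-val-backup;
# a list of raw text strings (such as calib_texts above) can be passed instead.
model.quantize(tokenizer, quant_config=quant_config, calib_data="pileval")

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```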

Evaluation

For deployment, we use vllm==0.8.5 and create an OpenAI-compatible API endpoint:

non-think:

```bash
VLLM_USE_MODELSCOPE=True CUDA_VISIBLE_DEVICES=0,1 vllm serve /model --gpu-memory-utilization 0.9 --served-model-name Qwen3-32B --trust_remote_code --port 48001 --tensor-parallel-size 2
```

think:

```bash
VLLM_USE_MODELSCOPE=True CUDA_VISIBLE_DEVICES=0,1 vllm serve /model --gpu-memory-utilization 0.9 --served-model-name Qwen3-32B --trust_remote_code --port 48001 --tensor-parallel-size 2 --enable-reasoning --reasoning-parser deepseek_r1
```

Sampling parameters are set to match https://huggingface.co/Qwen/Qwen3-32B#best-practices.
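
As a minimal client sketch (not part of the evaluation harness), the served endpoint can be queried with the OpenAI Python client. The sampling values below are the think-mode settings from the linked best-practices section (temperature 0.6, top_p 0.95, top_k 20); non-think mode uses temperature 0.7 and top_p 0.8:

```python
from openai import OpenAI

# Points at the vLLM server started above (port 48001 from the serve command).
client = OpenAI(base_url="http://localhost:48001/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen3-32B",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    temperature=0.6,
    top_p=0.95,
    max_tokens=2048,
    # top_k is not a standard OpenAI field, so it is passed through vLLM's extra_body.
    extra_body={"top_k": 20},
)
print(response.choices[0].message.content)
```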

To facilitate testing and reproducibility, we use the open-source evalscope tool to evaluate the accuracy of both the bfloat16 (BF16) and quantized models.

```bash
git clone https://github.com/modelscope/evalscope.git
cd evalscope/
git checkout -b v0.17.0 tags/v0.17.0
pip install -e .
```
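
Below is a minimal sketch of driving an evaluation against the vLLM endpoint started above through evalscope's Python API. The field names follow the evalscope documentation and may differ between versions, and the dataset list and sample limit are illustrative only:

```python
from evalscope import TaskConfig, run_task

# Evaluate the OpenAI-compatible service served by vLLM on port 48001.
# Field names (eval_type, api_url, api_key) follow the evalscope docs and may vary by version;
# some versions expect the full ".../v1/chat/completions" URL.
task_cfg = TaskConfig(
    model="Qwen3-32B",
    eval_type="service",
    api_url="http://localhost:48001/v1",
    api_key="EMPTY",
    datasets=["gsm8k", "ifeval"],
    limit=50,  # subsample for a quick smoke test; drop for full runs
)

run_task(task_cfg=task_cfg)
```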

Performance

Benchmarks

All test results were obtained on the following hardware:

  • 4x NVIDIA A100-40G GPUs
  • 2x NVIDIA H800-80G GPUs

| Model | Mode | MATH-500 | AIME 2024 | AIME 2025 | MMLU-Redux | GPQA-Diamond | C-Eval | GSM8K | IFEval | IQuiz | TriviaQA | CMMLU | MMLU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-32B-BF16 (paper) | think | 97.2 | 81.4 | 72.9 | 90.9 | 68.4 | 87.3 | – | 85.0 | – | – | – | – |
| Qwen3-32B-BF16 (paper) | non-think | 88.6 | 31.0 | 20.2 | 85.7 | 54.6 | 83.3 | – | 83.2 | – | – | – | – |
| Qwen3-32B-BF16 (self-test) | think | 96.0 | 80.0 | 66.67 | 89.04 | 68.18 | 88.63 | 92.72 | 87.92 | 84.17 | 81.43 | 87.25 | 87.02 |
| Qwen3-32B-BF16 (self-test) | non-think | 85.2 | 26.67 | 16.67 | 86.09 | 55.05 | 85.81 | 89.01 | 87.50 | 80.83 | 75.21 | 85.48 | 83.25 |
| Qwen3-32B-Pile (AWQ) | think | 95.2 | 80.0 | 70.0 | 88.51 | 69.7 | 88.26 | 93.71 | 85.07 | 83.33 | 80.08 | 86.6 | 86.45 |
| Qwen3-32B-Pile (AWQ) | non-think | 84.6 | 30.0 | 16.67 | 85.03 | 56.57 | 86.03 | 89.54 | 86.72 | 79.17 | 73.5 | 84.54 | 82.7 |

Inference Performance

  • 2 x A100-40GB
  • vLLM 0.8.5

"To use AutoQuant, simply modify the config.json file as shown below:

"quantization_config": {
    "bits": 4,
    "group_size": 128,
    "modules_to_not_convert": null,
    "quant_method": "autoquant",  // change from "awq" to "autoquant"
    "version": "gemm",
    "zero_point": true
  },
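
The same switch can also be applied programmatically; a minimal sketch, assuming /model is the checkpoint directory passed to vllm serve above:

```python
import json

config_path = "/model/config.json"  # config of the downloaded quantized checkpoint

with open(config_path) as f:
    config = json.load(f)

# Switch the kernel selection from the default "awq" to "autoquant".
config["quantization_config"]["quant_method"] = "autoquant"

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```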

The benchmark scripts are the ones shipped in the vLLM repository's benchmarks/ directory:

```bash
# throughput
CUDA_VISIBLE_DEVICES=0,1 python3 benchmark_throughput.py --model /model --input-len 1024 --output-len 1024 -tp 2 --max-model-len 40960 --num-prompts 100

# latency
CUDA_VISIBLE_DEVICES=0,1 python3 benchmark_latency.py --model /model --num-iters-warmup 10 --num-iters 50 --batch-size 16 --input-len 512 --output-len 512 -tp 2
```

  • Throughput (tokens/s)

| Kernel | Type | in/out=512 | in/out=1024 | in/out=2048 | in/out=4096 |
| --- | --- | --- | --- | --- | --- |
| awq_marlin | total | 2153.85 | 1875.67 | 1310.74 | 910.41 |
| awq_marlin | output | 1046.28 | 910.15 | 638.11 | 438.71 |
| autoquant | total | 2453.12 | 2111.43 | 1416.66 | 963.93 |
| autoquant | output | 1198.05 | 1024.29 | 689.29 | 469.88 |

  • Average latency (seconds)

| Kernel | Batch size | in/out=128 | in/out=512 | in/out=1024 | in/out=2048 |
| --- | --- | --- | --- | --- | --- |
| awq_marlin | 16 | 2.4654 | 10.1091 | 21.3455 | 47.7168 |
| awq_marlin | 64 | 4.8633 | 20.8356 | 47.3302 | 170.8086 |
| autoquant | 16 | 2.3916 | 9.9021 | 21.0006 | 46.9298 |
| autoquant | 64 | 4.7231 | 20.2468 | 46.0811 | 168.4375 |