---
license: apache-2.0
datasets:
  - mit-han-lab/pile-val-backup
base_model:
  - Qwen/Qwen3-32B
pipeline_tag: text-generation
library_name: transformers
---

Qwen3-32B-AWQ-Pile

Qwen3-AWQ Highlights

  • Open-source. The calibration data, evaluation tools, and quantization algorithms are fully open-source.
  • Precision. Achieves accuracy comparable to the BF16 model.
  • Process. Detailed quantization and evaluation workflows are provided for easy reproducibility.
  • Performance. The AutoQuant kernel released in vLLM delivers better performance than the Marlin kernel.

Model Overview

Qwen3-32B has the following features:

  • Type: Causal Language Models
  • Training Stage: Pretraining & Post-training
  • Number of Parameters: 32.8B
  • Number of Parameters (Non-Embedding): 31.2B
  • Number of Layers: 64
  • Number of Attention Heads (GQA): 64 for Q and 8 for KV
  • Context Length: 32,768 tokens natively, extendable to 131,072 tokens with YaRN.
  • Quantization: AWQ 4-bit

For more details, including benchmark evaluation and inference performance, please refer to our GitHub.

Quantization

  • Calibration data

The model quantization process uses the Pile dataset for calibration. You can obtain the data from https://huggingface.co/datasets/mit-han-lab/pile-val-backup/.
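
For reference, the calibration split can be loaded with the Hugging Face datasets library. This is a minimal sketch; the split name and the sample count of 512 are illustrative assumptions, not part of this repo's pipeline:

```python
from datasets import load_dataset

# Pile validation set used for AWQ calibration (assumes the "validation" split name).
calib_ds = load_dataset("mit-han-lab/pile-val-backup", split="validation")

# Keep only the raw text field; a few hundred samples are typically enough for calibration.
calib_texts = [sample["text"] for sample in calib_ds.select(range(512))]
```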

  • Quantization algorithm

Two quantization algorithms are supported: AWQ and GPTQ. We modified the AutoAWQ and AutoGPTQ frameworks for this purpose, and the modified versions can be used directly; a sketch of the AWQ path is shown below.
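
The following is a minimal sketch using the upstream AutoAWQ API rather than our modified framework; the model and output paths are placeholders, and the quant_config values mirror the quantization_config shipped with this repo:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen3-32B"       # base model (placeholder: local path or HF repo id)
quant_path = "Qwen3-32B-AWQ-Pile"   # output directory for the quantized checkpoint

# 4-bit weights, group size 128, zero-point, GEMM kernels: matches this repo's quantization_config.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# "pileval" is AutoAWQ's built-in alias for mit-han-lab/pile-val-backup;
# a list of raw text strings (such as calib_texts above) can be passed instead.
model.quantize(tokenizer, quant_config=quant_config, calib_data="pileval")

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```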

Evaluation

For deployment, we use vllm==0.8.5 and create an OpenAI-compatible API endpoint:

non-think:

```bash
VLLM_USE_MODELSCOPE=True CUDA_VISIBLE_DEVICES=0,1 vllm serve /model --gpu-memory-utilization 0.9 --served-model-name Qwen3-32B --trust_remote_code --port 48001 --tensor-parallel-size 2
```

think:

```bash
VLLM_USE_MODELSCOPE=True CUDA_VISIBLE_DEVICES=0,1 vllm serve /model --gpu-memory-utilization 0.9 --served-model-name Qwen3-32B --trust_remote_code --port 48001 --tensor-parallel-size 2 --enable-reasoning --reasoning-parser deepseek_r1
```

Sampling parameters are set to match https://huggingface.co/Qwen/Qwen3-32B#best-practices.
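
As a minimal client sketch (not part of the evaluation harness), the served endpoint can be queried with the OpenAI Python client. The sampling values below are the think-mode settings from the linked best-practices section (temperature 0.6, top_p 0.95, top_k 20); non-think mode uses temperature 0.7 and top_p 0.8:

```python
from openai import OpenAI

# Points at the vLLM server started above (port 48001 from the serve command).
client = OpenAI(base_url="http://localhost:48001/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen3-32B",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    temperature=0.6,
    top_p=0.95,
    max_tokens=2048,
    # top_k is not a standard OpenAI field, so it is passed through vLLM's extra_body.
    extra_body={"top_k": 20},
)
print(response.choices[0].message.content)
```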

To facilitate testing and reproducibility, we use the open-source evalscope tool to evaluate the accuracy of both the bfloat16 (BF16) and quantized models.

```bash
git clone https://github.com/modelscope/evalscope.git
cd evalscope/
git checkout -b v0.17.0 tags/v0.17.0
pip install -e .
```
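
Below is a minimal sketch of driving an evaluation against the vLLM endpoint started above through evalscope's Python API. The field names follow the evalscope documentation and may differ between versions, and the dataset list and sample limit are illustrative only:

```python
from evalscope import TaskConfig, run_task

# Evaluate the OpenAI-compatible service served by vLLM on port 48001.
# Field names (eval_type, api_url, api_key) follow the evalscope docs and may vary by version;
# some versions expect the full ".../v1/chat/completions" URL.
task_cfg = TaskConfig(
    model="Qwen3-32B",
    eval_type="service",
    api_url="http://localhost:48001/v1",
    api_key="EMPTY",
    datasets=["gsm8k", "ifeval"],
    limit=50,  # subsample for a quick smoke test; drop for full runs
)

run_task(task_cfg=task_cfg)
```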

Performance

Benchmarks

All test results were obtained on the following hardware:

  • 4x NVIDIA A100-40G GPUs
  • 2x NVIDIA H800-80G GPUs

| Model | Mode | MATH-500 | AIME 2024 | AIME 2025 | MMLU-Redux | GPQA-Diamond | C-Eval | GSM8K | IFEval | IQuiz | TriviaQA | CMMLU | MMLU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-32B-BF16 (paper) | think | 97.2 | 81.4 | 72.9 | 90.9 | 68.4 | 87.3 | – | 85.0 | – | – | – | – |
| Qwen3-32B-BF16 (paper) | non-think | 88.6 | 31.0 | 20.2 | 85.7 | 54.6 | 83.3 | – | 83.2 | – | – | – | – |
| Qwen3-32B-BF16 (self-test) | think | 96.0 | 80.0 | 66.67 | 89.04 | 68.18 | 88.63 | 92.72 | 87.92 | 84.17 | 81.43 | 87.25 | 87.02 |
| Qwen3-32B-BF16 (self-test) | non-think | 85.2 | 26.67 | 16.67 | 86.09 | 55.05 | 85.81 | 89.01 | 87.50 | 80.83 | 75.21 | 85.48 | 83.25 |
| Qwen3-32B-Pile (AWQ) | think | 95.2 | 80.0 | 70.0 | 88.51 | 69.7 | 88.26 | 93.71 | 85.07 | 83.33 | 80.08 | 86.6 | 86.45 |
| Qwen3-32B-Pile (AWQ) | non-think | 84.6 | 30.0 | 16.67 | 85.03 | 56.57 | 86.03 | 89.54 | 86.72 | 79.17 | 73.5 | 84.54 | 82.7 |

Inference Performance

  • 2 x A100-40GB
  • vLLM 0.8.5

"To use AutoQuant, simply modify the config.json file as shown below:

"quantization_config": {
    "bits": 4,
    "group_size": 128,
    "modules_to_not_convert": null,
    "quant_method": "autoquant",  // change from "awq" to "autoquant"
    "version": "gemm",
    "zero_point": true
  },
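
The same switch can also be applied programmatically; a minimal sketch, assuming /model is the checkpoint directory passed to vllm serve above:

```python
import json

config_path = "/model/config.json"  # config of the downloaded quantized checkpoint

with open(config_path) as f:
    config = json.load(f)

# Switch the kernel selection from the default "awq" to "autoquant".
config["quantization_config"]["quant_method"] = "autoquant"

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```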

The benchmark scripts are the ones shipped in the vLLM repository's benchmarks/ directory:

```bash
# throughput
CUDA_VISIBLE_DEVICES=0,1 python3 benchmark_throughput.py --model /model --input-len 1024 --output-len 1024 -tp 2 --max-model-len 40960 --num-prompts 100

# latency
CUDA_VISIBLE_DEVICES=0,1 python3 benchmark_latency.py --model /model --num-iters-warmup 10 --num-iters 50 --batch-size 16 --input-len 512 --output-len 512 -tp 2
```

  • Throughput (tokens/s)

| Kernel | Type | in/out=512 | in/out=1024 | in/out=2048 | in/out=4096 |
| --- | --- | --- | --- | --- | --- |
| awq_marlin | total | 2153.85 | 1875.67 | 1310.74 | 910.41 |
| awq_marlin | output | 1046.28 | 910.15 | 638.11 | 438.71 |
| autoquant | total | 2453.12 | 2111.43 | 1416.66 | 963.93 |
| autoquant | output | 1198.05 | 1024.29 | 689.29 | 469.88 |

  • Average latency (seconds)

| Kernel | Batch size | in/out=128 | in/out=512 | in/out=1024 | in/out=2048 |
| --- | --- | --- | --- | --- | --- |
| awq_marlin | 16 | 2.4654 | 10.1091 | 21.3455 | 47.7168 |
| awq_marlin | 64 | 4.8633 | 20.8356 | 47.3302 | 170.8086 |
| autoquant | 16 | 2.3916 | 9.9021 | 21.0006 | 46.9298 |
| autoquant | 64 | 4.7231 | 20.2468 | 46.0811 | 168.4375 |