AWQ quantized model support timeline?
#12 · by hyunw55 · opened
I've been using Qwen models extensively. Are there any plans to support AWQ-quantized models for Qwen3? I missed having a simultaneous AWQ release this time around.
Looking forward to your continued development work.
Thank you
Here is a quantized version, feel free to use it: https://modelscope.cn/models/swift/Qwen3-30B-A3B-AWQ (unofficial version)
Thanks @study-hjt, I just got it working. It takes about 17GB of VRAM just to load the model, plus as much extra VRAM as you have available for additional context and parallel slots:
CUDA_VISIBLE_DEVICES="0" \
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
VLLM_USE_MODELSCOPE=True \
vllm serve swift/Qwen3-30B-A3B-AWQ \
--gpu-memory-utilization 0.9 \
--max-model-len 32768 \
--max-num-seqs 64 \
--served-model-name swift/Qwen3-30B-A3B-AWQ \
--host 127.0.0.1 \
--port 8080
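
Once it's up, vLLM exposes an OpenAI-compatible API on the host/port above. A minimal sanity check from Python could look roughly like this (the api_key value is a dummy since the server was started without authentication, and the prompt is just a placeholder):

from openai import OpenAI

# Point the client at the local vLLM server started above.
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="swift/Qwen3-30B-A3B-AWQ",  # must match --served-model-name
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)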
If you're trying to make your own AWQ, this thread might be helpful: https://github.com/vllm-project/llm-compressor/issues/1406
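The issue linked above is about doing it through llm-compressor; for comparison, the older AutoAWQ flow is sketched below. The model path, output path, and quant_config values are the usual AutoAWQ defaults rather than anything from this thread, and I haven't verified that AutoAWQ handles the Qwen3 MoE layout, so treat it as a starting point, not a recipe:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Source checkpoint and output directory are placeholders.
model_path = "Qwen/Qwen3-30B-A3B"
quant_path = "Qwen3-30B-A3B-AWQ"

# Typical 4-bit AWQ settings: group size 128, zero-point, GEMM kernels.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, use_cache=False)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Calibrate and quantize, then save in a layout vLLM can load directly.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)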