AWQ quantized model support timeline?

#12
by hyunw55 - opened

I've been using Qwen models extensively. Are there any plans to support AWQ-quantized models for Qwen3? I miss the simultaneous AWQ release we got last time around.
Looking forward to your continued development work.
Thank you

Here is a quantized version, feel free to use it: https://modelscope.cn/models/swift/Qwen3-30B-A3B-AWQ 😊 (Unofficial version)

Thanks @study-hjt, I just got it working. It takes about 17 GB of VRAM just to load the model, plus as much extra VRAM as you can spare for additional context and parallel slots:

CUDA_VISIBLE_DEVICES="0" \
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
VLLM_USE_MODELSCOPE=True \
vllm \
  serve swift/Qwen3-30B-A3B-AWQ \
  --gpu-memory-utilization 0.9 \
  --max-model-len 32768 \
  --max-num-seqs 64 \
  --served-model-name swift/Qwen3-30B-A3B-AWQ \
  --host 127.0.0.1 \
  --port 8080
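
Once it's up, you can talk to the OpenAI-compatible API that vLLM serves. A minimal Python client sketch, assuming the host, port, and served model name from the command above (the API key is a placeholder since no --api-key was set):

from openai import OpenAI

# Point the client at the local vLLM server started above.
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="swift/Qwen3-30B-A3B-AWQ",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)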

If you're trying to make your own AWQ, this thread might be helpful: https://github.com/vllm-project/llm-compressor/issues/1406
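
That issue covers the llm-compressor route; as an alternative, here is a rough sketch of the classic AutoAWQ flow (the `autoawq` package). The paths, group size, and bit width are illustrative assumptions, and an MoE model like Qwen3-30B-A3B may need the extra handling discussed in that thread:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Illustrative paths -- point these at the base checkpoint and the output directory.
model_path = "Qwen/Qwen3-30B-A3B"
quant_path = "./Qwen3-30B-A3B-AWQ"

# Typical 4-bit AWQ settings (group size 128, GEMM kernels); values are examples only.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Calibrate, quantize, and save a checkpoint that vLLM can then load as an AWQ model.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)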
