AWQ version
by Benasd
Will there be an AWQ version of the 30B-A3B model?
Well, I have tried AWQ quantization before, but it failed. cognitivecomputations/Qwen3-30B-A3B-AWQ also failed to launch with vLLM, so I guess there might be some bugs in AWQ support for the Qwen3-MoE models.
cognitivecomputations/Qwen3-30B-A3B-AWQ is a re-upload of the official AWQ from ModelScope.
It works for me with vLLM, but it's slower than llama.cpp, so I don't recommend it.
I had to do the following to get it working, and note that tensor_parallel_size 4 won't work:
conda create -n vllm2 python=3.10 -y
conda activate vllm2
pip install 'vllm<0.8.5'  # 0.8.5 seems to have a bug causing it to crash loading the model
pip list | grep transformers
# transformers 4.51.3  -- this version or newer is required for Qwen3
CUDA_VISIBLE_DEVICES=0,1,2,4 vllm serve cognitivecomputations/Qwen3-30B-A3B-AWQ --port 8080 --max-model-len 16384 -pp 2 -tp 2  # works
CUDA_VISIBLE_DEVICES=0,1,2,4 vllm serve cognitivecomputations/Qwen3-30B-A3B-AWQ --port 8080 --max-model-len 16384 -pp 4  # works, but only 80 t/s
CUDA_VISIBLE_DEVICES=0,1 vllm serve cognitivecomputations/Qwen3-30B-A3B-AWQ --port 8080 --max-model-len 16384 -tp 2  # works
CUDA_VISIBLE_DEVICES=0,1,2,4 vllm serve cognitivecomputations/Qwen3-30B-A3B-AWQ --port 8080 --max-model-len 16384 -tp 4  # fails
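Once one of the working configurations above is up, you can sanity-check it against vLLM's OpenAI-compatible endpoint. A minimal sketch (port and model name match the commands above; the prompt is just a placeholder):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "cognitivecomputations/Qwen3-30B-A3B-AWQ", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 64}'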
The 235B AWQ also works with -tp 2 -pp 3 (6x RTX 3090s).
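For reference, that 235B launch would look roughly like this (a sketch only; the checkpoint path is a placeholder, substitute whichever 235B-A22B AWQ repo you are using):

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 vllm serve path/to/Qwen3-235B-A22B-AWQ --port 8080 --max-model-len 16384 -tp 2 -pp 3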