AWQ version

#1
by Benasd - opened

Will there be an AWQ version of the 30B-A3B model?

Well, I tried AWQ quantization earlier, but it failed. cognitivecomputations/Qwen3-30B-A3B-AWQ also failed to launch with vLLM, so I suspect there are still bugs in AWQ support for Qwen3-MoE models.
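For anyone who wants to reproduce the attempt, the usual AutoAWQ recipe looks roughly like the sketch below. This is not the exact script referenced above: the model path, output path, and quant_config are assumptions, and the Qwen3-MoE expert layers are the most likely place it breaks.

# Sketch only: standard AutoAWQ 4-bit recipe; paths and quant_config are assumptions,
# and Qwen3-MoE expert layers may be exactly where this fails.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen3-30B-A3B"   # source model (assumed)
quant_path = "Qwen3-30B-A3B-AWQ"    # output directory (assumed)
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.quantize(tokenizer, quant_config=quant_config)  # calibration runs here
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)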

cognitivecomputations/Qwen3-30B-A3B-AWQ is a re-upload of the official AWQ quant from ModelScope.

It works for me with vLLM, but it's slower than llama.cpp, so I don't recommend it.

I had to do the following to get it working, and note that tensor_parallel_size 4 won't work:

conda create -n vllm2 python=3.10 -y
conda activate vllm2
pip install 'vllm<0.8.5' # πŸ‘ˆ 0.8.5 seems to have a bug that crashes when loading the model

pip list | grep transformers
# transformers  4.51.3 # πŸ‘ˆ This version or newer is required for Qwen3

CUDA_VISIBLE_DEVICES=0,1,2,4 vllm serve cognitivecomputations/Qwen3-30B-A3B-AWQ --port 8080 --max-model-len 16384 -pp 2 -tp 2 # πŸ‘ˆ Works
CUDA_VISIBLE_DEVICES=0,1,2,4 vllm serve cognitivecomputations/Qwen3-30B-A3B-AWQ --port 8080 --max-model-len 16384 -pp 4 # πŸ‘ˆ Works but only 80t/s
CUDA_VISIBLE_DEVICES=0,1 vllm serve cognitivecomputations/Qwen3-30B-A3B-AWQ --port 8080 --max-model-len 16384 -tp 2 # πŸ‘ˆ Works
CUDA_VISIBLE_DEVICES=0,1,2,4 vllm serve cognitivecomputations/Qwen3-30B-A3B-AWQ --port 8080 --max-model-len 16384 -tp 4 # ❌ Fails

The 235B AWQ also works with -tp 2 -pp 3 (6x RTX 3090s).
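Once one of the working serve commands above is up, a quick smoke test against the OpenAI-compatible endpoint looks like the snippet below; the port and model id simply mirror the commands above, and the prompt is just an example.

# Quick check of the vLLM OpenAI-compatible server started above (port 8080).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")  # vLLM ignores the key

resp = client.chat.completions.create(
    model="cognitivecomputations/Qwen3-30B-A3B-AWQ",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)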
