4-bit
Hi @btbtyler09, thanks for uploading this 8-bit version. I've been looking for a GPTQ/AWQ version of this model to run with SGLang. Any chance you can upload a 4-bit version (either GPTQ or AWQ)?
Thank you!
I will try that tonight. I'm currently trying to make sure this one works ok with vLLM on my machine. Check back sometime tomorrow. Thanks!
I actually can't seem to get this model to run on my machine, so I probably won't make the 4-bit version yet. It may be a problem with my own setup and not the model, but I'd rather not upload another one until I can verify it's working correctly.
A couple days back I tried https://www.modelscope.cn/models/swift/Qwen3-30B-A3B-AWQ but the model produced gibberish (a sequence of punctuation marks). Perhaps it has something to do with how the model behaves when quantized.
Someone just copied that same model over to huggingface - https://huggingface.co/cognitivecomputations/Qwen3-30B-A3B-AWQ
Regarding the GPTQ quantization for MoE, it seems like the model is missing the packed_modules_mapping definition. I think this is on the vLLM side, which hasn't defined it for this model yet. Related: https://github.com/vllm-project/vllm/issues/17337
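For context, vLLM model classes normally expose a packed_modules_mapping class attribute that the quantization code uses to translate fused projection names back to the original checkpoint modules. A minimal sketch of its shape, borrowed from the dense Qwen/Llama models (the exact mapping vLLM ends up needing for Qwen3 MoE may differ):

# Sketch only: the shape of the mapping, not the definitive Qwen3 MoE definition.
packed_modules_mapping = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],  # fused attention projections
    "gate_up_proj": ["gate_proj", "up_proj"],    # fused MLP projections
}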
@btbtyler09 were you able to run this model in vllm? If yes, which version of vllm are you using? I am still seeing this error above with the latest vllm.
I actually tested it today on a recent build of vLLM and still have issues. If I find time this weekend, I may try to dig into it again. vLLM supposedly added support for Qwen3 MoE, so I want to get this working if possible.
I have also been testing CK flash attention on MI100s. It still fails some tests, but it seems to work on the models I've tested. It bumped Qwen3 32B from ~30 tok/s to ~40 for me. Might be worth a try if you are hoping for a faster model. That model has really been my go-to lately, but I know a lot of people like this MoE version.
I was able to get it working with some changes to vLLM. If you want to try it out check out my changes here:
https://github.com/btbtyler09/vllm-gfx908/commit/7d08cd63965d603238b8d72025d7a1e86381c8b5
I'm going to work on a config file for the MI100s to go with it, and then I'll try to submit an issue report to nlzy to fix their mi50 fork.
@btbtyler09
thank you!
I will try your changes once I get back home from work.
What TG speed are you getting with 1xMI100?
Regarding the config files, I found out there is a script to generate a MoE config for each model and GPU. The script is called benchmark_moe.py.
Initially, it did not work out of the box, so I changed benchmark_moe.py. In the main function, I added the second line below:
config = get_config(model=args.model, trust_remote_code=args.trust_remote_code)
config = config.__dict__ # this will fix the missing config issue.
There is also code where it saves the config to a JSON file. It uses the card name for the JSON filename, but MI50 cards show up as MI50/MI60, and there is no way to save a filename containing that forward slash. So I also fixed the save code by changing it to:
filename = filename.replace("/", "_")  # this line is added here
with open(filename, "w") as f:
    json.dump(configs, f, indent=4)
    f.write("\n")
Then I ran this command to generate the config file for the MI50:
python3 benchmark_moe.py --model /media/ml-ai/wd_2t/models/Qwen3-30B-A3B-gptq-8bit --tp-size 4 --tune --batch_size 1
It generated the config for batch size 1, but running all the different batch sizes would take hours.
Nonetheless, I could not use this config since I was stuck with this error in vllm:
File "/media/ml-ai/wd_2t/vllmenv/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 927, in __init__
assert quant_method is not None
^^^^^^^^^^^^^^^^^^^^^^^^
I hope your change will fix this vllm issue.
Thanks!
Regarding Qwen3-32B: with tp=4, Qwen3-32B-AWQ (4-bit) gives me TG of 38 t/s; tp=2 gets 30 t/s. Interestingly, GPTQ int8 is not far behind: I get 32.3 t/s for int8 with tp=4. I agree 32B is much better, but for quick responses the 30B MoE is still a good trade-off. In llama.cpp with Vulkan I was getting 60 t/s for Q4_1 on 2xMI50; ROCm was 57 t/s (but with better PP, 300 t/s, compared to Vulkan).
My main goal is to run Qwen3 MoE 235B-A22B AWQ in vLLM. llama.cpp with ROCm runs Q4_1 at TG 20 t/s (PP 200 t/s) on 6xMI50. I'm hoping to get a better result with vLLM; AWQ 4-bit should fit into 128 GB VRAM (4xMI50 for tp, although with lower context).
Thanks for the information on the benchmark_moe script. I ran that and got a config for the MI100, but it failed to get through CUDA graph generation. I had to tweak the settings a bit to make them a little more conservative, but it works now.
For single concurrency I get ~45 tok/s using V0 and CK Flash Attention. On the V1 engine with Triton flash I'm getting ~47. I need to do further benchmarking to see what the best configurations are for this model.
I might try 235B, but I haven't gotten AWQ working on MI100s yet. I need to look at nlzy's version and see if I can do that. I'm only getting ~12 tok/s using llama.cpp with some experts offloaded. I was testing with long context settings though.
Interesting. I need to try V1 engine but not sure if Triton FA supports MI50. Can you please share the CK flash attention repo and commit that works for you so that I can compile and test it? Thanks!
Actually, I might have found your CK FA. Is this the one? https://github.com/btbtyler09/flash-attention-gfx908 (commit 1ad5a80)
Yeah. All I did was add gfx908 to the setup.py list so it allows the build. It will fail some tests, but it still seems to function OK on the models I've tested. Unfortunately it only works on the V0 engine, and you'll still get warnings about Triton even though CK Flash is being used. I have fixed some of that in my fork of vLLM. I think nlzy's MI50 repo says V1 has issues on MI50s, but in the latest versions of vLLM it seems to work pretty well. Triton might be a little behind flash in support for things like SWA, but that probably doesn't matter for most models.
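For reference, something like the following should force the V0 engine and skip the Triton attention path so the CK flash build gets used (the model path is just a placeholder; adjust flags for your setup):
VLLM_USE_V1=0 VLLM_USE_TRITON_FLASH_ATTN=0 vllm serve /path/to/Qwen3-30B-A3B-gptq-8bit --dtype float16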
@btbtyler09 , this repo's 8bit gptq worked with your changes!
However, the official 4bit GPTQ version is throwing an error:
(VllmWorkerProcess pid=6653) INFO 06-21 07:11:37 [gptq.py:125] Using MoeWNA16 method for GPTQ MoE layer 'model.layers.47.mlp.experts'
ERROR 06-21 07:11:38 [engine.py:458] 'layers.0.mlp.gate.weight'
ERROR 06-21 07:11:38 [engine.py:458] Traceback (most recent call last):
...
ERROR 06-21 07:11:38 [engine.py:458] File "/media/ml-ai/wd_2t/vllmenv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_moe.py", line 469, in load_weights
ERROR 06-21 07:11:38 [engine.py:458] param = params_dict[name]
ERROR 06-21 07:11:38 [engine.py:458] ~~~~~~~~~~~^^^^^^
ERROR 06-21 07:11:38 [engine.py:458] KeyError: 'layers.0.mlp.gate.weight'
Not sure if the gate weight needs to be excluded.
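As a quick sanity check, something along these lines (the path is a placeholder) would list which mlp.gate tensors the 4-bit checkpoint actually contains, which should show whether the router gate was quantized or just named differently than vLLM expects:

# Sketch: inspect the GPTQ checkpoint's weight names for the MoE router gate.
from pathlib import Path
from safetensors import safe_open

model_dir = Path("/path/to/Qwen3-30B-A3B-GPTQ-Int4")  # placeholder path
for shard in sorted(model_dir.glob("*.safetensors")):
    with safe_open(str(shard), framework="pt") as f:
        for key in f.keys():
            if ".mlp.gate." in key:
                print(shard.name, key)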
I also tested vLLM V1 with Triton FA and it has similar performance to V0. In fact, at high concurrency I am getting 5% better results with V1 Triton FA vs ROCm FA.
Running the command below with tp=2 gives me ~30 t/s for the 8-bit variant:
VLLM_USE_TRITON_FLASH_ATTN=1 VLLM_USE_V1=1 vllm serve /media/ml-ai/wd_2t/models/Qwen3-30B-A3B-gptq-8bit/ --disable-log-requests --max-model-len 8192 -tp 2 --dtype float16
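For what it's worth, a rough way to check single-request TG against the OpenAI-compatible endpoint is something like this (port 8000 and the served model name are assumptions based on the defaults for the command above):

# Sketch: rough single-request tokens/sec measurement against vllm serve.
import time
import requests

url = "http://localhost:8000/v1/completions"  # default vllm serve port, adjust if needed
payload = {
    "model": "/media/ml-ai/wd_2t/models/Qwen3-30B-A3B-gptq-8bit/",  # served model name
    "prompt": "Explain mixture-of-experts routing in two paragraphs.",
    "max_tokens": 512,
    "temperature": 0,
}
start = time.time()
resp = requests.post(url, json=payload, timeout=600).json()
elapsed = time.time() - start
tokens = resp["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")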