4-bit

#1
by prashanthellina - opened

Hi @btbtyler09 , Thanks for uploading this 8-bit version. I've been looking for a GPTQ/AWQ version of this model to run with SGlang. Any chance you can upload a 4-bit version (either GPTQ/AWQ)?

Thank you!

I will try that tonight. I'm currently trying to make sure this one works ok with vLLM on my machine. Check back sometime tomorrow. Thanks!

I actually can't seem to get this model to run on my machine, so I probably won't make the 4-bit version yet. It may be a problem with my own setup and not the model, but I'd rather not upload another one until I can verify it's working correctly.

A couple of days back I tried https://www.modelscope.cn/models/swift/Qwen3-30B-A3B-AWQ, but the model produced gibberish (a sequence of punctuation marks). Perhaps it has something to do with the model itself when quantized.

Someone just copied that same model over to huggingface - https://huggingface.co/cognitivecomputations/Qwen3-30B-A3B-AWQ

Regarding the GPTQ quantization for MoE, it seems the model definition is missing the packed_modules_mapping attribute. I think this is related to vLLM, which has not defined it for this model yet. Related: https://github.com/vllm-project/vllm/issues/17337
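
For context, here is a hedged sketch of what that attribute looks like on a vLLM model class; the class name and the exact mapping below are illustrative rather than the real Qwen3 MoE definition:

    # Illustrative only: in vLLM, a model class can declare packed_modules_mapping
    # so quantization code knows which checkpoint projections were fused into a
    # single parameter. The class below is a stand-in, not vLLM's actual model.
    import torch.nn as nn

    class Qwen3MoeSketch(nn.Module):
        packed_modules_mapping = {
            "qkv_proj": ["q_proj", "k_proj", "v_proj"],
            "gate_up_proj": ["gate_proj", "up_proj"],
        }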

@btbtyler09 were you able to run this model in vllm? If yes, which version of vllm are you using? I am still seeing this error above with the latest vllm.

@MLDataScientist

I actually tested it today on a recent build of vLLM and still have issues. If I find time this weekend, I may try to dig into it again. vLLM supposedly added support for Qwen3 MoE, so I want to get this working if possible.

I have also been testing CK flash attention on MI100s. It still fails some tests, but it seems to work on the models I've tested. It bumped Qwen3 32B from ~30 tok/s to 40 for me, so it might be worth a try if you are hoping for a faster model. That model has really been my go-to lately, but I know a lot of people like this MoE version.

@MLDataScientist

I was able to get it working with some changes to vLLM. If you want to try it out check out my changes here:
https://github.com/btbtyler09/vllm-gfx908/commit/7d08cd63965d603238b8d72025d7a1e86381c8b5

I'm going to work on a config file for the MI100s to go with it, and then I'll try to submit an issue report to nlzy to fix their mi50 fork.

@btbtyler09 thank you!
I will try your changes once I get back home from work.

What TG speed are you getting with 1xMI100?

Regarding the config files, I found out there is a script, benchmark_moe.py, that generates the MoE config for a given model and GPU.
Initially, it did not work out of the box, so I modified benchmark_moe.py. In the main function, I added the second line below:

    config = get_config(model=args.model, trust_remote_code=args.trust_remote_code)
    config = config.__dict__  # this fixes the missing config issue

There is also code that saves the config to a JSON file. It uses the GPU name as the JSON filename, but MI50 cards show up as "MI50/MI60" and a filename cannot contain that forward slash. So I also fixed the save code by changing it to:

    filename = filename.replace("/", "_")  # this line is added
    with open(filename, "w") as f:
        json.dump(configs, f, indent=4)
        f.write("\n")

Then I ran this command to generate the config file for the MI50:

python3 benchmark_moe.py --model /media/ml-ai/wd_2t/models/Qwen3-30B-A3B-gptq-8bit --tp-size 4 --tune --batch_size 1

It generated the config for batch size 1 but running all different batch sizes would take hours.

Nonetheless, I could not use this config since I was stuck with this error in vllm:

File "/media/ml-ai/wd_2t/vllmenv/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 927, in __init__

    assert quant_method is not None

           ^^^^^^^^^^^^^^^^^^^^^^^^ 
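
My guess is that the assert fires because, before your patch, the quant config's get_quant_method only handles plain linear layers, so a FusedMoE layer gets None back. A simplified, hedged illustration of that control flow (not vLLM's actual source):

    from typing import Optional

    class LinearBase: ...   # stand-in for vLLM's LinearBase layer class
    class FusedMoE: ...     # stand-in for vLLM's FusedMoE layer class

    class QuantConfigBeforePatch:
        """Illustration: returns a quant method for linear layers, None otherwise."""
        def get_quant_method(self, layer, prefix: str) -> Optional[object]:
            if isinstance(layer, LinearBase):
                return object()   # stands in for a real GPTQLinearMethod
            return None           # a FusedMoE layer falls through to here

    # FusedMoE.__init__ then does roughly:
    #   self.quant_method = quant_config.get_quant_method(self, prefix)
    #   assert self.quant_method is not None   # <- the failure above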

I hope your change will fix this vllm issue.

Thanks!

Regarding Qwen3-32B: with tp=4, Qwen3-32B-AWQ (4-bit) gives me 38 t/s TG; tp=2 gives 30 t/s. Interestingly, GPTQ int8 is not far behind: I get 32.3 t/s for int8 with tp=4. I agree 32B is much better, but for quick responses the 30B MoE is still a good trade-off. In Vulkan llama.cpp I was getting 60 t/s for Q4_1 with 2xMI50; ROCm was 57 t/s (but with better PP, ~300 t/s, compared to Vulkan).

My main goal is to run Qwen3 MoE 235B-A22B AWQ in vLLM. llama.cpp with ROCm runs Q4_1 at 20 t/s TG (200 t/s PP) with 6xMI50, so I'm hoping to get a better result with vLLM. The AWQ 4-bit should fit into 128GB VRAM (4xMI50 with tensor parallelism, although with lower context).

Thanks for the information on the benchmark_moe script. I ran that and got a config for the MI100, but it failed to get through CUDA graph generation. I had to tweak the settings a bit to make them a little more conservative, but it works now.

For single concurrency I get ~45 tok/s using V0 and CK Flash Attention. On the V1 engine with Triton flash I'm getting ~47. I need to do further benchmarking to see what the best configurations are for this model.

I might try 235B, but I haven't gotten AWQ working on MI100s yet. I need to look at nlzy's version and see if I can do that. I'm only getting ~12 tok/s using llama.cpp with some experts offloaded, though I was testing with long-context settings.

Interesting. I need to try the V1 engine, but I'm not sure if Triton FA supports the MI50. Can you please share the CK flash attention repo and commit that works for you so that I can compile and test it? Thanks!

Actually, I might have found your CK FA. Is this the one: https://github.com/btbtyler09/flash-attention-gfx908 (commit 1ad5a80)?

Yeah. All I did was add gfx908 to the setup.py list so it will allow the build. It will fail some tests, but it still seems to function OK on the models I've tested. Unfortunately it only works on the V0 engine, and you'll still get warnings about Triton even though CK Flash is being used; I have fixed some of that in my fork of vLLM. I think nlzy's MI50 repo says V1 has issues on MI50s, but in the latest versions of vLLM it seems to work pretty well. Triton might be a little behind Flash in support for things like SWA, but that probably doesn't matter for most models.
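
For reference, the change is roughly of this shape; the variable name and the other entries here are illustrative, so check setup.py in the fork for the exact list:

    # Illustrative only: the gist of the setup.py change in the fork --
    # gfx908 (MI100) added to the GPU architectures the ROCm build accepts.
    SUPPORTED_ARCHS = {"gfx90a", "gfx942", "gfx908"}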

@btbtyler09 , this repo's 8bit gptq worked with your changes!

However, the official 4bit GPTQ version is throwing an error:

(VllmWorkerProcess pid=6653) INFO 06-21 07:11:37 [gptq.py:125] Using MoeWNA16 method for GPTQ MoE layer 'model.layers.47.mlp.experts'
ERROR 06-21 07:11:38 [engine.py:458] 'layers.0.mlp.gate.weight'
ERROR 06-21 07:11:38 [engine.py:458] Traceback (most recent call last):
...
ERROR 06-21 07:11:38 [engine.py:458]   File "/media/ml-ai/wd_2t/vllmenv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_moe.py", line 469, in load_weights
ERROR 06-21 07:11:38 [engine.py:458]     param = params_dict[name]
ERROR 06-21 07:11:38 [engine.py:458]             ~~~~~~~~~~~^^^^^^
ERROR 06-21 07:11:38 [engine.py:458] KeyError: 'layers.0.mlp.gate.weight'

Not sure if the gate weight needs to be excluded.
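
As a quick check, the checkpoint's quantization config can be printed to see whether the gate is listed as excluded; the path below is a placeholder for wherever the official 4-bit GPTQ checkpoint is stored:

    import json

    # Placeholder path to the downloaded official 4-bit GPTQ checkpoint.
    with open("/path/to/Qwen3-30B-A3B-GPTQ-Int4/config.json") as f:
        cfg = json.load(f)

    quant_cfg = cfg.get("quantization_config", {})
    # Prints None if the checkpoint does not carry an exclusion list at all.
    print(quant_cfg.get("modules_to_not_convert"))
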
I also tested vLLM V1 with Triton FA, and it has similar performance to V0. In fact, at high concurrency I am getting 5% better results with V1 Triton FA vs ROCm FA.

Running the command below with tp=2 gives me ~30 t/s for the 8-bit variant:

VLLM_USE_TRITON_FLASH_ATTN=1 VLLM_USE_V1=1 vllm serve /media/ml-ai/wd_2t/models/Qwen3-30B-A3B-gptq-8bit/  --disable-log-requests --max-model-len 8192 -tp 2  --dtype float16

> I think nlzy's mi50 repo says V1 has issues on MI50s, but in the latest versions of vLLM it seems to work pretty well.

Yeah, back in vLLM 0.9.1, the upstream hadn't merged the full CUDA graph feature for Triton unified attention into the main branch yet, which caused a pretty big performance drop compared to V0. But now they've merged that feature upstream, so using V1 on MI50 is totally fine.

@MLDataScientist I was able to get Qwen's 4 bit model to load with some more changes. You can find them here: https://github.com/btbtyler09/vllm-gfx908/tree/feature/mi100-moe-fixes

Unfortunately that model produced infinite !!!! for me... I tried some different config settings, but I couldn't find one that worked. I decided to make a 4-bit version using my normal GPTQ settings, and that one seems to work OK: https://huggingface.co/btbtyler09/Qwen3-30B-A3B-gptq-4bit

I think it may have to do with group size, but I'm not 100% sure. Qwen's model used a group size of 128, and I've had similar issues with other group sizes. I quantized mine at 32 and that seems to work.
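
If anyone wants to reproduce a group-size-32 quant, here is a minimal sketch assuming the GPTQModel library and a generic calibration set; it is not necessarily the exact pipeline or settings I used for the upload:

    from datasets import load_dataset
    from gptqmodel import GPTQModel, QuantizeConfig

    # Base model and output path are placeholders.
    model_id = "Qwen/Qwen3-30B-A3B"
    out_path = "Qwen3-30B-A3B-gptq-4bit"

    # Small generic calibration set; real runs typically use more samples.
    calibration = load_dataset(
        "allenai/c4", data_files="en/c4-train.00001-of-01024.json.gz", split="train"
    ).select(range(512))["text"]

    # group_size=32 instead of the more common 128 discussed above.
    quant_config = QuantizeConfig(bits=4, group_size=32)

    model = GPTQModel.load(model_id, quant_config)
    model.quantize(calibration, batch_size=1)
    model.save(out_path)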

Thank you for sharing! I will try it out soon.

Can you please tell me how to get started learning about ROCm GPU architecture and fixing vLLM to work with different models? I have a programming background, but no specific background in ROCm or vLLM. Basically, can you share some books, links, or articles that helped you get started? Thanks!

To be honest, I have been using Claude Code a lot to figure this vLLM stuff out. My background is aerodynamics, and I started building a RAG tool for our business about a year ago. Learning vLLM became a bit of a necessity, and reviewing the diffs in nlzy's repo helped me learn more about the framework of vLLM. Along the way I just had to keep researching all the various posts and blogs about using ROCm (including some of yours).

A lot of it is very frustrating. Most repos are full of incorrect flags pushing ROCm hardware down bad execution paths. Most of what I've done is just tracing through the execution paths to figure out which operations are causing issues. It seems like Triton now has good support for ROCm, so sending operations there instead of using custom kernels seems to be OK. Things are improving, and I think ROCm 7 may be a big step toward making all these projects easier to develop for cross-platform compatibility. I'm far from an expert in this stuff and still learning as I go. I definitely can't write something like this from scratch, but I have been considering an attempt to write a basic CFD solver with support for my GPUs. Maybe eventually I'll get there.

That is impressive! Yeah, I think it is best to learn from existing repos like nlzy's, use LLMs to check whether a fix works or fails, and then debug the code to find out what is causing the failure. I never have time to learn ROCm / vLLM from scratch, so I am looking for resources on the key moving parts and concepts.
I have not tried ROCm 7 yet. I am afraid it will break my system, since the latest version of ROCm officially deprecated support for the MI50/60.
Thanks for sharing the details!

Hello @btbtyler09 ,

I was finally able to work on AWQ support for Qwen3 235B-A22B MoE in vLLM. It was straightforward. I applied edits similar to the ones you shared for GPTQ: https://github.com/btbtyler09/vllm-gfx908/commit/7d08cd63965d603238b8d72025d7a1e86381c8b5.
I edited vllm-gfx906/vllm/model_executor/layers/quantization/awq.py. Here is the new get_quant_method function:

    def get_quant_method(self, layer: torch.nn.Module,
                         prefix: str) -> Optional["QuantizeMethodBase"]:
        if isinstance(layer, FusedMoE):
            # AWQ MoE support: fall back to MoeWNA16 for broad compatibility
            from .moe_wna16 import MoeWNA16Config
            
            logger.info(f"Using MoeWNA16 method for AWQ MoE layer '{prefix}'")
            config = {
                "quant_method": "awq",
                "bits": self.weight_bits,
                "group_size": self.group_size,
                "zero_point": True,
                "modules_to_not_convert": ["mlp.gate", "lm_head"],
                "version": "gemm",
                "num_experts": 128
            }
            return MoeWNA16Config.from_config(config).get_quant_method(
                layer, prefix)
        if isinstance(layer, LinearBase):
            if is_layer_skipped_awq(prefix, self.modules_to_not_convert):
                return UnquantizedLinearMethod()
            return AWQLinearMethod(self)
        return None

Qwen3 235B-A22B AWQ is 124GB in size. It fits on 4xMI50 32GB with GPU memory utilization set to 0.99. I tested with 8192 context and it worked fine, though I have not tested larger contexts. The TG speed is not great, actually: I am getting 5 t/s TG and 90 t/s PP, so there is definitely room for optimization. In llama.cpp, Q4_1 on 4xMI50 with some experts offloaded to CPU RAM gave me 10 t/s TG. So AWQ should be faster, since it is entirely in GPU VRAM.

@MLDataScientist

That's very cool. I need to see if I can run that on my MI100s.

Have you looked into vllm's expert parallel flag? I think that could speed up MoE inference, but I haven't tried it yet.

https://docs.vllm.ai/en/latest/serving/expert_parallel_deployment.html

Yes, I tried enabling the --enable-expert-parallel flag, but I found out it does not work with 4 GPUs. You need 8 GPUs to use it, as described here: https://huggingface.co/QuantTrio/Qwen3-235B-A22B-Thinking-2507-GPTQ-Int4-Int8Mix:

> When launching with 8 GPUs, --enable-expert-parallel must be specified; otherwise, the expert tensors cannot be evenly split across tensor parallel ranks. This option is not required for 4-GPU setups.

The link you shared has some interesting commands, e.g.:

    # Tensor parallelism across 1 GPU, data parallelism across 8 processes,
    # with expert parallelism enabled
    vllm serve deepseek-ai/DeepSeek-V3-0324 \
        --tensor-parallel-size 1 \
        --data-parallel-size 8 \
        --enable-expert-parallel

For Qwen3 235B AWQ, I could try --data-parallel-size 4 with tp=1. I have not tried tp=1 yet; I assumed it is only valid when you have enough VRAM to fit the entire model, but based on the example above that is not true. Not sure if it is going to work with 4xMI50, though. I will try it later today.

I did look into EP a bit and realized that the install scripts for the necessary prereqs only target NVIDIA. It seems AMD's approach has been to deliver custom MoE kernels via their AITER project... leaving us out of support again. I'm going to see if I can build AITER for gfx908 and check whether it passes the tests or benefits inference at all.

Data parallel looks like it's for putting the entire model on each GPU. I'm not sure that's what we need, but it may be useful for speeding up smaller-model inference at high user counts.

Official vLLM will support quantized Qwen3 MoE models soon: https://github.com/vllm-project/vllm/issues/22001
BTW, running AWQ-quantized models on MI100 is super slow, as it uses Triton AWQ. May I know if you have a native AWQ kernel to accelerate AWQ models on MI100?

@hnhyzz , I am using this repo: https://github.com/nlzy/vllm-gfx906. It uses GPTQ kernels for AWQ as well, so you get GPTQ-level performance on AWQ models. GPTQ kernels are faster for my use case.

@MLDataScientist were you able to get MoE working with MI50s? If so, could you share your branch?

I applied this patch to vllm-gfx906, but I am getting memory errors. I suspect this is because I also merged main from vLLM.

@anikifoss ,
Yes, I was able to run Qwen3 MoE with MI50s in vLLM (both AWQ and GPTQ). However, the speed was not as good as llama.cpp: e.g. Qwen3 30B-A3B GPTQ at tp=2 gives me ~30 t/s for the 8-bit variant (the model from this HF repo), and the official Qwen3 235B-A22B AWQ runs on 4xMI50 32GB at 5 t/s, which is slower than llama.cpp at 10 t/s. I don't have access to my PC right now, but I will commit the branch once I get back to it this weekend.
Meanwhile, what I recommend is installing vLLM from this repo - https://github.com/nlzy/vllm-gfx906 - and applying this patch https://github.com/btbtyler09/vllm-gfx908/commit/7d08cd63965d603238b8d72025d7a1e86381c8b5 within your vLLM installation.
