
Could you please consider releasing a 2-bit GPTQ or AWQ model for vLLM

#1
by kq - opened

I have spent two days trying to deploy this GGUF in vLLM, only to find that even the latest pre-release vLLM cannot support GGUF for this model architecture, even though it is already supported in GPTQ or AWQ format:
(APIServer pid=59168) Value error, Model architectures ['Qwen3MoeForCausalLM'] failed to be inspected. Please check the logs for more details. [type=value_error, input_value=ArgsKwargs((), {'model': ...gits_processors': None}), input_type=ArgsKwargs]

I do not want to switch to llama.cpp for toolchain reasons.
In the past, it was impossible to get reasonable performance from a 2-bit model. However, after AutoRound, it's now worth a serious try.
I have always been an AutoRound advocate; could you please consider making a 2-bit model for vLLM?
What I have is 4x RTX 3090 24GB = 96GB. The 2-bit DeepSeek 0528 is impossibly large for me.

Thank you very much.

Due to kernel limitations, CUDA 2-bit models are much slower than 4-bit and currently only support the FP16 dtype, which can lead to accuracy issues. This is why we prioritize GGUF q2k-s.

1. You can try AutoRound in RTN mode (iters=0) to quickly check whether the model meets your needs. RTN is calibration-free and very fast, but at 2 bits it may cause significant accuracy loss.

2. In practice, we have successfully quantized a 235B model on a single 80GB card. You can also quantize the model yourself; see our README for instructions on using device_map to distribute across multiple GPUs. A rough sketch of both options is shown below.
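As a minimal sketch only: this is roughly what a 2-bit RTN run could look like with the AutoRound Python API. The checkpoint name, group_size, output path, and the device_map="auto" loading are placeholder assumptions, not the maintainers' exact recipe; check the AutoRound README for the authoritative arguments, which may differ between auto-round versions.

```python
# Hypothetical sketch: 2-bit RTN quantization with AutoRound (iters=0 selects RTN mode).
# Model name, group_size, and output path are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen3-235B-A22B"  # placeholder checkpoint

# device_map="auto" spreads the FP16 weights across the available GPUs;
# the AutoRound README describes device_map options for multi-GPU quantization.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# iters=0 = RTN mode: calibration-free and fast, but accuracy may drop noticeably at 2 bits.
autoround = AutoRound(model, tokenizer, bits=2, group_size=64, iters=0)
autoround.quantize()

# Export to a GPTQ-style checkpoint that vLLM can load.
autoround.save_quantized("./Qwen3-235B-A22B-int2-rtn", format="auto_gptq")
```

If the export succeeds, the resulting checkpoint could then be served with something like `vllm serve ./Qwen3-235B-A22B-int2-rtn --tensor-parallel-size 4`, assuming your vLLM version supports the exported quantization format.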

Thank you very much. I will give it a try.

Intel org

You’re welcome. If any issues come up on your side, we’ll try to generate the model when we’re available, typically on weekends.
