
Could you please consider releasing a 2-bit GPTQ or AWQ model for vLLM

#1
by kq - opened

I have spent two days trying to deploy this GGUF in vLLM, only to find that even the latest pre-release vLLM cannot support GGUF for this model architecture, even though it is already supported in GPTQ or AWQ format:
(APIServer pid=59168) Value error, Model architectures ['Qwen3MoeForCausalLM'] failed to be inspected. Please check the logs for more details. [type=value_error, input_value=ArgsKwargs((), {'model': ...gits_processors': None}), input_type=ArgsKwargs]

I do not want to switch to llama.cpp for toolchain reasons.
In the past, it was impossible to get reasonable performance from a 2-bit model. However, after AutoRound, it's now worth a serious try.
I have always been an AutoRound advocate; could you please consider making a 2-bit model for vLLM?
What I have is 4x RTX 3090 24GB = 96GB. The 2-bit DeepSeek 0528 is impossibly large for me.

Thank you very much.

Due to kernel limitations, CUDA 2-bit models are much slower than 4-bit and currently only support the FP16 dtype, which can lead to accuracy issues. This is why we prioritize GGUF q2k-s.

1. You can try AutoRound in RTN mode (iters=0) to quickly check whether the model meets your needs. RTN is calibration-free and very fast, but at 2 bits it may cause significant accuracy loss.

2. In practice, we have successfully quantized a 235B model on a single 80GB card. You can also quantize the model yourself; see our README for instructions on using device_map to distribute across multiple GPUs. A rough sketch of both options is shown below.
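As a minimal sketch only: this is roughly what a 2-bit RTN run could look like with the AutoRound Python API. The checkpoint name, group_size, output path, and the device_map="auto" loading are placeholder assumptions, not the maintainers' exact recipe; check the AutoRound README for the authoritative arguments, which may differ between auto-round versions.

```python
# Hypothetical sketch: 2-bit RTN quantization with AutoRound (iters=0 selects RTN mode).
# Model name, group_size, and output path are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen3-235B-A22B"  # placeholder checkpoint

# device_map="auto" spreads the FP16 weights across the available GPUs;
# the AutoRound README describes device_map options for multi-GPU quantization.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# iters=0 = RTN mode: calibration-free and fast, but accuracy may drop noticeably at 2 bits.
autoround = AutoRound(model, tokenizer, bits=2, group_size=64, iters=0)
autoround.quantize()

# Export to a GPTQ-style checkpoint that vLLM can load.
autoround.save_quantized("./Qwen3-235B-A22B-int2-rtn", format="auto_gptq")
```

If the export succeeds, the resulting checkpoint could then be served with something like `vllm serve ./Qwen3-235B-A22B-int2-rtn --tensor-parallel-size 4`, assuming your vLLM version supports the exported quantization format.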

Thank you very much. I will give it a try.

Intel org

You’re welcome. If any issues come up on your side, we’ll try to generate the model when we’re available, typically on weekends.
