Model support with vLLM
Hi, I wanted to know whether this model is supported by vLLM, as I've been having trouble with it.
This is the command I'm using on my L4 GPU:
vllm serve SoybeanMilk/Kimi-VL-A3B-Thinking-2506-BNB-4bit \
  --host 0.0.0.0 \
  --port 8003 \
  --trust-remote-code \
  --gpu-memory-utilization 0.90 \
  --max-model-len 6144 \
  --disable-mm-preprocessor-cache \
  --limit-mm-per-prompt '{"image": 1, "video": 0}'
I believe MoE models with BNB 4-bit quantization are not yet supported by vLLM, as I keep hitting this error:
AssertionError: quant_method is not None
This also matches this recent discussion:
https://discuss.vllm.ai/t/moe-quantization/594
0.10.0 support!!!
Hi @palindromeRice05,
Thanks for reaching out and for trying out the model! I'm the author of this 4-bit quantized model, and I just wanted to confirm that you're absolutely correct: when I tried it about a month ago, vLLM still did not support MoE (Mixture of Experts) models quantized with bitsandbytes (BNB) 4-bit.
I think this limitation is on vLLM's side, not in your setup.
That said, I've prepared a working Colab notebook that demonstrates how to successfully load and run the model using standard transformers + bitsandbytes, with full inference support:
https://colab.research.google.com/drive/1WAebQWzWmHGVlL2mi3rukWpw1195W4AC?usp=sharing
While vLLM doesn't support this combination yet, the Colab shows a reliable alternative for deployment or testing.
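In case the notebook link is inconvenient, here is a rough sketch of what that transformers + bitsandbytes path looks like. It assumes transformers, accelerate, bitsandbytes, and pillow are installed, and follows the usual AutoModelForCausalLM + AutoProcessor chat-template pattern; the image path, prompt, and exact preprocessing calls are placeholders and may differ slightly from what the Colab actually does:

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "SoybeanMilk/Kimi-VL-A3B-Thinking-2506-BNB-4bit"

# The 4-bit bitsandbytes config is stored in the checkpoint itself, so no
# extra BitsAndBytesConfig is needed; trust_remote_code pulls in the
# Kimi-VL modeling/processing code.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Single-image chat turn; "example.png" and the question are placeholders.
image = Image.open("example.png")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "example.png"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
# Drop the prompt tokens so only the newly generated text is decoded.
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])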