Model support with vLLM

by palindromeRice05

Hi, I wanted to know whether this model is supported by vLLM, as I'm having trouble with it.

The command I am using on my L4 GPU:

vllm serve SoybeanMilk/Kimi-VL-A3B-Thinking-2506-BNB-4bit \
  --host 0.0.0.0 \
  --port 8003 \
  --trust-remote-code \
  --gpu-memory-utilization 0.90 \
  --max-model-len 6144 \
  --disable-mm-preprocessor-cache \
  --limit-mm-per-prompt '{"image": 1, "video": 0}'

I believe MoE models with BNB 4-bit quantization are not yet supported by vLLM, as I am persistently encountering this error:

AssertionError: quant_method is not None

This also matches what's reported in this recent discussion:
https://discuss.vllm.ai/t/moe-quantization/594
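
For reference, here is a rough offline-API equivalent of the serve command above (a sketch only; the argument names mirror vLLM's LLM constructor and should hit the same code path):

```python
# Rough offline-API equivalent of the `vllm serve` command above
# (sketch only; argument names mirror vLLM's LLM constructor).
from vllm import LLM

llm = LLM(
    model="SoybeanMilk/Kimi-VL-A3B-Thinking-2506-BNB-4bit",
    trust_remote_code=True,
    gpu_memory_utilization=0.90,
    max_model_len=6144,
    limit_mm_per_prompt={"image": 1},
)
```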

vLLM 0.10.0 support!!!

Hi @palindromeRice05,
Thanks for reaching out and for trying out the model! I'm the author of this 4-bit quantized model, and I just wanted to confirm that you're absolutely correct: when I tried it about a month ago, vLLM still did not support MoE (Mixture of Experts) models quantized with bitsandbytes (BNB) 4-bit.
I think this limitation is on vLLM's side, not in your setup.

That said, I've prepared a working Colab notebook that demonstrates how to successfully load and run the model using standard transformers + bitsandbytes, with full inference support:
👉 https://colab.research.google.com/drive/1WAebQWzWmHGVlL2mi3rukWpw1195W4AC?usp=sharing

While vLLM doesn't support this combination yet, the Colab shows a reliable alternative for deployment or testing.
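
If you just want the gist without opening the notebook, it boils down to something like this (a minimal sketch, assuming the standard Kimi-VL AutoModelForCausalLM / AutoProcessor flow; the image path and prompt are placeholders, and the Colab remains the reference):

```python
# Minimal sketch of loading the pre-quantized checkpoint with
# transformers + bitsandbytes (the Colab above is the reference;
# the image path and prompt here are placeholders).
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "SoybeanMilk/Kimi-VL-A3B-Thinking-2506-BNB-4bit"

# The checkpoint already carries its bitsandbytes 4-bit quantization
# config, so no extra BitsAndBytesConfig is needed; bitsandbytes just
# has to be installed alongside transformers.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("example.jpg")  # placeholder image
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": "example.jpg"},
        {"type": "text", "text": "Describe this image."},
    ]}
]

# Build the chat prompt, run the processor, and generate.
text = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)
inputs = processor(
    images=image, text=text, return_tensors="pt", padding=True
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)

# Strip the prompt tokens before decoding the generated answer.
response = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(response)
```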
