Model support with vLLM
Hi, I wanted to know whether this model is supported by vLLM, as I've been having trouble with it.
This is the command I'm using on my L4 GPU:
vllm serve SoybeanMilk/Kimi-VL-A3B-Thinking-2506-BNB-4bit \
  --host 0.0.0.0 \
  --port 8003 \
  --trust-remote-code \
  --gpu-memory-utilization 0.90 \
  --max-model-len 6144 \
  --disable-mm-preprocessor-cache \
  --limit-mm-per-prompt '{"image": 1, "video": 0}'
I believe MoE models with BNB 4-bit quantization are not yet supported by vLLM, as I keep hitting this error:
AssertionError: quant_method is not None
This also matches this recent discussion:
https://discuss.vllm.ai/t/moe-quantization/594
0.10.0 support!!!
Hi @palindromeRice05,
Thanks for reaching out and for trying out the model! I'm the author of this 4-bit quantized model, and I just wanted to confirm that you're absolutely correct: when I tried it about a month ago, vLLM still did not support MoE (Mixture of Experts) models quantized with bitsandbytes (BNB) 4-bit.
I think this limitation is on vLLM's side, not in your setup.
That said, I've prepared a working Colab notebook that demonstrates how to successfully load and run the model using standard transformers + bitsandbytes, with full inference support:
https://colab.research.google.com/drive/1WAebQWzWmHGVlL2mi3rukWpw1195W4AC?usp=sharing
While vLLM doesn't support this combination yet, the Colab shows a reliable alternative for deployment or testing.
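In case the notebook link is inconvenient, here is a rough sketch of what that transformers + bitsandbytes path looks like. It assumes transformers, accelerate, bitsandbytes, and pillow are installed, and follows the usual AutoModelForCausalLM + AutoProcessor chat-template pattern; the image path, prompt, and exact preprocessing calls are placeholders and may differ slightly from what the Colab actually does:

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "SoybeanMilk/Kimi-VL-A3B-Thinking-2506-BNB-4bit"

# The 4-bit bitsandbytes config is stored in the checkpoint itself, so no
# extra BitsAndBytesConfig is needed; trust_remote_code pulls in the
# Kimi-VL modeling/processing code.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Single-image chat turn; "example.png" and the question are placeholders.
image = Image.open("example.png")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "example.png"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
# Drop the prompt tokens so only the newly generated text is decoded.
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])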