How do I deploy this model on vLLM?
For now I am using this command:
vllm serve unsloth/Qwen3-30B-A3B-bnb-4bit --enable-reasoning --reasoning-parser deepseek_r1 --dtype=bfloat16 --gpu_memory_utilization=0.95 --max-model-len=12554 --max-num-seqs=2 --quantization=bitsandbytes --load-format=bitsandbytes
and I am getting this error:
WARNING 04-29 11:45:20 [utils.py:165] The model class Qwen3MoeForCausalLM has not defined packed_modules_mapping, this may lead to incorrect mapping of quantized or ignored modules
ERROR 04-29 11:45:20 [core.py:387] EngineCore hit an exception: Traceback (most recent call last):
ERROR 04-29 11:45:20 [core.py:387] File "/home/selangor/vllm/vllm/v1/engine/core.py", line 378, in run_engine_core
ERROR 04-29 11:45:20 [core.py:387] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 04-29 11:45:20 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 11:45:20 [core.py:387] File "/home/selangor/vllm/vllm/v1/engine/core.py", line 320, in __init__
ERROR 04-29 11:45:20 [core.py:387] super().__init__(vllm_config, executor_class, log_stats)
ERROR 04-29 11:45:20 [core.py:387] File "/home/selangor/vllm/vllm/v1/engine/core.py", line 67, in __init__
ERROR 04-29 11:45:20 [core.py:387] self.model_executor = executor_class(vllm_config)
ERROR 04-29 11:45:20 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 11:45:20 [core.py:387] File "/home/selangor/vllm/vllm/executor/executor_base.py", line 52, in __init__
ERROR 04-29 11:45:20 [core.py:387] self._init_executor()
ERROR 04-29 11:45:20 [core.py:387] File "/home/selangor/vllm/vllm/executor/uniproc_executor.py", line 47, in _init_executor
ERROR 04-29 11:45:20 [core.py:387] self.collective_rpc("load_model")
ERROR 04-29 11:45:20 [core.py:387] File "/home/selangor/vllm/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 04-29 11:45:20 [core.py:387] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 04-29 11:45:20 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 11:45:20 [core.py:387] File "/home/selangor/vllm/vllm/utils.py", line 2378, in run_method
ERROR 04-29 11:45:20 [core.py:387] return func(*args, **kwargs)
ERROR 04-29 11:45:20 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 11:45:20 [core.py:387] File "/home/selangor/vllm/vllm/v1/worker/gpu_worker.py", line 136, in load_model
ERROR 04-29 11:45:20 [core.py:387] self.model_runner.load_model()
ERROR 04-29 11:45:20 [core.py:387] File "/home/selangor/vllm/vllm/v1/worker/gpu_model_runner.py", line 1279, in load_model
ERROR 04-29 11:45:20 [core.py:387] self.model = get_model(vllm_config=self.vllm_config)
ERROR 04-29 11:45:20 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 11:45:20 [core.py:387] File "/home/selangor/vllm/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
ERROR 04-29 11:45:20 [core.py:387] return loader.load_model(vllm_config=vllm_config)
ERROR 04-29 11:45:20 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 11:45:20 [core.py:387] File "/home/selangor/vllm/vllm/model_executor/model_loader/loader.py", line 1289, in load_model
ERROR 04-29 11:45:20 [core.py:387] model = _initialize_model(vllm_config=vllm_config)
ERROR 04-29 11:45:20 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 11:45:20 [core.py:387] File "/home/selangor/vllm/vllm/model_executor/model_loader/loader.py", line 133, in _initialize_model
ERROR 04-29 11:45:20 [core.py:387] return model_class(vllm_config=vllm_config, prefix=prefix)
ERROR 04-29 11:45:20 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 11:45:20 [core.py:387] File "/home/selangor/vllm/vllm/model_executor/models/qwen3_moe.py", line 489, in __init__
ERROR 04-29 11:45:20 [core.py:387] self.model = Qwen3MoeModel(vllm_config=vllm_config,
ERROR 04-29 11:45:20 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 11:45:20 [core.py:387] File "/home/selangor/vllm/vllm/compilation/decorators.py", line 151, in __init__
ERROR 04-29 11:45:20 [core.py:387] old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
ERROR 04-29 11:45:20 [core.py:387] File "/home/selangor/vllm/vllm/model_executor/models/qwen3_moe.py", line 335, in __init__
ERROR 04-29 11:45:20 [core.py:387] self.start_layer, self.end_layer, self.layers = make_layers(
ERROR 04-29 11:45:20 [core.py:387] ^^^^^^^^^^^^
ERROR 04-29 11:45:20 [core.py:387] File "/home/selangor/vllm/vllm/model_executor/models/utils.py", line 610, in make_layers
ERROR 04-29 11:45:20 [core.py:387] maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
ERROR 04-29 11:45:20 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 11:45:20 [core.py:387] File "/home/selangor/vllm/vllm/model_executor/models/qwen3_moe.py", line 337, in <lambda>
ERROR 04-29 11:45:20 [core.py:387] lambda prefix: Qwen3MoeDecoderLayer(config=config,
ERROR 04-29 11:45:20 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 11:45:20 [core.py:387] File "/home/selangor/vllm/vllm/model_executor/models/qwen3_moe.py", line 279, in __init__
ERROR 04-29 11:45:20 [core.py:387] self.mlp = Qwen3MoeSparseMoeBlock(config=config,
ERROR 04-29 11:45:20 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 11:45:20 [core.py:387] File "/home/selangor/vllm/vllm/model_executor/models/qwen3_moe.py", line 114, in __init__
ERROR 04-29 11:45:20 [core.py:387] self.experts = FusedMoE(num_experts=config.num_experts,
ERROR 04-29 11:45:20 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 11:45:20 [core.py:387] File "/home/selangor/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 502, in __init__
ERROR 04-29 11:45:20 [core.py:387] assert self.quant_method is not None
ERROR 04-29 11:45:20 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 11:45:20 [core.py:387] AssertionError
ERROR 04-29 11:45:20 [core.py:387]
CRITICAL 04-29 11:45:20 [core_client.py:359] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
Killed