I get errors trying to deploy this in vllm or sglang. In vllm:
vllm serve models/Qwen3-235B-A22B-GPTQ-Int4 --served-model-name Q3-235B -tp 4 --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072 --host 0.0.0.0 --port 8000 --enable-auto-tool-choice --tool-call-parser hermes
INFO 05-12 11:02:37 [gpu_model_runner.py:1347] Model loading took 28.9794 GiB and 23.962294 seconds
(VllmWorker rank=1 pid=1051482) INFO 05-12 11:02:37 [gpu_model_runner.py:1347] Model loading took 28.9794 GiB and 23.978429 seconds
(VllmWorker rank=3 pid=1051484) INFO 05-12 11:02:37 [gpu_model_runner.py:1347] Model loading took 28.9794 GiB and 23.995874 seconds
(VllmWorker rank=0 pid=1051481) INFO 05-12 11:02:37 [gpu_model_runner.py:1347] Model loading took 28.9794 GiB and 24.045420 seconds
(VllmWorker rank=3 pid=1051484) ERROR 05-12 11:02:37 [multiproc_executor.py:470] WorkerProc hit an exception.
(VllmWorker rank=3 pid=1051484) ERROR 05-12 11:02:37 [multiproc_executor.py:470] Traceback (most recent call last):
(VllmWorker rank=3 pid=1051484) ERROR 05-12 11:02:37 [multiproc_executor.py:470] File "/home/ioplex/conda_envs/vllm-reg/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 962, in step
(VllmWorker rank=3 pid=1051484) ERROR 05-12 11:02:37 [multiproc_executor.py:470] self.dispatch_table[inst.opcode](self, inst)
(VllmWorker rank=3 pid=1051484) ERROR 05-12 11:02:37 [multiproc_executor.py:470] File "/home/ioplex/conda_envs/vllm-reg/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 659, in wrapper
(VllmWorker rank=3 pid=1051484) ERROR 05-12 11:02:37 [multiproc_executor.py:470] return inner_fn(self, inst)
(VllmWorker rank=3 pid=1051484) ERROR 05-12 11:02:37 [multiproc_executor.py:470] ^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=1051484) ERROR 05-12 11:02:37 [multiproc_executor.py:470] File "/home/ioplex/conda_envs/vllm-reg/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 2341, in CALL
(VllmWorker rank=3 pid=1051484) ERROR 05-12 11:02:37 [multiproc_executor.py:470] self._call(inst)
(VllmWorker rank=3 pid=1051484) ERROR 05-12 11:02:37 [multiproc_executor.py:470] File "/home/ioplex/conda_envs/vllm-reg/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 2335, in _call
(VllmWorker rank=3 pid=1051484) ERROR 05-12 11:02:37 [multiproc_executor.py:470] self.call_function(fn, args, kwargs)
(VllmWorker rank=3 pid=1051484) ERROR 05-12 11:02:37 [multiproc_executor.py:470] File "/home/ioplex/conda_envs/vllm-reg/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 897, in call_function
(VllmWorker rank=3 pid=1051484) ERROR 05-12 11:02:37 [multiproc_executor.py:470] self.push(fn.call_function(self, args, kwargs)) # type: ignore[arg-type]
(VllmWorker rank=3 pid=1051484) ERROR 05-12 11:02:37 [multiproc_executor.py:470] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=1051484) ERROR 05-12 11:02:37 [multiproc_executor.py:470] File "/home/ioplex/conda_envs/vllm-reg/lib/python3.12/site-packages/torch/_dynamo/variables/nn_module.py", line 914, in call_function
(VllmWorker rank=3 pid=1051484) ERROR 05-12 11:02:37 [multiproc_executor.py:470] return variables.UserFunctionVariable(fn, source=source).call_function(
(VllmWorker rank=3 pid=1051484) ERROR 05-12 11:02:37 [multiproc_executor.py:470] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=1051484) ERROR 05-12 11:02:37 [multiproc_executor.py:470] File "/home/ioplex/conda_envs/vllm-reg/lib/python3.12/site-packages/torch/_dynamo/variables/functions.py", line 317, in call_function
(VllmWorker rank=3 pid=1051484) ERROR 05-12 11:02:37 [multiproc_executor.py:470] return super().call_function(tx, args, kwargs)
(VllmWorker rank=3 pid=1051484) ERROR 05-12 11:02:37 [multiproc_executor.py:470] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=1051484) ERROR 05-12 11:02:37 [multiproc_executor.py:470] File "/home/ioplex/conda_envs/vllm-reg/lib/python3.12/site-packages/torch/_dynamo/variables/functions.py", line 118, in call_function
(VllmWorker rank=3 pid=1051484) ERROR 05-12 11:02:37 [multiproc_executor.py:470] return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
(VllmWorker rank=3 pid=1051484) ERROR 05-12 11:02:37 [multiproc_executor.py:470] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=1051484) ERROR 05-12 11:02:37 [multiproc_executor.py:470] File "/home/ioplex/conda_envs/vllm-reg/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 903, in inline_user_function_return
(VllmWorker rank=3 pid=1051484) ERROR 05-12 11:02:37 [multiproc_executor.py:470] return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
And in sglang:
INFO 05-12 12:49:23 [gptq_marlin.py:238] Using MarlinLinearKernel for GPTQMarlinLinearMethod
[2025-05-12 12:49:23 TP0] Scheduler hit an exception: Traceback (most recent call last):
File "/home/ioplex/conda_envs/sglang-latest/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 2230, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, pp_rank, dp_rank)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ioplex/conda_envs/sglang-latest/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 274, in init
self.tp_worker = TpWorkerClass(
^^^^^^^^^^^^^^
File "/home/ioplex/conda_envs/sglang-latest/lib/python3.12/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 64, in init
self.worker = TpModelWorker(
^^^^^^^^^^^^^^
File "/home/ioplex/conda_envs/sglang-latest/lib/python3.12/site-packages/sglang/srt/managers/tp_worker.py", line 78, in init
self.model_runner = ModelRunner(
^^^^^^^^^^^^
File "/home/ioplex/conda_envs/sglang-latest/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 191, in init
self.initialize(min_per_gpu_memory)
File "/home/ioplex/conda_envs/sglang-latest/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 206, in initialize
self.load_model()
File "/home/ioplex/conda_envs/sglang-latest/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 468, in load_model
self.model = get_model(
^^^^^^^^^^
File "/home/ioplex/conda_envs/sglang-latest/lib/python3.12/site-packages/sglang/srt/model_loader/init.py", line 22, in get_model
return loader.load_model(
^^^^^^^^^^^^^^^^^^
File "/home/ioplex/conda_envs/sglang-latest/lib/python3.12/site-packages/sglang/srt/model_loader/loader.py", line 372, in load_model
model = _initialize_model(
^^^^^^^^^^^^^^^^^^
File "/home/ioplex/conda_envs/sglang-latest/lib/python3.12/site-packages/sglang/srt/model_loader/loader.py", line 153, in _initialize_model
return model_class(
^^^^^^^^^^^^
File "/home/ioplex/conda_envs/sglang-latest/lib/python3.12/site-packages/sglang/srt/models/qwen3_moe.py", line 328, in init
self.model = Qwen3MoeModel(
^^^^^^^^^^^^^^
File "/home/ioplex/conda_envs/sglang-latest/lib/python3.12/site-packages/sglang/srt/models/qwen3_moe.py", line 307, in init
super().init(
I was able to get this to work in vllm, using the latest version:
export VLLM_USE_V1=0  # disable the V1 engine
vllm==0.8.5
transformers==4.51.3
There is a bug in this file: .../vllm/model_executor/layers/quantization/gptq_marlin.py
Replace it with https://huggingface.co/getfit/Marlin-VLLM-Fix/blob/main/gptq_marlin.py
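A minimal sketch of applying that fix, assuming a standard pip install (the site-packages path is environment-specific, and the $GPTQ_MARLIN variable is my placeholder; note the raw-file download uses resolve/ in place of blob/):

pip install vllm==0.8.5 transformers==4.51.3
# locate the installed copy of gptq_marlin.py
GPTQ_MARLIN=$(python -c "import vllm, os; print(os.path.join(os.path.dirname(vllm.__file__), 'model_executor', 'layers', 'quantization', 'gptq_marlin.py'))")
# back up the original, then drop in the patched file
cp "$GPTQ_MARLIN" "$GPTQ_MARLIN.bak"
wget -O "$GPTQ_MARLIN" https://huggingface.co/getfit/Marlin-VLLM-Fix/resolve/main/gptq_marlin.py
# run on the V0 engine
export VLLM_USE_V1=0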
Thanks @getfit, I had the exact same problem, also coming from trying with 4x A6000 Ada, and your solution works perfectly.
Though it slows down significantly in my case, from 6x t/s at zero context to maybe 15 t/s after just a small amount of context.
Update:
With help from ChatGPT, I got a config that lets me maintain >50 t/s even when the context gets very long (I haven't counted exactly, but at least 10K+ tokens):
vllm serve Qwen/Qwen3-235B-A22B-GPTQ-Int4 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --max-model-len 131072 \
  --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
  --max-seq-len-to-capture 131072 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 2048 \
  --port 8000
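Once it's up, a quick sanity check against the OpenAI-compatible endpoint (my example; since no --served-model-name is passed, the model id defaults to the path given to vllm serve):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-235B-A22B-GPTQ-Int4", "messages": [{"role": "user", "content": "Say hello."}], "max_tokens": 32}'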
Update: also adding the KV-cache FP8 quantization flag:
--kv-cache-dtype fp8_e5m2
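For reference, the full command with that flag folded in (my assembly of the two snippets above, not separately benchmarked):

vllm serve Qwen/Qwen3-235B-A22B-GPTQ-Int4 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --max-model-len 131072 \
  --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
  --max-seq-len-to-capture 131072 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 2048 \
  --kv-cache-dtype fp8_e5m2 \
  --port 8000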