Doesn't work with vLLM
Hey, thanks for the upload!
It doesn't seem to work with vLLM when following the README. Does it require building vLLM from master?
This repo can be served with vLLM's default build. If it won't run on your machine, please:
(1) make sure you really created a fresh Python venv;
(2) try vllm serve with the original Qwen3-VL-30B-A3B and see if that works; if not, your Python/vLLM environment is not installed correctly (a quick sketch of that check is below).
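In case it's useful, here's a minimal sketch of that sanity check. The model ID Qwen/Qwen3-VL-30B-A3B-Instruct and the short context length are my assumptions for the test, not values from this repo's README:
# Hypothetical sanity check: serve the unquantized base model first.
# If this also fails, the problem is the environment, not this AWQ repo.
vllm serve \
Qwen/Qwen3-VL-30B-A3B-Instruct \
--max-model-len 8192 \
--gpu-memory-utilization 0.90 \
--port 8000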
Yeah, the issue was uv pip; switching to pip (plus installing a few other dependencies) fixed it.
Unfortunately it runs out of VRAM on an L40S (48GB), while the 4-bit version should normally fit on a 24GB GPU.
Glad to hear that
"Unfortunately it runs out of vram on the L40s (48GB)"
Have you tried setting "--limit-mm-per-prompt.video 0" as suggested from vLLM Qwen3-VL Usage Guide
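For a 24GB card, something along these lines might be worth trying. This is only a rough sketch; the reduced --max-model-len and --gpu-memory-utilization values are illustrative guesses, not something I have benchmarked:
# Assumed example: disable video inputs and shrink the context window to cut VRAM use.
vllm serve \
PATH/QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ \
--limit-mm-per-prompt.video 0 \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 \
--port 8000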
I happen to have an SM89 48GB device next to me, so I just ran a quick check.
Install using:
uv venv
source .venv/bin/activate
# Install vLLM >=0.11.0
uv pip install -U vllm
# Install Qwen-VL utility library (recommended for offline inference)
uv pip install qwen-vl-utils==0.0.14
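Optionally, confirm which version actually got installed (just a quick check):
# Should print 0.11.0 or newer
python -c "import vllm; print(vllm.__version__)"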
Serve using:
export TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1
export OMP_NUM_THREADS=4
vllm serve \
PATH/QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ \
--served-model-name MY_MODEL \
--swap-space 4 \
--max-num-seqs 8 \
--max-model-len 262144 \
--gpu-memory-utilization 0.95 \
--tensor-parallel-size 1 \
--distributed-executor-backend mp \
--trust-remote-code \
--disable-log-requests \
--host 0.0.0.0 \
--port 8000
No issues whatsoever on my end. Just for your reference.
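If it helps, here's a quick way to sanity-check the endpoint once the server is up. This is only a sketch, reusing the --served-model-name MY_MODEL and port 8000 from the command above:
# Minimal text-only request against vLLM's OpenAI-compatible API
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MY_MODEL",
"messages": [{"role": "user", "content": "Describe this model in one sentence."}],
"max_tokens": 64
}'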