Doesn't work with vLLM
Hey, thanks for the upload!
It doesn't seem to work with vLLM when following the README. Does it require building vLLM from master?
This repo can be served with vLLM's default build. If it won't run on your machine, please:
(1) make sure you really created a fresh Python venv;
(2) try vllm serve with the original Qwen3-VL-30B-A3B and see if that works; if not, your Python/vLLM environment is not installed correctly (a quick sketch of that check is below).
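In case it's useful, here's a minimal sketch of that sanity check. The model ID Qwen/Qwen3-VL-30B-A3B-Instruct and the short context length are my assumptions for the test, not values from this repo's README:
# Hypothetical sanity check: serve the unquantized base model first.
# If this also fails, the problem is the environment, not this AWQ repo.
vllm serve \
Qwen/Qwen3-VL-30B-A3B-Instruct \
--max-model-len 8192 \
--gpu-memory-utilization 0.90 \
--port 8000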
Yeah, the issue was uv pip; switching to pip (plus installing a few other dependencies) fixed it.
Unfortunately it runs out of VRAM on an L40S (48GB), while the 4-bit version should normally fit on a 24GB GPU.
Glad to hear that
"Unfortunately it runs out of vram on the L40s (48GB)"
Have you tried setting "--limit-mm-per-prompt.video 0" as suggested from vLLM Qwen3-VL Usage Guide
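For a 24GB card, something along these lines might be worth trying. This is only a rough sketch; the reduced --max-model-len and --gpu-memory-utilization values are illustrative guesses, not something I have benchmarked:
# Assumed example: disable video inputs and shrink the context window to cut VRAM use.
vllm serve \
PATH/QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ \
--limit-mm-per-prompt.video 0 \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 \
--port 8000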
I happen to have an SM89 48GB device next to me, so I just ran a quick check.
Install using:
uv venv
source .venv/bin/activate
# Install vLLM >=0.11.0
uv pip install -U vllm
# Install Qwen-VL utility library (recommended for offline inference)
uv pip install qwen-vl-utils==0.0.14
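Optionally, confirm which version actually got installed (just a quick check):
# Should print 0.11.0 or newer
python -c "import vllm; print(vllm.__version__)"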
Serve using:
export TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1
export OMP_NUM_THREADS=4
vllm serve \
PATH/QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ \
--served-model-name MY_MODEL \
--swap-space 4 \
--max-num-seqs 8 \
--max-model-len 262144 \
--gpu-memory-utilization 0.95 \
--tensor-parallel-size 1 \
--distributed-executor-backend mp \
--trust-remote-code \
--disable-log-requests \
--host 0.0.0.0 \
--port 8000
No issues whatsoever on my end. Just for your reference.
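If it helps, here's a quick way to sanity-check the endpoint once the server is up. This is only a sketch, reusing the --served-model-name MY_MODEL and port 8000 from the command above:
# Minimal text-only request against vLLM's OpenAI-compatible API
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MY_MODEL",
"messages": [{"role": "user", "content": "Describe this model in one sentence."}],
"max_tokens": 64
}'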