What engine can I use to deploy this model?

#12
by jjovalle99 - opened

What engine can I use to deploy this model?

The new version of vLLM supports this model.

I am getting the error below when I try to use the vLLM Docker image.

File "/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/config.py", line 201, in get_config
raise ValueError(f"No supported config format found in {model}")
ValueError: No supported config format found in Qwen/Qwen2.5-VL-7B-Instruct

This is how I use it @sulaymantaofiq (I'm copying and pasting a thread that I wrote in Slack).

First, to make it work you will need to set up your environment like this (this is specific to uv, but it shouldn't be much different with another dependency manager):

uv venv --python 3.12.8
source .venv/bin/activate
uv pip install vllm # ---> Now there is no need to install from source because of the latest release
uv pip install flash-attn --no-build-isolation # ---> Otherwise it will use xformers; alternatively you can use FlashInfer with uv pip install flashinfer-python
uv pip install "git+https://github.com/huggingface/transformers" # ---> This needs to be the last step, at least for now; once transformers releases a new version, you can just uv pip install transformers

Then specify that you want to use V1:

export VLLM_USE_V1=1

Finally you serve the model:

vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
    --port 8000 \
    --host 0.0.0.0 \
    --dtype bfloat16 \
    --tensor-parallel-size 4 # --> I was using 4xH100 SXM 80GB
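
Once the server is up, you can hit the OpenAI-compatible endpoint directly to make sure everything is wired up (the image URL below is just a placeholder, swap in your own):

```bash
# List the models the server is exposing
curl http://localhost:8000/v1/models

# Minimal multimodal chat request against the OpenAI-compatible API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-VL-72B-Instruct",
        "messages": [
          {"role": "user", "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/test.jpg"}}
          ]}
        ],
        "max_tokens": 128
      }'
```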

Some dumb things that I have learned working with these models:
Limit the multi-modal inputs with --limit-mm-per-prompt, for example:

--limit-mm-per-prompt image=5,video=0
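
In practice that flag just goes on the same vllm serve command, something like:

```bash
# Same serve command as above, but capping each prompt at 5 images and no video
vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
    --port 8000 \
    --host 0.0.0.0 \
    --dtype bfloat16 \
    --tensor-parallel-size 4 \
    --limit-mm-per-prompt image=5,video=0
```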

This may sound straightforward, but be careful with the size of the images, as they will consume a lot of tokens. I was working with a 4000x3000 PNG and it consumed about 16k tokens, which is a lot; if you are going to support parallel requests, that can be problematic (16k x 5 parallel requests :firecracker:). I resized the picture to approximately 1000x800 and also changed the format to JPG, and the model was still able to get the correct response.
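
If you want to do that resizing in code, a rough sketch with Pillow (the file names and target size are just placeholders) would be:

```python
from PIL import Image

# Downscale a large PNG and re-encode it as JPEG before sending it to the
# model, so the request burns far fewer image tokens.
img = Image.open("original_4000x3000.png").convert("RGB")
img.thumbnail((1000, 800))  # resizes in place, keeping the aspect ratio
img.save("resized.jpg", format="JPEG", quality=90)
```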
I have not tested it yet, but I think you could use --mm-processor-kwargs to pass Qwen features like min_pixels, max_pixels, resized_height and resized_width and do the resize on the fly. If someone has done this, please share your experience.
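
If it works the way I expect, it would look something like this (untested, and the pixel budgets are made-up values, so check the Qwen docs for sensible ones):

```bash
# Untested: cap the image resolution on the server side instead of resizing client-side
vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
    --port 8000 \
    --dtype bfloat16 \
    --tensor-parallel-size 4 \
    --mm-processor-kwargs '{"min_pixels": 50176, "max_pixels": 802816}'
```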
Also, if I understand correctly, with V1 you could try to install FlashAttention 3 (still in beta, I think) and use it.

Finally, at inference time I am using Instructor to get structured outputs, and the code looks something like this:

```python
import instructor
from openai import AsyncOpenAI
from pydantic import BaseModel

vllm_url = "http://...:8000/v1"  # your vLLM server
vllm_api_key = "empty"           # vLLM does not check the key by default
model_name = "Qwen/Qwen2.5-VL-72B-Instruct"

vllm_client = AsyncOpenAI(base_url=vllm_url, api_key=vllm_api_key)
instructor_client = instructor.from_openai(client=vllm_client, mode=instructor.Mode.JSON)  # Not sure why it didn't work with instructor.Mode.TOOLS

class Response(BaseModel):
    reasoning: str
    answer: str

# question, image_1, image_2 and image_3 are assumed to be defined elsewhere
# (the images as Instructor/OpenAI-compatible image inputs).
# create_with_completion returns both the parsed object and the raw completion.
response, completion = await instructor_client.chat.completions.create_with_completion(
    model=model_name,
    response_model=Response,
    messages=[
        {
            "role": "user",
            "content": [
                f"You have extremely good vision. You will receive a set of images. Analyze them and then solve the question. {question}",
                image_1, image_2, image_3,
            ],
        },
    ],
    max_tokens=2048,
    temperature=0.0,
)
```
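
One nice thing about create_with_completion is that you get both the parsed object and the raw completion back, so you can keep an eye on how many tokens your images are actually consuming, e.g.:

```python
print(response.reasoning)
print(response.answer)

# The raw completion exposes token usage, which is handy for catching
# oversized images like the 4000x3000 PNG mentioned above.
print(completion.usage.prompt_tokens, completion.usage.completion_tokens)
```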
