./ggml-model-Q6_K.gguf is not a multimodal model

#4
by robert1968 - opened

Hi,
I'm running vLLM as an API endpoint:
vllm serve ./ggml-model-Q6_K.gguf --tokenizer openbmb/MiniCPM-V-4_5
I'm using Open WebUI as the client and uploading a jpg file, but vLLM reports that this is not a multimodal model.
See the error:
(APIServer pid=11755) ERROR 08-31 20:07:35 [serving_chat.py:245] Error in preprocessing prompt inputs
(APIServer pid=11755) ERROR 08-31 20:07:35 [serving_chat.py:245] Traceback (most recent call last):
(APIServer pid=11755) ERROR 08-31 20:07:35 [serving_chat.py:245] File "/adat/ai/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/serving_chat.py", line 220, in create_chat_completion
(APIServer pid=11755) ERROR 08-31 20:07:35 [serving_chat.py:245] ) = await self._preprocess_chat(
(APIServer pid=11755) ERROR 08-31 20:07:35 [serving_chat.py:245] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=11755) ERROR 08-31 20:07:35 [serving_chat.py:245] File "/adat/ai/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/serving_engine.py", line 869, in _preprocess_chat
(APIServer pid=11755) ERROR 08-31 20:07:35 [serving_chat.py:245] conversation, mm_data_future = parse_chat_messages_futures(
(APIServer pid=11755) ERROR 08-31 20:07:35 [serving_chat.py:245] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=11755) ERROR 08-31 20:07:35 [serving_chat.py:245] File "/adat/ai/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/chat_utils.py", line 1247, in parse_chat_messages_futures
(APIServer pid=11755) ERROR 08-31 20:07:35 [serving_chat.py:245] sub_messages = _parse_chat_message_content(
(APIServer pid=11755) ERROR 08-31 20:07:35 [serving_chat.py:245] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=11755) ERROR 08-31 20:07:35 [serving_chat.py:245] File "/adat/ai/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/chat_utils.py", line 1165, in _parse_chat_message_content
(APIServer pid=11755) ERROR 08-31 20:07:35 [serving_chat.py:245] result = _parse_chat_message_content_parts(
(APIServer pid=11755) ERROR 08-31 20:07:35 [serving_chat.py:245] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=11755) ERROR 08-31 20:07:35 [serving_chat.py:245] File "/adat/ai/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/chat_utils.py", line 1054, in _parse_chat_message_content_parts
(APIServer pid=11755) ERROR 08-31 20:07:35 [serving_chat.py:245] parse_res = _parse_chat_message_content_part(
(APIServer pid=11755) ERROR 08-31 20:07:35 [serving_chat.py:245] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=11755) ERROR 08-31 20:07:35 [serving_chat.py:245] File "/adat/ai/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/chat_utils.py", line 1119, in _parse_chat_message_content_part
(APIServer pid=11755) ERROR 08-31 20:07:35 [serving_chat.py:245] mm_parser.parse_image(str_content)
(APIServer pid=11755) ERROR 08-31 20:07:35 [serving_chat.py:245] File "/adat/ai/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/chat_utils.py", line 757, in parse_image
(APIServer pid=11755) ERROR 08-31 20:07:35 [serving_chat.py:245] placeholder = self._tracker.add("image", image_coro)
(APIServer pid=11755) ERROR 08-31 20:07:35 [serving_chat.py:245] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=11755) ERROR 08-31 20:07:35 [serving_chat.py:245] File "/adat/ai/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/chat_utils.py", line 565, in add
(APIServer pid=11755) ERROR 08-31 20:07:35 [serving_chat.py:245] self.mm_processor.validate_num_items(input_modality, num_items)
(APIServer pid=11755) ERROR 08-31 20:07:35 [serving_chat.py:245] ^^^^^^^^^^^^^^^^^
(APIServer pid=11755) ERROR 08-31 20:07:35 [serving_chat.py:245] File "/usr/lib/python3.12/functools.py", line 995, in get
(APIServer pid=11755) ERROR 08-31 20:07:35 [serving_chat.py:245] val = self.func(instance)
(APIServer pid=11755) ERROR 08-31 20:07:35 [serving_chat.py:245] ^^^^^^^^^^^^^^^^^^^
(APIServer pid=11755) ERROR 08-31 20:07:35 [serving_chat.py:245] File "/adat/ai/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/chat_utils.py", line 555, in mm_processor
(APIServer pid=11755) ERROR 08-31 20:07:35 [serving_chat.py:245] return self.mm_registry.create_processor(self.model_config)
(APIServer pid=11755) ERROR 08-31 20:07:35 [serving_chat.py:245] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=11755) ERROR 08-31 20:07:35 [serving_chat.py:245] File "/adat/ai/vllm/.venv/lib/python3.12/site-packages/vllm/multimodal/registry.py", line 312, in create_processor
(APIServer pid=11755) ERROR 08-31 20:07:35 [serving_chat.py:245] raise ValueError(f"{model_config.model} is not a multimodal model")
(APIServer pid=11755) ERROR 08-31 20:07:35 [serving_chat.py:245] ValueError: ./ggml-model-Q6_K.gguf is not a multimodal model
(APIServer pid=11755) /adat/ai/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/serving_chat.py:246: RuntimeWarning: coroutine 'MediaConnector.fetch_image_async' was never awaited
(APIServer pid=11755) return self.create_error_response(f"{e} {e.cause}")
(APIServer pid=11755) RuntimeWarning: Enable tracemalloc to get the object allocation traceback
(APIServer pid=11755) INFO: 172.17.0.2:53670 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request

OpenBMB org

@robert1968 Yes, the GGUF you're loading is indeed just the LLM. Our ViT component is https://huggingface.co/openbmb/MiniCPM-V-4_5-gguf/blob/main/mmproj-model-f16.gguf, which is used with llama.cpp, so it is shipped separately.
I haven't tested using GGUF directly with vLLM.

Hi,

Thanks for the response.
Sorry, but I don't understand your response at all :)
What does "ViT component" mean?
Why do you mention mmproj-model-f16.gguf when I'm referring to ggml-model-Q6_K.gguf? And
why is llama.cpp important in this context?

Theoretically, the GGUF version is a quantized version of openbmb/MiniCPM-V-4_5, so it should still be a multimodal model.
Isn't that a correct theoretical assumption?

And note that when I load the unquantized version, it does answer questions about an uploaded picture!
( vllm serve openbmb/MiniCPM-V-4_5 --trust-remote-code --gpu-memory-utilization 0.95 --max-model-len 16000 --max-num-seqs 1)
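
For reference, a minimal way to exercise that endpoint directly (independent of Open WebUI) is an OpenAI-style chat completion with an image_url content part. This is only a sketch: it assumes vLLM's default port 8000, the served model name openbmb/MiniCPM-V-4_5, and a reachable example image URL.

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "openbmb/MiniCPM-V-4_5",
        "messages": [{
          "role": "user",
          "content": [
            {"type": "text", "text": "What is in the image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/test.jpg"}}
          ]
        }]
      }'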

best regards.

OpenBMB org • edited 8 days ago

@robert1968
Hi,
Let me answer your questions.
The author of llama.cpp is also the author of GGUF and GGML, so by providing the GGUF format we prioritize use with llama.cpp. We see this as a way of respecting the original author.

When using the multimodal model with llama.cpp, you use a command line similar to the following:
./llama-mtmd-cli -m ../MiniCPM-V-4_5/model/Model-3.6B-F16.gguf --mmproj ../MiniCPM-V-4_5/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image xx.jpg -p "What is in the image?"

This means that the LLM and vision components need to be loaded separately, so the GGUF files I provide are split the same way. The vision component (mmproj) contains the ViT and resampler weights and is the visual module of the multimodal model.
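
Since you want an API endpoint for Open WebUI, note that recent llama.cpp builds support the same split in the server: llama-server accepts the --mmproj flag and exposes an OpenAI-compatible /v1/chat/completions endpoint. A minimal sketch, where the context size and port are illustrative and the paths assume both GGUF files are in the current directory:

./llama-server -m ./ggml-model-Q6_K.gguf --mmproj ./mmproj-model-f16.gguf -c 4096 --port 8080

Open WebUI could then be pointed at http://localhost:8080/v1 instead of the vLLM endpoint.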

If you use vLLM, I actually recommend using the floating-point model directly. I noticed that you have already run it successfully, and I hope it meets your needs. If your GPU memory is limited, you can also try the AWQ-quantized model; we also provide a quantization code repository.
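
Serving an AWQ checkpoint with vLLM looks essentially the same as serving the floating-point model; only the repo id changes. A sketch, with the repo id shown as a placeholder (the exact id may differ):

vllm serve openbmb/MiniCPM-V-4_5-AWQ --trust-remote-code --gpu-memory-utilization 0.95 --max-model-len 16000 --max-num-seqs 1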

I hope this helps. Feel free to ask if you have any more questions.

best regards.

Hi,

Many thanks for explaining it in such detail! It is clear now.
I got your points.

This means that the LLM and vision components need to be loaded separately,

Ah, that is very interesting and new to me. Many thanks for explaining!

My first impression of vLLM with the unquantized MiniCPM-V-4_5 was mediocre.
It took about a minute to load the model and start serving the API endpoint, which is far slower than my experience with Ollama and LM Studio.

As for the model itself, MiniCPM-V-4_5 was not bad in my very short test (2 pictures only).
But when I asked it to "provide all the text from the picture", it did an excellent job on the OCR but started looping on one of the last sentences.
So I won't use vLLM; I'll probably try llama.cpp as you described, or wait until it becomes available as an LM Studio model...

Many thanks for your patience :)

You can close this discussion if you want.

best regards.

OpenBMB org

@robert1968
Thank you for your feedback. It might help to verify the model's performance using the demo I created, so you can assess its quality directly.
https://minicpm-v.openbmb.cn/

If you have any other questions, please feel free to file an issue.

Also, I've heard a lot about the LM Studio framework in recent issues. I haven't used it before, but if it works well, I'll consider adapting to it and adding it to the set of frameworks we update every time we release a model.

Hi, I use the model ggml-model-Q5_K_M.gguf in LM Studio.
When I delete the mmproj-model-f16.gguf file, it can be loaded. Does that mean LM Studio does not support it?
(screenshot attached: image.png)

OpenBMB org

@noenglishname I have not adapted the model for LM Studio.
