GGUF When?
When is the GGUF quantized version releasing?
https://huggingface.co/openbmb/MiniCPM-V-4_5-gguf
We've received your question. GGUF support has already been provided in this repository, and there are quite a few quantized files available; the page cache may simply not have refreshed yet. Please check again.
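For anyone who just wants the files once the page refreshes, a minimal sketch of pulling them with the Hugging Face CLI (the --include patterns are only illustrative; check the repository listing for the actual quant and mmproj file names):

huggingface-cli download openbmb/MiniCPM-V-4_5-gguf --local-dir MiniCPM-V-4_5-gguf --include "*Q4_K_M*" "*mmproj*"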
May I ask when the issue with the GGUF model in llama-server, where reasoning mode cannot be disabled after startup, will be resolved? Thank you very much.
@zhouxihong
I noticed that llama.cpp already includes instructions for using think mode.
You can disable think mode for the model by setting the environment variable: export LLAMA_ARG_THINK=0.
Hope this helps. If you have any further questions, feel free to raise an issue.
Under normal circumstances, the "disable reasoning" option provided by llama.cpp is already sufficient, especially for text models. However, for minicpm-v-4.5, --reasoning-budget 0 still does not work. As for the LLAMA_ARG_THINK environment variable, I tried setting it to 0, but the program immediately threw an error. On closer inspection, this variable only accepts two values, none or deepseek, which control the reasoning format; it cannot be used to disable reasoning:
E:\Downloads\llama-b6337-bin-win-cuda-12.4-x64>llama-server.exe --port 8011 --no-mmap -ngl 200 -c 8192 -n 8192 -m E:\Downloads\minicpm-v-4.5.gguf --mmproj E:\Downloads\minicpm-v-4.5_mmproj-model-f16.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 Ti, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from E:\Downloads\llama-b6337-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from E:\Downloads\llama-b6337-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from E:\Downloads\llama-b6337-bin-win-cuda-12.4-x64\ggml-cpu-haswell.dll
error while handling environment variable "LLAMA_ARG_THINK": Unknown reasoning format: 0
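For reference, the error message suggests the variable expects one of the format names instead of a number. A sketch on Windows cmd, assuming LLAMA_ARG_THINK simply mirrors --reasoning-format:

set LLAMA_ARG_THINK=none
llama-server.exe --port 8011 --no-mmap -ngl 200 -c 8192 -n 8192 -m E:\Downloads\minicpm-v-4.5.gguf --mmproj E:\Downloads\minicpm-v-4.5_mmproj-model-f16.gguf

But if so, that only changes how the reasoning text is parsed and returned, not whether the model reasons.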
@zhouxihong
I see. I may not have tested this on GPU.
I'll take a look at what the error is.
If it's convenient, could you send me your test data?
My email address is “[email protected]”.
My steps are as follows:
First, I start llama-server normally (the command I ran is shown above, and it is indeed running on the GPU). Then I send an image and chat about it. Even with --reasoning-budget 0, the option has no effect. What's strange is that for some images reasoning can actually be disabled, but in most cases, as soon as the image is slightly more complex, reasoning can no longer be disabled.
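For completeness, the request I send is roughly the standard OpenAI-compatible chat call with an inline image. A sketch only, with the base64 payload elided and bash-style quoting for readability (field names per llama-server's /v1/chat/completions API):

curl http://127.0.0.1:8011/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":[{"type":"text","text":"Describe this image."},{"type":"image_url","image_url":{"url":"data:image/jpeg;base64,..."}}]}]}'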
I noticed that the text model of minicpm-v-4.5 is qwen3, so I used a qwen3 Jinja template and forced llama.cpp to load it at startup to avoid reasoning. This method works. The command is as follows:
llama-server.exe --port 8011 --no-mmap -ngl 200 -c 8192 -n 8192 -m E:\Downloads\minicpm-v-4.5.gguf --mmproj E:\Downloads\minicpm-v-4.5_mmproj-model-f16.gguf --chat-template-file E:\Downloads\qwen3_nonthinking.jinja --jinja
However, the --reasoning-budget 0 setting is ineffective, which might be a bug in llama.cpp regarding the reasoning switch for multimodal models. As a temporary workaround, the Jinja template method above can be used.
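In case it helps others, the part of qwen3_nonthinking.jinja that does the actual work is just a generation prompt ending in an empty think block. A rough sketch, not the full Qwen3 template, and the exact tokens are my assumption based on the published Qwen3 chat format:

{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n<think>\n\n</think>\n\n' }}
{%- endif %}

Pre-filling the empty <think></think> pair makes the model skip the reasoning section, which is presumably why the template works here even though --reasoning-budget 0 does not.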
@zhouxihong
I think it can't fully reuse qwen3 templates.
I used export LLAMA_ARG_THINK=0 and it worked, but I haven't tested whether it also works with llama-server. I'll test that later.
Alright, I’m really looking forward to the official fix from the minicpm team. Many thanks!
@zhouxihong Thank you for your understanding. We will submit the changes to llama.cpp as soon as possible.