GGUF When?
When is the GGUF quantized version releasing?
https://huggingface.co/openbmb/MiniCPM-V-4_5-gguf
We've received your question. GGUF support has already been provided in this repository, and there are quite a few quantized files available; the page cache may simply not have refreshed yet. Please check again.
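For anyone who just wants the files once the page refreshes, a minimal sketch of pulling them with the Hugging Face CLI (the --include patterns are only illustrative; check the repository listing for the actual quant and mmproj file names):

huggingface-cli download openbmb/MiniCPM-V-4_5-gguf --local-dir MiniCPM-V-4_5-gguf --include "*Q4_K_M*" "*mmproj*"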
May I ask when the issue with the GGUF model in llama-server, where reasoning mode cannot be disabled after startup, will be resolved? Thank you very much.
@zhouxihong
I noticed that llama.cpp already includes instructions for using think mode.
You can disable think mode for the model by setting the environment variable: export LLAMA_ARG_THINK=0.
Hope this helps. If you have any further questions, feel free to raise an issue.
Under normal circumstances, the "disable reasoning" option provided by llama.cpp is already sufficient, especially for text models. However, for minicpm-v-4.5, --reasoning-budget 0 still does not work. As for the LLAMA_ARG_THINK environment variable, I tried setting it to 0, but the program immediately threw an error. On closer inspection, this variable only accepts two values, none or deepseek, which control the reasoning format; it cannot be used to disable reasoning:
E:\Downloads\llama-b6337-bin-win-cuda-12.4-x64>llama-server.exe --port 8011 --no-mmap -ngl 200 -c 8192 -n 8192 -m E:\Downloads\minicpm-v-4.5.gguf --mmproj E:\Downloads\minicpm-v-4.5_mmproj-model-f16.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 Ti, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from E:\Downloads\llama-b6337-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from E:\Downloads\llama-b6337-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from E:\Downloads\llama-b6337-bin-win-cuda-12.4-x64\ggml-cpu-haswell.dll
error while handling environment variable "LLAMA_ARG_THINK": Unknown reasoning format: 0
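For reference, the error message suggests the variable expects one of the format names instead of a number. A sketch on Windows cmd, assuming LLAMA_ARG_THINK simply mirrors --reasoning-format:

set LLAMA_ARG_THINK=none
llama-server.exe --port 8011 --no-mmap -ngl 200 -c 8192 -n 8192 -m E:\Downloads\minicpm-v-4.5.gguf --mmproj E:\Downloads\minicpm-v-4.5_mmproj-model-f16.gguf

But if so, that only changes how the reasoning text is parsed and returned, not whether the model reasons.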
@zhouxihong
I see. I may not have tested this on GPU.
I'll take a look at what the error is.
If it's convenient, could you send me your test data?
My email address is “[email protected]”.
My steps are as follows:
First, I start llama-server normally (the command I ran is shown above, and it is indeed running on the GPU). Then I send an image and chat about it. Even with --reasoning-budget 0, the option has no effect. What's strange is that for some images reasoning can actually be disabled, but in most cases, as soon as the image is slightly more complex, reasoning can no longer be disabled.
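For completeness, the request I send is roughly the standard OpenAI-compatible chat call with an inline image. A sketch only, with the base64 payload elided and bash-style quoting for readability (field names per llama-server's /v1/chat/completions API):

curl http://127.0.0.1:8011/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":[{"type":"text","text":"Describe this image."},{"type":"image_url","image_url":{"url":"data:image/jpeg;base64,..."}}]}]}'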
I noticed that the text model of minicpm-v-4.5 is qwen3, so I used a qwen3 Jinja template and forced llama.cpp to load it at startup to avoid reasoning. This method works. The command is as follows:
llama-server.exe --port 8011 --no-mmap -ngl 200 -c 8192 -n 8192 -m E:\Downloads\minicpm-v-4.5.gguf --mmproj E:\Downloads\minicpm-v-4.5_mmproj-model-f16.gguf --chat-template-file E:\Downloads\qwen3_nonthinking.jinja --jinja
However, the --reasoning-budget 0 setting is ineffective, which might be a bug in llama.cpp regarding the reasoning switch for multimodal models. As a temporary workaround, the Jinja template method above can be used.
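In case it helps others, the part of qwen3_nonthinking.jinja that does the actual work is just a generation prompt ending in an empty think block. A rough sketch, not the full Qwen3 template, and the exact tokens are my assumption based on the published Qwen3 chat format:

{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n<think>\n\n</think>\n\n' }}
{%- endif %}

Pre-filling the empty <think></think> pair makes the model skip the reasoning section, which is presumably why the template works here even though --reasoning-budget 0 does not.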
@zhouxihong
I think it can't fully reuse qwen3 templates.
I used export LLAMA_ARG_THINK=0 and it worked, but I haven't tested whether it also works with llama-server. I'll test that later.
Alright, I’m really looking forward to the official fix from the minicpm team. Many thanks!
@zhouxihong Thank you for your understanding. We will submit the changes to llama.cpp as soon as possible.