llama-cpp: no <think> opening
Hey guys, I quantized your model using https://huggingface.co/spaces/ggml-org/gguf-my-repo (updated every 6 hours to the latest build).
I'm using the template from this repo, where I can indeed see the logic that prepends <think> when enable_reasoning is true or not set. I'm also setting the system prompt you recommended:
You are a helpful AI assistant. You always reason before responding, using the following format:
<think>
your internal reasoning
</think>
your external response
However, despite that, I can't get the output to contain the opening <think> tag.
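To double-check the template itself outside llama.cpp, here is roughly how I render it with plain Jinja2 and look for the tag. Just a sketch: plain Jinja2 may be missing helpers the template expects, and the enable_thinking flag name is my guess at what the template actually checks.

# Rough sanity check: render the repo's chat template outside llama.cpp and
# see whether <think> shows up after the assistant header. Plain Jinja2 may
# lack helpers the template expects, and the "enable_thinking" variable name
# is an assumption; adjust to whatever the template actually reads.
from jinja2 import Environment

template_path = "/mnt/277c6bdc-56fd-45a3-9195-3612028a5a15/GGUFs/cwm-Q4_K_M-GGUF/template.jinja"
with open(template_path) as f:
    template = Environment().from_string(f.read())

messages = [
    # shortened stand-in for the full recommended system prompt above
    {"role": "system", "content": "You are a helpful AI assistant. ..."},
    {"role": "user", "content": "Hello!"},
]

prompt = template.render(
    messages=messages,
    add_generation_prompt=True,
    enable_thinking=True,
)
print(prompt)
print("opening <think> present:", "<think>" in prompt)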
Here is my llama-cpp serving command:
/home/user/llama.cpp/build/bin/llama-server \
--model /mnt/277c6bdc-56fd-45a3-9195-3612028a5a15/GGUFs/cwm-Q4_K_M-GGUF/cwm-q4_k_m.gguf \
--ctx-size 16000 \
--no-context-shift \
--n-gpu-layers 100 \
--temp 0.1 \
--jinja \
--chat-template-file /mnt/277c6bdc-56fd-45a3-9195-3612028a5a15/GGUFs/cwm-Q4_K_M-GGUF/template.jinja \
--host 0.0.0.0 \
--port ${PORT} \
--flash-attn on \
--chat-template-kwargs '{"enable_thinking":true}' # not required
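And this is roughly how I query it, via the OpenAI-compatible /v1/chat/completions endpoint; the host, port, and prompt below are placeholders for my actual setup.

# Minimal request against llama-server's OpenAI-compatible endpoint;
# host/port and the prompt are placeholders.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "system", "content": "You are a helpful AI assistant. ..."},
            {"role": "user", "content": "What is 2 + 2?"},
        ],
        "temperature": 0.1,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])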
I'm stuck there. I can't try inference through vLLM because I don't have enough RAM. Does anyone have any clues about what's going on, or should I submit an issue to llama.cpp?
Thanks!
Hey @owao - first, until it's merged, please patch in https://github.com/huggingface/transformers/pull/41199 and use it with HF to ensure correctness. With the model type Llama3ForCausalLM, the model will still "work" with HF as is, but without certain sliding-window attention parameters properly set. We're hoping to merge this PR tomorrow, which will fix these issues and define a specific model type.
With respect to the output containing the opening <think> tag: this will never be the case, since the tag is defined in the chat template and injected automatically after the assistant header. This is by design.
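If you want the full trace including the tag, you have to add it back on the client side; a rough illustration only, where completion stands in for the raw text the server returns.

# Since "<think>" is injected into the prompt by the template, the returned
# completion starts directly with the reasoning. Prepend the tag client-side
# if you want the full <think>...</think> trace. Illustrative sketch only.
completion = "the model's internal reasoning\n</think>\nthe external response"

full_trace = "<think>\n" + completion
reasoning, _, answer = full_trace.partition("</think>")
print(reasoning.removeprefix("<think>").strip())  # internal reasoning
print(answer.strip())                             # external response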
@jacobkahn Thanks for the insight :) I hope you'll be able to merge it without too much difficulty.
Regarding the opening <think> tag: I understood the model won't generate it itself, but I couldn't get it prepended to the output even with the template that is supposed to handle it.
I'll wait until your PR gets merged, and I'll also check whether there is a specific issue on the llama.cpp side: I know they use minja rather than full Jinja, so maybe the issue comes from that, but that's just a blind guess.
Thanks again for your answer.