llama-cpp: no <think> opening
Hey guys, I quantized your model using https://huggingface.co/spaces/ggml-org/gguf-my-repo (updated every 6 hours to the latest build).
I'm using the template from this repo, where I can indeed see the logic that prepends <think> when enable_reasoning is true or not set. I'm also setting the system prompt you recommended:
You are a helpful AI assistant. You always reason before responding, using the following format:
<think>
your internal reasoning
</think>
your external response
However, despite that, I can't get the output to contain the opening <think> tag.
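To double-check the template itself outside llama.cpp, here is roughly how I render it with plain Jinja2 and look for the tag. Just a sketch: plain Jinja2 may be missing helpers the template expects, and the enable_thinking flag name is my guess at what the template actually checks.

# Rough sanity check: render the repo's chat template outside llama.cpp and
# see whether <think> shows up after the assistant header. Plain Jinja2 may
# lack helpers the template expects, and the "enable_thinking" variable name
# is an assumption; adjust to whatever the template actually reads.
from jinja2 import Environment

template_path = "/mnt/277c6bdc-56fd-45a3-9195-3612028a5a15/GGUFs/cwm-Q4_K_M-GGUF/template.jinja"
with open(template_path) as f:
    template = Environment().from_string(f.read())

messages = [
    # shortened stand-in for the full recommended system prompt above
    {"role": "system", "content": "You are a helpful AI assistant. ..."},
    {"role": "user", "content": "Hello!"},
]

prompt = template.render(
    messages=messages,
    add_generation_prompt=True,
    enable_thinking=True,
)
print(prompt)
print("opening <think> present:", "<think>" in prompt)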
Here is my llama-cpp serving command:
/home/user/llama.cpp/build/bin/llama-server \
--model /mnt/277c6bdc-56fd-45a3-9195-3612028a5a15/GGUFs/cwm-Q4_K_M-GGUF/cwm-q4_k_m.gguf \
--ctx-size 16000 \
--no-context-shift \
--n-gpu-layers 100 \
--temp 0.1 \
--jinja \
--chat-template-file /mnt/277c6bdc-56fd-45a3-9195-3612028a5a15/GGUFs/cwm-Q4_K_M-GGUF/template.jinja \
--host 0.0.0.0 \
--port ${PORT} \
--flash-attn on \
--chat-template-kwargs '{"enable_thinking":true}' # not required
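And this is roughly how I query it, via the OpenAI-compatible /v1/chat/completions endpoint; the host, port, and prompt below are placeholders for my actual setup.

# Minimal request against llama-server's OpenAI-compatible endpoint;
# host/port and the prompt are placeholders.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "system", "content": "You are a helpful AI assistant. ..."},
            {"role": "user", "content": "What is 2 + 2?"},
        ],
        "temperature": 0.1,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])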
I'm stuck there. I can't try inference through vLLM because I don't have enough RAM. Does anyone have any clues about what's going on, or should I submit an issue to llama.cpp?
Thanks!
Hey @owao - first, until it's merged, please patch in https://github.com/huggingface/transformers/pull/41199 and use it with HF to ensure correctness. With the model type Llama3ForCausalLM, the model will still "work" with HF as is, but without certain sliding-window attention parameters properly set. We're hoping to merge this PR tomorrow, which will fix these issues and define a specific model type.
With respect to the output containing the opening <think> tag: this will never be the case, since the tag is defined in the chat template and injected automatically after the assistant header. This is by design.
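If you want the full trace including the tag, you have to add it back on the client side; a rough illustration only, where completion stands in for the raw text the server returns.

# Since "<think>" is injected into the prompt by the template, the returned
# completion starts directly with the reasoning. Prepend the tag client-side
# if you want the full <think>...</think> trace. Illustrative sketch only.
completion = "the model's internal reasoning\n</think>\nthe external response"

full_trace = "<think>\n" + completion
reasoning, _, answer = full_trace.partition("</think>")
print(reasoning.removeprefix("<think>").strip())  # internal reasoning
print(answer.strip())                             # external response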
@jacobkahn Thanks for the insight :) I hope you'll be able to merge it without too much difficulty.
Regarding the opening <think> tag: I understood the model won't generate it itself, but I couldn't get it prepended to the output even with the template that is supposed to handle it.
I'll wait until your PR gets merged, and I'll also check whether there is a specific issue on the llama.cpp side: I know they use minja rather than full Jinja, so maybe the issue comes from that, but that's just a blind guess.
Thanks again for your answer.