⚠️ Chat template foot gun with DeepSeek distilled models and RL format reward function
I've been wondering why the DeepSeek distilled models never seem to satisfy our format reward function, while other models like Qwen/Llama do almost immediately. As with most problems in post-training, the source of the problem is the chat template 🫠!
Let's start by taking a look at the answer to 1+1:
```python
from transformers import pipeline

pipe = pipeline("text-generation", model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", device="cuda")
out = pipe([{"role": "user", "content": "What is 1+1?"}], max_new_tokens=512, return_full_text=True)[0]["generated_text"]
print(out)
```
```
[
    {"role": "user", "content": "What is 1+1?"},
    {
        "role": "assistant",
        "content": "First, I recognize that the question is asking for the sum of 1 and 1.\n\nI know that addition is the process of combining numbers to find their total.\n\nSo, I will add 1 and 1 together.\n\n1 plus 1 equals 2.\n\nTherefore, the answer is 2.\n</think>\n\n**Solution:**\n\nTo find the sum of 1 and 1, follow these steps:\n\n1. **Add the numbers together:**\n \n \\[\n 1 + 1 = 2\n \\]\n\n2. **State the final answer:**\n\n \\[\n \\boxed{2}\n \\]",
    },
]
```
As you can see in the assistant response, the `</think>` tag appears as expected. Note that there is no opening `<think>` tag, because DeepSeek prefills the assistant response with `<think>\n` in the chat template.
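A quick way to see the prefill for yourself (a minimal sketch; the exact rendering depends on the tokenizer revision) is to render only the user turn with `add_generation_prompt=True`:

```python
# Sketch: render just the user turn and inspect the generation prompt.
# The key point is that the string the model continues from already ends with
# an opening <think> tag, so the model never emits one itself.
prompt = pipe.tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 1+1?"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
# <|begin▁of▁sentence|><|User|>What is 1+1?<|Assistant|><think>
```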
However, when you feed these messages back into the tokenizer, the part contained in the `<think>` tags is hidden:
```python
print(pipe.tokenizer.apply_chat_template(out, tokenize=False))
```

```
<|begin▁of▁sentence|><|User|>What is 1+1?<|Assistant|>

**Solution:**

To find the sum of 1 and 1, follow these steps:

1. **Add the numbers together:**

   \[
   1 + 1 = 2
   \]

2. **State the final answer:**

   \[
   \boxed{2}
   \]<|end▁of▁sentence|>
```
Now why is that? Well, in the chat template, DeepSeek set the content of the assistant response to be everything after the closing `</think>` token.
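Here is a minimal Python sketch of that behaviour (my paraphrase of the template logic, not the verbatim Jinja from the tokenizer config):

```python
# Sketch: what the chat template effectively does with assistant content.
# Everything up to and including the closing </think> tag is dropped.
content = out[-1]["content"]  # the assistant message generated above
if "</think>" in content:
    content = content.split("</think>")[-1]  # keep only the text after the reasoning
print(content)  # only the "**Solution:** ..." part survives
```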
For GRPO training, we wrap the completions in the chat template, so the model cannot satisfy the requirements of the formatting function 🙈. I also realised the following: our formatting function requires the answer to be wrapped in `<answer>` and `</answer>` tags. These are induced by the default system prompt, but they are not the natural output of the DeepSeek models. I think it's fine to keep them, but it's something to be aware of.
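To make the mismatch concrete, here is a small sketch: a format check in the spirit of the open-r1 one (the exact pattern in the repo may differ), applied to the raw DeepSeek-style completion above, which has no opening `<think>` and no `<answer>` tags.

```python
import re

# Hypothetical format check in the spirit of the open-r1 reward function;
# the exact regex in the repo may differ.
pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"

completion = out[-1]["content"]  # no opening <think>, no <answer> tags
print(bool(re.match(pattern, completion, re.DOTALL)))  # False -> format reward stays at 0
```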
Conclusion
- The formatting reward is likely not useful / needed for the DeepSeek distilled models
- If you still want to use it, you should probably override the chat template to display the `<think>` tokens
Well spotted @lewtun!
Speaking of formatting reward functions, I have noticed something interesting when applying GRPO to the SmolLM2-1.7B-Instruct models. With a warmup ratio of 0.1, the model learns both the accuracy reward and the format reward, but I wanted to train on more samples, so I lowered the warmup period, since it is a ratio of the total samples. In one ablation with no warmup, the accuracy reward is grokked more quickly, but the format reward stays at 0 and the model never learns the template, so some careful tuning may be needed to find the right balance.
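For reference, this is roughly where that knob lives, assuming TRL's `GRPOConfig` (the other values below are placeholders, not the settings from the runs above):

```python
from trl import GRPOConfig

# Sketch only: warmup_ratio (inherited from transformers.TrainingArguments) is the
# knob discussed above; every other value here is a placeholder.
training_args = GRPOConfig(
    output_dir="smollm2-1.7b-grpo",  # placeholder path
    warmup_ratio=0.1,                # 0.0 in the ablation where the format reward stayed at 0
    num_train_epochs=1,              # placeholder
)
```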
For the curious, here's the logic in TRL where we apply the reward function to the prompts and completions: https://github.com/huggingface/trl/blob/ffcb9f4aee725a2bd072d0387afe68a4b1c7967c/trl/trainer/grpo_trainer.py#L623
Since our format reward function is only applied to the completion AND the DeepSeek chat template prefills the `<think>` token in the prompt, we have little chance of matching the regex.
Something like this does the trick:
```python
import re


def format_reward_deepseek(prompts, completions, **kwargs):
    pattern = r"^<think>.*?</think>$"
    # Prepend a <think> tag because the DeepSeek chat template prefills the assistant
    # response with <think>\n, so the tag is absent from the completion itself
    completion_contents = [f'<think>\n{completion[0]["content"]}' for completion in completions]
    matches = [re.match(pattern, content, re.DOTALL | re.MULTILINE) for content in completion_contents]
    return [1.0 if match else 0.0 for match in matches]
```
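A quick sanity check (the sample completion below is made up, but shaped like the DeepSeek output from earlier):

```python
# A completion with no opening <think> but a closing </think> now scores 1.0,
# because the missing tag is prepended before matching.
completions = [[{"content": "First I add 1 and 1.\n</think>\n\nThe answer is 2."}]]
print(format_reward_deepseek(prompts=None, completions=completions))  # [1.0]
```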
I also now understand why DeepSeek pre-fills their distilled model chat templates with `<think>\n`: if you don't do this, the SFT model will typically fail to follow the template and revert back to whatever format was used to train the original instruct model.
A better alternative to hacking the chat template is to use the `continue_final_message` feature as follows:
```python
pipe(
    [{"role": "user", "content": PROMPT}, {"role": "assistant", "content": "<think>\n"}],
    continue_final_message=True,
)
```
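With `continue_final_message=True`, the chat template renders that final assistant turn without an end-of-turn token, so generation continues straight from `<think>\n` and the completion you score begins right after it, with no edits to the template itself.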
We ended up allowing users to set their own chat template in https://github.com/huggingface/open-r1/pull/349
I bumped into similar issues here, where the format reward is always zero.