⚠️ Chat template foot gun with DeepSeek distilled models and RL format reward function

#17
opened by lewtun (HF staff)
Open R1 org

I've been wondering why the DeepSeek distilled models don't seem to follow our format reward function, while other models like Qwen/Llama do almost immediately. As with most problems in post-training, the source of the problem is the chat template 🫠!

Let's start by taking a look at the answer to 1+1:

from transformers import pipeline

pipe = pipeline("text-generation", model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", device="cuda")

out = pipe([{"role": "user", "content": "What is 1+1?"}], max_new_tokens=512, return_full_text=True)
out = out[0]["generated_text"]  # keep just the list of chat messages
print(out)
[
    {"role": "user", "content": "What is 1+1?"},
    {
        "role": "assistant",
        "content": "First, I recognize that the question is asking for the sum of 1 and 1.\n\nI know that addition is the process of combining numbers to find their total.\n\nSo, I will add 1 and 1 together.\n\n1 plus 1 equals 2.\n\nTherefore, the answer is 2.\n</think>\n\n**Solution:**\n\nTo find the sum of 1 and 1, follow these steps:\n\n1. **Add the numbers together:**\n   \n   \\[\n   1 + 1 = 2\n   \\]\n\n2. **State the final answer:**\n\n   \\[\n   \\boxed{2}\n   \\]",
    },
]

As you can see in the assistant response, we have the </think> tag appearing as expected. Note that there is no opening <think> tag because DeepSeek prefills the assistant response in the chat template:

[Screenshot: the chat template's generation prompt prefilling the assistant turn with <think>\n]
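
If you want to see the prefill for yourself, a quick check is to render the generation prompt with the same tokenizer (a small sketch; the exact template text is what the screenshot above shows):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
# add_generation_prompt=True appends the assistant turn marker, and (as the
# screenshot shows) the template prefills the opening <think> there, so the
# model never has to emit it itself.
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is 1+1?"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(repr(prompt))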

However, when you feed these messages back through the tokenizer's chat template, the part contained in the think tags is stripped:

print(pipe.tokenizer.apply_chat_template(out, tokenize=False))
<|begin▁of▁sentence|><|User|>What is 1+1?<|Assistant|>

**Solution:**

To find the sum of 1 and 1, follow these steps:

1. **Add the numbers together:**
   
   \[
   1 + 1 = 2
   \]

2. **State the final answer:**

   \[
   \boxed{2}
   \]<|end▁of▁sentence|>

Now why is that? Well, in the chat template, DeepSeek sets the content of the assistant response to everything after the closing </think> token:

[Screenshot: the part of the chat template that keeps only the content after the closing </think> token for assistant turns]
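
In Python terms, the template does something equivalent to the following (a paraphrase of the Jinja logic in the screenshot, continuing from the out messages above, not the template's literal text):

# Only the text after the closing think tag survives re-templating; the
# reasoning that came before it is dropped.
assistant_content = out[1]["content"]
visible = assistant_content.split("</think>")[-1]
print(visible)  # just the "**Solution:** ..." part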

For GRPO training, we wrap the completions in the chat template, so the model cannot satisfy the requirements of the formatting function 🙈. I also realised the following: our formatting function requires the answer to be wrapped in <answer> and </answer> tags. These are induced by the default system prompt, but are not the natural output of the DeepSeek models. I think it's fine to keep them, but it's something to be aware of.
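
To make the mismatch concrete, here is a sketch of a <think>/<answer>-style format reward along the lines described above (the exact regex used in open-r1 may differ):

import re

def format_reward(completions, **kwargs):
    # A DeepSeek-style completion never starts with <think>, because the chat
    # template already emitted it in the prompt, so this always returns 0.0
    # for those models.
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    contents = [completion[0]["content"] for completion in completions]
    return [1.0 if re.match(pattern, content, re.DOTALL) else 0.0 for content in contents]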

Conclusion

  • The formatting reward is likely not useful / needed for the DeepSeek distilled models
  • If you still want to use it, you should probably override the chat template so that the <think> tokens are displayed (see the sketch below)
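
One way to do the override (a rough sketch: it assumes the stripping is implemented as a Jinja split on the closing tag, as in the screenshot above, so inspect tok.chat_template before relying on it):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
# Hypothetical patch: if the template drops the reasoning via a split on the
# closing tag, removing that split keeps the full reasoning span when
# assistant turns are re-templated. Guarded so it is a no-op if the template
# text differs from this guess.
needle = ".split('</think>')[-1]"
if needle in tok.chat_template:
    tok.chat_template = tok.chat_template.replace(needle, "")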
Open R1 org

Well spotted @lewtun !

Speaking of formatting reward functions, I have noticed something interesting when applying GRPO to the SmolLM2-1.7B-Instruct models. With a warmup ratio of 0.1, the model learns both the accuracy reward and the format reward. I wanted to train on more samples, though, so I lowered the warmup period, since it is a ratio of the total number of samples. In one ablation with no warmup, the accuracy reward is grokked more quickly, but the format reward stays at 0 and the model does not learn the template, so some careful tuning may be needed to find the right balance:

[Plot: accuracy and format rewards for SmolLM2-1.7B-Instruct GRPO runs with and without warmup]
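
For reference, the knob being tuned here is the standard warmup_ratio that TRL's GRPOConfig inherits from transformers' TrainingArguments (the values and path below are placeholders, not the actual run config):

from trl import GRPOConfig

# warmup_ratio is expressed as a fraction of the total training steps, so
# training on more samples with the same ratio also stretches the absolute
# warmup, hence the temptation to lower it.
training_args = GRPOConfig(
    output_dir="smollm2-grpo",  # placeholder output path
    warmup_ratio=0.1,           # 0.0 corresponds to the no-warmup ablation
)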

Open R1 org

For the curious, here's the logic in TRL where we apply the reward function to the prompts and completions: https://github.com/huggingface/trl/blob/ffcb9f4aee725a2bd072d0387afe68a4b1c7967c/trl/trainer/grpo_trainer.py#L623

Since our format reward function is only applied to the completion AND the DeepSeek chat template prefills the <think> token in the prompt, we have little chance of matching the regex.

Open R1 org

Something like this does the trick:

import re


def format_reward_deepseek(prompts, completions, **kwargs):
    pattern = r"^<think>.*?</think>$"
    # Prepend a <think> tag because the DeepSeek chat template prefills the
    # assistant response with it, so it is absent from the completion
    completion_contents = [f'<think>\n{completion[0]["content"]}' for completion in completions]
    matches = [re.match(pattern, content, re.DOTALL | re.MULTILINE) for content in completion_contents]
    return [1.0 if match else 0.0 for match in matches]
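
A quick sanity check on made-up completions shows the prepended tag doing its job:

# DeepSeek-style completion: reasoning, then the closing tag (the opening
# <think> lives in the prompt), then the visible answer.
good = [{"content": "1 plus 1 equals 2.\n</think>\n\nThe answer is 2."}]
# A completion that never closes the reasoning block.
bad = [{"content": "1 plus 1 equals 2. The answer is 2."}]
print(format_reward_deepseek(prompts=None, completions=[good, bad]))  # [1.0, 0.0]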
Open R1 org

I also now understand why DeepSeek pre-fills their distilled model chat templates with <think>\n: if you don't do this, the SFT model will typically fail to follow the template and revert back to whatever format was used to train the original instruct model.

A better alternative to hacking the chat template is to use the continue_final_message feature as follows:

pipe([{"role": "user", "content": PROMPT}, {"role": "assistant", "content": "<think>\n"}], continue_final_message=True)
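
For completeness, here is how that looks end to end with the pipeline from the first snippet (the prompt is just an example):

messages = [
    {"role": "user", "content": "What is 1+1?"},
    # Partial assistant turn: generation continues from "<think>\n" instead of
    # starting a fresh turn, so the opening tag ends up in the rendered prompt
    # without touching the chat template.
    {"role": "assistant", "content": "<think>\n"},
]
out = pipe(messages, max_new_tokens=512, continue_final_message=True)
print(out[0]["generated_text"])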
Open R1 org

We ended up allowing users to set their own chat template in https://github.com/huggingface/open-r1/pull/349


I bumped into similar issues here where the format reward is always zero.
