The correct way of fine-tuning on multi-turn trajectories
#11 · opened by hr0nix
Looking at the Qwen 3 chat template, the last assistant turn always includes <think></think> tags, even in non-thinking mode, while the intermediate assistant turns never include reasoning traces or tags. This creates an asymmetry between the last assistant turn and all previous ones, and it makes it unclear how to fine-tune this model on multi-turn trajectories: if one simply trains on the whole trajectory with assistant-turn masking, the intermediate turns will be OOD, since they won't have thinking tags.
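For anyone who wants to see this directly, here's a minimal sketch that renders a multi-turn trajectory through the template (assuming the Qwen/Qwen3-8B tokenizer; the conversation content is made up):

```python
# Minimal sketch of the asymmetry described above.
# Assumes the Qwen/Qwen3-8B tokenizer; the conversation itself is made up.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

messages = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "<think>Trivial arithmetic.</think>It's 4."},
    {"role": "user", "content": "Multiply that by 3."},
    {"role": "assistant", "content": "<think>4 * 3 = 12.</think>That's 12."},
]

# Render the trajectory as it would be tokenized for fine-tuning:
# the reasoning block survives only in the final assistant turn.
print(tokenizer.apply_chat_template(messages, tokenize=False))
```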
What's the recommended approach here? Should we always train only on the last turn, or should we simply ignore the asymmetry?
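To make the two options concrete, here's a hypothetical sketch of the masking schemes at the label level (`build_labels` and `assistant_spans` are made-up names; -100 is the standard ignore index for cross-entropy in PyTorch):

```python
# Hypothetical sketch of the two masking options; not from any official recipe.
import torch

def build_labels(input_ids: torch.Tensor, assistant_spans, last_turn_only=False):
    """assistant_spans: list of (start, end) token index ranges of assistant turns."""
    labels = torch.full_like(input_ids, -100)     # mask everything by default
    spans = assistant_spans[-1:] if last_turn_only else assistant_spans
    for start, end in spans:
        labels[start:end] = input_ids[start:end]  # unmask assistant tokens
    return labels

# Option A (last_turn_only=False): train on all assistant turns, accepting that
# intermediate ones lack <think> tags.
# Option B (last_turn_only=True): train only on the final turn, which carries them.
```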