The correct way of fine-tuning on multi-turn trajectories

by hr0nix

Looking at the Qwen3 chat template, the last assistant turn always has <think></think> tags, even in non-thinking mode, while intermediate assistant turns never include reasoning traces or tags. This creates an asymmetry between the last assistant turn and all previous turns, and that asymmetry makes it unclear how to fine-tune this model on multi-turn trajectories: if one simply trains on the whole trajectory with assistant-turn masking, the intermediate turns will be out of distribution (OOD) because they won't have thinking tags.

What's the recommended approach here? Should we always train only on the last turn, or should we simply ignore this asymmetry?
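For context, a minimal sketch of how to inspect the asymmetry by rendering a multi-turn conversation with the chat template; the "Qwen/Qwen3-8B" checkpoint and the example messages are just placeholders, and the exact tags in the output may differ between Qwen3 releases:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

# A toy multi-turn trajectory; replace with your own data.
messages = [
    {"role": "user", "content": "Hi, what is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4."},
    {"role": "user", "content": "And times 3?"},
    {"role": "assistant", "content": "(2 + 2) * 3 = 12."},
]

# Render the full conversation as it would be formatted for training,
# then check which assistant turns carry <think></think> tags.
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)
```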

You can fine-tune on a multi-turn trajectory by splitting it into multiple examples and removing all <think></think> tags from the history turns.
So instead of training on the following,
[image: the full multi-turn trajectory formatted as a single training example]
train on the following, where the blue parts are masked and the red parts are trained.
[image: the trajectory split into one example per assistant turn, with history masked and only the final assistant turn trained]
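A minimal sketch of that split, assuming plain role/content message dicts and leaving loss masking to whatever training framework you use; strip_think and split_trajectory are hypothetical helpers, not part of any library:

```python
import re

def strip_think(text):
    # Remove any <think>...</think> block from a history turn.
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL).strip()

def split_trajectory(messages):
    """Turn one multi-turn trajectory into one training example per assistant turn.

    In each example, all earlier turns are history (masked during training,
    think tags removed) and only the final assistant turn is trained on.
    """
    examples = []
    for i, msg in enumerate(messages):
        if msg["role"] != "assistant":
            continue
        history = [
            {**m, "content": strip_think(m["content"])} for m in messages[:i]
        ]
        target = messages[i]  # think tags are kept only on the trained turn
        examples.append({"history": history, "target": target})
    return examples
```

With a trajectory of N assistant turns this yields N examples; you would then apply the chat template to history + target, mask the history tokens, and compute the loss only on the target turn.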
Hope this helps!
