On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification
Abstract
Dynamic Fine-Tuning (DFT) improves the generalization of Large Language Models (LLMs) by dynamically rescaling each token's gradient with its predicted probability, outperforming standard Supervised Fine-Tuning (SFT) and showing competitive results in offline reinforcement learning settings.
We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs), addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the model's generalization capabilities. To rectify this, we propose Dynamic Fine-Tuning (DFT), which stabilizes gradient updates for each token by dynamically rescaling the objective function with that token's probability. Remarkably, this single-line code change significantly outperforms standard SFT across multiple challenging benchmarks and base models, demonstrating greatly improved generalization. Additionally, our approach shows competitive results in offline RL settings, offering an effective yet simpler alternative. This work bridges theoretical insight and practical solutions, substantially advancing SFT performance. The code will be available at https://github.com/yongliang-wu/DFT.
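For concreteness, the rescaled objective can be written in a few lines of PyTorch. The sketch below is ours, not the authors' released implementation: the function name dft_loss, the masking details, and the standard causal-LM label shift are assumptions; see the repository above for the reference code.

import torch
import torch.nn.functional as F

def dft_loss(logits, labels, ignore_index=-100):
    # Standard causal-LM shift: tokens < n predict token n.
    shift_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
    shift_labels = labels[:, 1:].reshape(-1)
    # Per-token cross-entropy, kept unreduced so each token can be rescaled.
    ce = F.cross_entropy(shift_logits, shift_labels,
                         ignore_index=ignore_index, reduction="none")
    # Probability the model assigns to each target token; detached so the
    # rescaling factor itself carries no gradient.
    p_target = torch.softmax(shift_logits, dim=-1).gather(
        1, shift_labels.clamp(min=0).unsqueeze(-1)).squeeze(-1)
    mask = (shift_labels != ignore_index).float()
    # DFT loss: cross-entropy rescaled by the detached target probability.
    return (ce * p_target.detach() * mask).sum() / mask.sum().clamp(min=1)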
Community
Have you tested this on general LLMs such as Qwen2.5/Qwen3? Is the gain still substantial?
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling (2025)
- SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning (2025)
- Supervised Fine Tuning on Curated Data is Reinforcement Learning (and can be improved) (2025)
- EFRame: Deeper Reasoning via Exploration-Filter-Replay Reinforcement Learning Framework (2025)
- RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization (2025)
- Reinforcement Fine-Tuning Naturally Mitigates Forgetting in Continual Post-Training (2025)
- URPO: A Unified Reward&Policy Optimization Framework for Large Language Models (2025)
From now on, it's all DFT.
No, I don't think so.
Nice approach! I would rather put this directly into the Abstract to save readers' time:
DFT is a one-line change to standard SFT: scale each token's loss by its predicted probability (detached to avoid gradient flow).
# assumes a per-token loss (reduction="none") over flattened shift_logits / shift_labels
loss = loss * torch.softmax(shift_logits, dim=-1).gather(1, shift_labels.unsqueeze(-1)).squeeze(-1).detach()
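The .detach() is the key design choice here: the target-token probability acts as a fixed per-token weight rather than a differentiable part of the loss, so the backward pass is simply the standard cross-entropy gradient rescaled by that probability.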