arxiv:2510.03264

Front-Loading Reasoning: The Synergy between Pretraining and Post-Training Data

Published on Sep 26
· Submitted by Shrimai Prabhumoye on Oct 7
Abstract

AI-generated summary: Introducing reasoning data during pretraining significantly enhances LLM performance compared to post-training, with pretraining benefiting more from diverse data patterns while SFT benefits more from high-quality data.

The prevailing paradigm for enhancing the reasoning abilities of LLMs revolves around post-training on high-quality, reasoning-intensive data. While emerging literature suggests that reasoning data is increasingly also incorporated during the mid-training stage (a practice that is relatively proprietary and less openly characterized), the role of such data in pretraining remains unclear. In particular, because the pretraining corpora of most frontier models are opaque, the effect of reasoning data introduced at different phases of pre- and/or post-training remains underreported in the scientific literature. This raises several important questions: Is adding reasoning data earlier, during pretraining, any better than introducing it during post-training? Could earlier inclusion risk overfitting and harm generalization, or instead establish durable foundations that later fine-tuning cannot recover? We conduct the first systematic study of how reasoning data, varying in scale, diversity, and quality, affects LLM performance when introduced at different stages of training. We find that front-loading reasoning data into pretraining is critical (19% avg gain), establishing foundational capabilities that cannot be fully replicated by later-stage SFT, even with more data. We uncover an asymmetric principle for optimal data allocation: pretraining benefits most from broad diversity in reasoning patterns (11% avg gain), while SFT is more sensitive to data quality (15% avg gain). We show that high-quality pretraining data has latent effects, activated only after SFT, and that naively scaling SFT data can be detrimental, washing away the benefits of early reasoning injection. Our results challenge the conventional separation of language modeling and reasoning, providing a principled guide for strategically allocating data across the entire training pipeline to build more capable models.

Community

Paper submitter

This work investigates the underexplored role of reasoning data in the pretraining phase of LLM development. Through controlled studies varying data scale, diversity, and quality, we find that front-loading reasoning data during pretraining yields lasting improvements—up to 19% average gains—that cannot be recovered through later-stage SFT. We uncover a clear asymmetry: diverse reasoning patterns most benefit pretraining, while high-quality data drives post-training success. Moreover, high-quality pretraining data exhibits latent effects, activated only during fine-tuning, whereas naïvely scaling SFT data can erode prior gains. These findings challenge the conventional separation of language modeling and reasoning, providing a principled framework for allocating reasoning data across training stages.

Paper author

When should an LLM learn to reason? 🤔 Early in pretraining or late in fine-tuning?

Our new work, "Front-Loading Reasoning," challenges the "save it for later" approach. We show that injecting reasoning data into pretraining is critical for building models that reach the frontier.

The key? An asymmetric data strategy.
📝 Blog: https://research.nvidia.com/labs/adlr/Synergy/
🔗Paper: https://tinyurl.com/3tzkemtp

[Figure: RLVR results]

We find that "front-loading" reasoning data into pretraining creates a durable, compounding advantage.
📈 Stage 1 (Pretraining): +16% avg. gain out of the gate.
📈 Stage 2 (SFT): Advantage grows to +9.3% after fine-tuning.
📈 Stage 3 (RL): Finishes with a massive +19% lead on expert benchmarks.
SFT & RL amplify a strong foundation; they can't create one.

[Figure: base model results]

[Figure: SFT results]

The optimal data strategy is phase-dependent:
🧠 Pretraining thrives on DIVERSITY & SCALE. A broad mix of reasoning patterns builds a robust foundation, giving an +11% boost over using only narrow, high-quality data at this stage.
🎯 SFT demands QUALITY. Fine-tuning on a small, high-quality dataset is far more effective, boosting performance by +15% over a large, mixed-quality one.
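To make the asymmetry concrete, here is a minimal, hypothetical sketch of how such a phase-aware allocation could be written down as configuration. The source names, sampling weights, and quality thresholds are illustrative assumptions, not the actual mixes used in the paper.

```python
# Hypothetical phase-aware data allocation, inspired by the asymmetric principle above.
# Source names, weights, and thresholds are illustrative assumptions only.
import random
from dataclasses import dataclass


@dataclass
class PhaseDataConfig:
    """Data mix for a single training phase (pretraining or SFT)."""
    sources: dict[str, float]  # source name -> sampling weight
    min_quality: float         # quality-score threshold in [0, 1]


# Pretraining: broad DIVERSITY of reasoning patterns, permissive quality bar.
pretrain_mix = PhaseDataConfig(
    sources={
        "web_text": 0.60,
        "math_reasoning": 0.15,
        "code_reasoning": 0.15,
        "science_qa": 0.10,
    },
    min_quality=0.3,
)

# SFT: a single small, strictly filtered, high-QUALITY instruction source.
sft_mix = PhaseDataConfig(
    sources={"curated_reasoning_sft": 1.0},
    min_quality=0.9,
)


def sample_source(cfg: PhaseDataConfig) -> str:
    """Pick the next data source according to the phase's sampling weights."""
    names, weights = zip(*cfg.sources.items())
    return random.choices(names, weights=weights, k=1)[0]
```

The only point of the sketch is the split: the pretraining config trades strict quality filtering for breadth, while the SFT config does the opposite.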

High-quality data has a surprising latent effect.
Adding a small, high-quality dataset to a diverse pretraining mix showed minimal immediate gains. But after SFT, its value was "unlocked," providing an additional +4% boost.
A deep synergy exists: pretraining can instill the potential that alignment activates.

[Figure: latent synergy]

Can a model with no reasoning in its pretraining "catch up" by getting more SFT data? No.
We doubled the SFT data for our baseline model. While it improved, it still couldn't match the performance of even the weakest reasoning-pretrained model.
A strong start is irreplaceable.

[Figure: catch-up test]

Is more data always better in SFT? No.
Our ablations show that blindly scaling SFT with mixed-quality data is actively HARMFUL.
❌ Doubling the SFT data dropped math reasoning scores by 5%.
✅ Scaling with a small, high-quality dataset instead provides consistent gains (a selection sketch follows this list).
SFT is for targeted refinement, not brute-force scaling.
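As a rough illustration of quality-over-quantity at the SFT stage, here is a small, hypothetical selection routine: filter a candidate pool by a quality score and cap its size, rather than taking every available example. The quality_score field, threshold, and cap are assumptions for illustration, not the paper's actual filtering procedure.

```python
# Hypothetical SFT data selection: prefer a small, high-quality subset over a
# large, mixed-quality pool. Field names and thresholds are illustrative only.
from typing import TypedDict


class SFTExample(TypedDict):
    prompt: str
    response: str
    quality_score: float  # e.g. from an LLM judge or heuristic filter, in [0, 1]


def select_sft_data(
    pool: list[SFTExample],
    min_quality: float = 0.9,
    max_examples: int = 50_000,
) -> list[SFTExample]:
    """Keep only examples above the quality bar, then cap the set size."""
    kept = [ex for ex in pool if ex["quality_score"] >= min_quality]
    kept.sort(key=lambda ex: ex["quality_score"], reverse=True)
    return kept[:max_examples]
```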

Our work provides a principled guide for training reasoning-centric LLMs:
Don't wait: Inject reasoning data into pretraining.
Be strategic: Use DIVERSE data for pretraining, emphasize HIGH-QUALITY data for SFT.
Be careful: Avoid polluting your SFT with low-quality data.
This moves us from "more data" to a smarter, phase-aware approach.
