arxiv:2510.03264

Front-Loading Reasoning: The Synergy between Pretraining and Post-Training Data

Published on Sep 26
· Submitted by Shrimai Prabhumoye on Oct 7
Abstract

AI-generated summary: Introducing reasoning data during pretraining significantly enhances LLM performance compared to post-training, with pretraining benefiting more from diverse data patterns while SFT benefits more from high-quality data.

The prevailing paradigm for enhancing the reasoning abilities of LLMs revolves around post-training on high-quality, reasoning-intensive data. While emerging literature suggests that reasoning data is increasingly also incorporated during the mid-training stage (a practice that is relatively proprietary and less openly characterized), the role of such data in pretraining remains unclear. In particular, because the pretraining corpora of most frontier models are opaque, the effect of reasoning data introduced at different phases of pre- and/or post-training remains underreported in the scientific literature. This raises several important questions: Is adding reasoning data earlier, during pretraining, any better than introducing it during post-training? Could earlier inclusion risk overfitting and harm generalization, or instead establish durable foundations that later fine-tuning cannot recover? We conduct the first systematic study of how reasoning data, varying in scale, diversity, and quality, affects LLM performance when introduced at different stages of training. We find that front-loading reasoning data into pretraining is critical (19% avg gain), establishing foundational capabilities that cannot be fully replicated by later-stage SFT, even with more data. We uncover an asymmetric principle for optimal data allocation: pretraining benefits most from broad diversity in reasoning patterns (11% avg gain), while SFT is more sensitive to data quality (15% avg gain). We show that high-quality pretraining data has latent effects, activated only after SFT, and that naively scaling SFT data can be detrimental, washing away the benefits of early reasoning injection. Our results challenge the conventional separation of language modeling and reasoning, providing a principled guide for strategically allocating data across the entire training pipeline to build more capable models.

Community

Paper submitter

This work investigates the underexplored role of reasoning data in the pretraining phase of LLM development. Through controlled studies varying data scale, diversity, and quality, we find that front-loading reasoning data during pretraining yields lasting improvements—up to 19% average gains—that cannot be recovered through later-stage SFT. We uncover a clear asymmetry: diverse reasoning patterns most benefit pretraining, while high-quality data drives post-training success. Moreover, high-quality pretraining data exhibits latent effects, activated only during fine-tuning, whereas naïvely scaling SFT data can erode prior gains. These findings challenge the conventional separation of language modeling and reasoning, providing a principled framework for allocating reasoning data across training stages.

Paper author

When should an LLM learn to reason? 🤔 Early in pretraining or late in fine-tuning?

Our new work, "Front-Loading Reasoning," challenges the "save it for later" approach. We show that injecting reasoning data into pretraining is critical for building models that reach the frontier.

The key? An asymmetric data strategy.
📝 Blog: https://research.nvidia.com/labs/adlr/Synergy/
🔗Paper: https://tinyurl.com/3tzkemtp

[Figure: RLVR results]

We find that "front-loading" reasoning data into pretraining creates a durable, compounding advantage.
📈 Stage 1 (Pretraining): +16% avg. gain out of the gate.
📈 Stage 2 (SFT): Advantage grows to +9.3% after fine-tuning.
📈 Stage 3 (RL): Finishes with a massive +19% lead on expert benchmarks.
SFT & RL amplify a strong foundation; they can't create one.

[Figure: base model results]

[Figure: SFT results]

The optimal data strategy is phase-dependent:
🧠 Pretraining thrives on DIVERSITY & SCALE. A broad mix of reasoning patterns builds a robust foundation, giving an +11% boost over using only narrow, high-quality data at this stage.
🎯 SFT demands QUALITY. Fine-tuning on a small, high-quality dataset is far more effective, boosting performance by +15% over a large, mixed-quality one.
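To make the asymmetry concrete, here is a minimal, hypothetical sketch of how such a phase-aware allocation could be written down as configuration. The source names, sampling weights, and quality thresholds are illustrative assumptions, not the actual mixes used in the paper.

```python
# Hypothetical phase-aware data allocation, inspired by the asymmetric principle above.
# Source names, weights, and thresholds are illustrative assumptions only.
import random
from dataclasses import dataclass


@dataclass
class PhaseDataConfig:
    """Data mix for a single training phase (pretraining or SFT)."""
    sources: dict[str, float]  # source name -> sampling weight
    min_quality: float         # quality-score threshold in [0, 1]


# Pretraining: broad DIVERSITY of reasoning patterns, permissive quality bar.
pretrain_mix = PhaseDataConfig(
    sources={
        "web_text": 0.60,
        "math_reasoning": 0.15,
        "code_reasoning": 0.15,
        "science_qa": 0.10,
    },
    min_quality=0.3,
)

# SFT: a single small, strictly filtered, high-QUALITY instruction source.
sft_mix = PhaseDataConfig(
    sources={"curated_reasoning_sft": 1.0},
    min_quality=0.9,
)


def sample_source(cfg: PhaseDataConfig) -> str:
    """Pick the next data source according to the phase's sampling weights."""
    names, weights = zip(*cfg.sources.items())
    return random.choices(names, weights=weights, k=1)[0]
```

The only point of the sketch is the split: the pretraining config trades strict quality filtering for breadth, while the SFT config does the opposite.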

High-quality data has a surprising latent effect.
Adding a small, high-quality dataset to a diverse pretraining mix showed minimal immediate gains. But after SFT, its value was "unlocked," providing an additional +4% boost.
A deep synergy exists: pretraining can instill the potential that alignment activates.

[Figure: latent synergy]

Can a model with no reasoning in its pretraining "catch up" by getting more SFT data? No.
We doubled the SFT data for our baseline model. While it improved, it still couldn't match the performance of even the weakest reasoning-pretrained model.
A strong start is irreplaceable.

[Figure: catch-up test]

Is more data always better in SFT? No.
Our ablations show that blindly scaling SFT with mixed-quality data is actively HARMFUL.
❌ Doubling the SFT data dropped math reasoning scores by 5%.
✅ Scaling with a small, high-quality dataset instead provides consistent gains (a selection sketch follows this list).
SFT is for targeted refinement, not brute-force scaling.
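As a rough illustration of quality-over-quantity at the SFT stage, here is a small, hypothetical selection routine: filter a candidate pool by a quality score and cap its size, rather than taking every available example. The quality_score field, threshold, and cap are assumptions for illustration, not the paper's actual filtering procedure.

```python
# Hypothetical SFT data selection: prefer a small, high-quality subset over a
# large, mixed-quality pool. Field names and thresholds are illustrative only.
from typing import TypedDict


class SFTExample(TypedDict):
    prompt: str
    response: str
    quality_score: float  # e.g. from an LLM judge or heuristic filter, in [0, 1]


def select_sft_data(
    pool: list[SFTExample],
    min_quality: float = 0.9,
    max_examples: int = 50_000,
) -> list[SFTExample]:
    """Keep only examples above the quality bar, then cap the set size."""
    kept = [ex for ex in pool if ex["quality_score"] >= min_quality]
    kept.sort(key=lambda ex: ex["quality_score"], reverse=True)
    return kept[:max_examples]
```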

Our work provides a principled guide for training reasoning-centric LLMs:
Don't wait: Inject reasoning data into pretraining.
Be strategic: Use DIVERSE data for pretraining, emphasize HIGH-QUALITY data for SFT.
Be careful: Avoid polluting your SFT with low-quality data.
This moves us from "more data" to a smarter, phase-aware approach.
