Abstract
Synthetic Bootstrapped Pretraining (SBP) improves language model performance by learning inter-document correlations and synthesizing new training data, yielding consistent gains over a compute-matched repetition baseline.
We introduce Synthetic Bootstrapped Pretraining (SBP), a language model (LM) pretraining procedure that first learns a model of relations between documents from the pretraining dataset and then leverages it to synthesize a vast new corpus for joint training. While standard pretraining teaches LMs to learn causal correlations among tokens within a single document, it is not designed to efficiently model the rich, learnable inter-document correlations that can potentially lead to better performance. We validate SBP by designing a compute-matched pretraining setup and pretraining a 3B-parameter model from scratch on up to 1T tokens. We find that SBP consistently improves upon a strong repetition baseline and delivers a significant fraction of the performance improvement attainable by an oracle upper bound with access to 20x more unique data. Qualitative analysis reveals that the synthesized documents go beyond mere paraphrases: SBP first abstracts a core concept from the seed material and then crafts a new narration on top of it. Beyond its strong empirical performance, SBP admits a natural Bayesian interpretation: the synthesizer implicitly learns to abstract the latent concepts shared between related documents.
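To make the high-level recipe concrete, below is a minimal, illustrative Python sketch of the SBP data flow: mine related document pairs, sample new documents from a synthesizer conditioned on seed documents, and mix them with the original corpus for joint training. The bag-of-words cosine pair-mining and the `synthesize_fn` stand-in are assumptions made only for illustration, not the paper's implementation; in SBP the synthesizer is itself a learned model of inter-document relations.

```python
# Minimal sketch of the SBP recipe; names and the pair-mining heuristic are
# illustrative assumptions, not the authors' implementation.
from collections import Counter
from math import sqrt
from typing import Callable, List, Tuple


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def mine_related_pairs(corpus: List[str], threshold: float = 0.3) -> List[Tuple[str, str]]:
    """Step 1: find ordered document pairs (d1, d2) that plausibly share a latent
    concept. Here: bag-of-words cosine similarity; a real system would use a
    stronger retrieval method."""
    bows = [Counter(doc.lower().split()) for doc in corpus]
    return [
        (corpus[i], corpus[j])
        for i in range(len(corpus))
        for j in range(len(corpus))
        if i != j and _cosine(bows[i], bows[j]) >= threshold
    ]


def synthesize_corpus(
    pairs: List[Tuple[str, str]],
    synthesize_fn: Callable[[str], str],
    samples_per_seed: int = 1,
) -> List[str]:
    """Steps 2-3: a synthesizer trained on (d1 -> d2) pairs is sampled with seed
    documents d1 to produce new, related documents."""
    seeds = sorted({d1 for d1, _ in pairs})
    return [synthesize_fn(seed) for seed in seeds for _ in range(samples_per_seed)]


def build_training_mix(real: List[str], synthetic: List[str]) -> List[str]:
    """Step 4: jointly pretrain on the original and synthesized documents."""
    return real + synthetic


if __name__ == "__main__":
    corpus = [
        "transformers use attention to model token dependencies",
        "attention lets transformers model dependencies between tokens",
        "gradient descent minimizes a loss function over parameters",
    ]
    pairs = mine_related_pairs(corpus, threshold=0.2)
    # Stand-in for sampling from a trained synthesizer LM p(d2 | d1).
    toy_synthesizer = lambda seed: f"a new document elaborating on: {seed}"
    training_mix = build_training_mix(corpus, synthesize_corpus(pairs, toy_synthesizer))
    print(len(pairs), "mined pairs;", len(training_mix), "documents in the training mix")
```

The sketch only shows the orchestration; the paper's contribution lies in training the synthesizer on related-document pairs and scaling the resulting synthetic corpus under a compute-matched budget.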
Community
Unveiling Synthetic Bootstrapped Pretraining (SBP), a synthetic pretraining method that doesn’t rely on teacher distillation.
- SBP goes beyond next-token supervision in a single document by leveraging inter-document correlations to synthesize new data for training — no teacher needed.
- Validated on 1T tokens with a 3B-parameter model trained from scratch.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining (2025)
- Learning Facts at Scale with Active Reading (2025)
- Comparing Knowledge Injection Methods for LLMs in a Low-Resource Regime (2025)
- Distilled Pretraining: A modern lens of Data, In-Context Learning and Test-Time Scaling (2025)
- MLP Memory: Language Modeling with Retriever-pretrained External Memory (2025)
- Large-Scale Diverse Synthesis for Mid-Training (2025)
- Patent Language Model Pretraining with ModernBERT (2025)