Learning Dynamics in Continual Pre-Training for Large Language Models
Abstract
Continual Pre-Training (CPT) has become a popular and effective method for adapting strong foundation models to specific downstream tasks. In this work, we explore the learning dynamics throughout the CPT process for large language models. We specifically focus on how general and downstream-domain performance evolves at each training step, with performance measured via validation losses. We observe that the CPT loss curve fundamentally characterizes a transition from one hidden loss curve to another, and can be described by decoupling the effects of distribution shift and learning rate annealing. We derive a CPT scaling law that combines these two factors, enabling the prediction of loss at any (continual) training step and across learning rate schedules (LRS) in CPT. Our formulation provides a comprehensive understanding of several critical factors in CPT, including loss potential, peak learning rate, training steps, and replay ratio. Moreover, our approach can be adapted to customize training hyper-parameters for different CPT goals, such as balancing general and domain-specific performance. Extensive experiments demonstrate that our scaling law holds across various CPT datasets and training hyper-parameters.
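To make the idea of predicting CPT validation loss from a fitted law more concrete, below is a minimal illustrative sketch. The functional form, parameter names, and synthetic data are assumptions for demonstration only and are not the paper's actual scaling law: it fits a toy model in which one term is driven by the cumulative (annealed) learning rate and another power-law term stands in for the distribution-shift transition toward the new domain.

```python
# Hypothetical sketch only: fit a toy CPT-style loss model that decouples
# (i) learning-rate annealing and (ii) distribution shift.
# The functional form below is illustrative, NOT the paper's exact law.
import numpy as np
from scipy.optimize import curve_fit

def cpt_loss(inputs, L_inf, A, alpha, B, beta):
    """Toy model: L = L_inf + A * S(t)^(-alpha) + B * t^(-beta), where
    S(t) is the cumulative learning rate up to step t (annealing-driven
    term) and t^(-beta) stands in for the distribution-shift transition."""
    t, cum_lr = inputs
    return L_inf + A * np.power(cum_lr, -alpha) + B * np.power(t, -beta)

# Example: a cosine learning-rate schedule over 10,000 continual steps.
steps = np.arange(1, 10_001, dtype=float)
lr = 1e-4 * 0.5 * (1.0 + np.cos(np.pi * steps / steps[-1]))
cum_lr = np.cumsum(lr)

# Synthetic "observed" validation losses standing in for real CPT logs.
rng = np.random.default_rng(0)
observed = cpt_loss((steps, cum_lr), 1.9, 0.8, 0.4, 2.0, 0.5)
observed = observed + rng.normal(0.0, 0.005, size=steps.size)

# Fit the toy law; extrapolating to unseen steps or a different learning
# rate schedule just means recomputing `cum_lr` and re-evaluating the fit.
params, _ = curve_fit(cpt_loss, (steps, cum_lr), observed,
                      p0=[2.0, 1.0, 0.4, 1.0, 0.5], maxfev=20_000)
print("fitted (L_inf, A, alpha, B, beta):", np.round(params, 3))
```

In this kind of setup, the fitted parameters can then be reused to compare candidate learning rate schedules or replay ratios before launching a full continual run, which is the practical use case the abstract describes.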
Community
Learning dynamics in continual pre-training for LLMs (ICML 2025 Spotlight).
We find an accurate law that traces the performance of continual pre-training across many variables (e.g., learning rate, loss potential, training steps, replay ratio).
Many interesting and insightful findings here!
We welcome any feedback, comments, and discussion!
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- A Multi-Power Law for Loss Curve Prediction Across Learning Rate Schedules (2025)
- Large Language Model Empowered Recommendation Meets All-domain Continual Pre-Training (2025)
- Domain-Adaptive Continued Pre-Training of Small Language Models (2025)
- LENSLLM: Unveiling Fine-Tuning Dynamics for LLM Selection (2025)
- Overtrained Language Models Are Harder to Fine-Tune (2025)
- Compression Laws for Large Language Models (2025)
- SkyLadder: Better and Faster Pretraining via Context Window Scheduling (2025)