arxiv:2503.15450

SkyLadder: Better and Faster Pretraining via Context Window Scheduling

Published on Mar 19 · Submitted by SivilTaram on Mar 20

Abstract

Recent advancements in LLM pretraining have featured ever-expanding context windows to process longer sequences. However, our pilot study reveals that models pretrained with shorter context windows consistently outperform their long-context counterparts under a fixed token budget. This finding motivates us to explore an optimal context window scheduling strategy to better balance long-context capability with pretraining efficiency. To this end, we propose SkyLadder, a simple yet effective approach that implements a short-to-long context window transition. SkyLadder preserves strong standard benchmark performance, while matching or exceeding baseline results on long context tasks. Through extensive experiments, we pre-train 1B-parameter models (up to 32K context) and 3B-parameter models (8K context) on 100B tokens, demonstrating that SkyLadder yields consistent gains of up to 3.7% on common benchmarks, while achieving up to 22% faster training speeds compared to baselines. The code is at https://github.com/sail-sg/SkyLadder.
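The abstract describes SkyLadder as a short-to-long context window transition under a fixed token budget, but does not spell out the schedule shape here. The sketch below is a minimal illustration, assuming a simple linear ramp from a short initial window to the final window over a warm-up fraction of training; the function name, default window sizes, and ramp fraction are illustrative choices, not taken from the paper.

```python
# Minimal sketch of a short-to-long context window schedule (assumed linear ramp).
# The paper may use a different shape; all defaults here are illustrative.

def context_window_at(step: int,
                      total_steps: int,
                      init_window: int = 512,
                      final_window: int = 32_768,
                      ramp_frac: float = 0.8) -> int:
    """Return the context window (in tokens) to use at a given training step."""
    ramp_steps = max(1, int(total_steps * ramp_frac))
    progress = min(1.0, step / ramp_steps)
    window = init_window + progress * (final_window - init_window)
    # Round down to a multiple of the starting window so packed batches stay aligned.
    return int(window // init_window) * init_window


# Example: a 100k-step run spends most early steps at short windows and
# only reaches the full 32K window near the end of the ramp.
for s in (0, 25_000, 50_000, 80_000, 100_000):
    print(s, context_window_at(s, total_steps=100_000))
```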

Community

Paper author and submitter:

The evolution of language models has seen increasing context window sizes. Early models like GPT and BERT had a limit of 512 tokens, while GPT-2 expanded this to 1024. Llama models pushed it further: Llama (2048), Llama-2 (4096), and Llama-3 (8192).

This expansion aims to enhance model performance by reducing document truncation and maintaining coherence. However, our research challenges the belief that larger context windows improve performance. In controlled experiments, we found that models with shorter context windows consistently outperformed those with longer ones across popular benchmarks.

Inspired by this, we propose SkyLadder to benefit from short-context pretraining via context window scheduling! It is both faster and better for pretraining, yielding up to a 3.7% performance improvement and up to a 22% speedup!
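One way such a schedule can be applied without changing a packed data pipeline is to keep sequences at the final length and restrict attention to chunks of the currently scheduled window. The sketch below builds a block-diagonal causal mask in PyTorch for that purpose; this is an assumed implementation detail for illustration, not necessarily how SkyLadder realizes the schedule.

```python
import torch

def chunked_attention_mask(seq_len: int, window: int) -> torch.Tensor:
    """Causal attention mask that also blocks attention across `window`-sized chunks.

    Tokens may only attend to earlier tokens inside their own chunk, so a batch
    packed at the full sequence length behaves like shorter-context training.
    """
    positions = torch.arange(seq_len)
    causal = positions[None, :] <= positions[:, None]            # lower-triangular
    same_chunk = (positions[None, :] // window) == (positions[:, None] // window)
    return causal & same_chunk                                    # True = may attend


# Example: an 8-token packed sequence trained with a scheduled window of 4
# attends as two independent 4-token segments.
print(chunked_attention_mask(8, 4).int())
```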

