LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization

This repo provides the checkpoint of Mistral-7B-LongPO-128K from our paper "LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization".
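As a quick start, below is a minimal, unofficial sketch of loading the checkpoint with Hugging Face `transformers`; the prompt text, generation length, and hardware settings are placeholders to adapt to your setup, and inference near the 128K-token limit requires substantial GPU memory.

```python
# Minimal usage sketch (assumptions: a recent transformers release and a BF16-capable GPU;
# the prompt and max_new_tokens are illustrative placeholders, not values from the paper).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "DAMO-NLP-SG/Mistral-7B-LongPO-128K"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # checkpoint weights are stored in BF16
    device_map="auto",
)

messages = [{"role": "user", "content": "Summarize the following document:\n..."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```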


Highlights of LongPO

  • Self-evolving long-context alignment without annotations from humans or superior LLMs.
  • Extends the context length while preserving alignment in a single training stage.
  • No degradation of short-context capabilities.

Models and Training Data

| Models | Base Model | Training Data | # Data Samples |
|---|---|---|---|
| Mistral-7B-LongPO-128K | Mistral-7B-Instruct-v0.2 | HF Link | 45K |
| Qwen2.5-7B-LongPO-128K | Qwen2.5-7B-Instruct | HF Link | 32K |
| Mistral-7B-LongPO-256K-EXP* | Mistral-7B-LongPO-128K | HF Link | 16K |
| Mistral-7B-LongPO-512K-EXP* | Mistral-7B-LongPO-128K | HF Link | 2.5K |

* indicates an experimental version (prepared for rebuttal purposes) that may not have been fully tuned or given sufficient data to reach convergence.

Evaluation

InfiniteBench

| Model | Train/Claimed Length | En.Sum | En.QA | En.MC | AVG. |
|---|---|---|---|---|---|
| GPT-4-128K | 128K | 14.73 | 22.44 | 67.25 | 34.81 |
| Qwen2-72B | 128K | 24.32ᵇ | 7.03ᵇ | 72.05ᵇ | 34.47ᵇ |
| LLaMA 3.1-70B | 128K | 33.55ᵇ | 36.08ᵇ | 69.00ᵇ | 46.21ᵇ |
| LLaMA 3.1-8B | 128K | 28.06ᵇ | 30.47ᵇ | 58.08ᵇ | 38.87ᵇ |
| GLM-4-9B | 128K | 14.84ᵇ | 9.51ᵇ | 67.25ᵇ | 30.53ᵇ |
| GLM-4-9B-1M | 1M | 28.3 | 9.7 | 68.6 | 35.53 |
| LWM-7B-1M | 1M | 4.33ᵇ | 0.0ᵇ | 3.06ᵇ | 2.46ᵇ |
| YaRN-Mistral-7B | 128K | 9.09 | 9.55 | 27.95 | 15.53 |
| Mistral-7B | 32K | 22.13 | 4.93 | 14.41 | 13.82 |
| - SFT | 128K | 23.44 | 13.45 | 53.21 | 30.03 |
| - DPO | 128K | 15.21 | 10.34 | 48.14 | 25.56 |
| - LongPO (iter1) | 128K | 27.05 | 23.51 | 67.25 | 39.27 |
| - LongPO (iter2) | 256K | 28.16 | 24.43 | 66.35 | 39.65 |
| - LongPO (iter3) | 512K | 29.10 | 27.85 | 66.67 | 41.21 |
| Qwen2.5-7B | 128K | 22.89 | 6.08 | 52.4 | 27.12 |
| - LongPO (iter1) | 128K | 32.06 | 17.32 | 72.05 | 40.48 |
  • Our results are evaluated with greedy decoding (see the sketch after these notes).
  • Baseline results marked with ᵇ are evaluated by us; unmarked baseline results are taken from the corresponding official reports.
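For clarity, here is a hypothetical illustration of what the greedy-decoding setting corresponds to in `transformers`; the `max_new_tokens` value is an assumption, and the actual prompts and truncation follow the respective benchmark harnesses.

```python
# Hypothetical sketch of the greedy-decoding setting (not the exact evaluation harness).
from transformers import GenerationConfig

greedy = GenerationConfig(
    do_sample=False,      # greedy decoding: always take the argmax token
    num_beams=1,          # no beam search
    max_new_tokens=1024,  # assumed cap; each benchmark harness sets its own limit
)
# output_ids = model.generate(input_ids, generation_config=greedy)
```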

RULER

| Model | NIAH | VT | AGG | QA | AVG (13 tasks) |
|---|---|---|---|---|---|
| Qwen2.5-7B-Instruct | 82.10 | 80.09 | 74.50 | 54.30 | 76.50 |
| Qwen2.5-7B-LongPO-128K | 95.82 | 89.71 | 78.67 | 59.40 | 87.11 |
| Mistral-7B-Instruct-v0.2 | 72.60 | 74.40 | 64.40 | 52.20 | 68.40 |
| Mistral-7B-LongPO-128K | 96.88 | 96.49 | 71.55 | 64.81 | 88.02 |
| Mistral-7B-LongPO-256K-EXP | 96.80 | 97.00 | 69.14 | 64.87 | 87.65 |
| Mistral-7B-LongPO-512K-EXP | 97.28 | 97.48 | 69.22 | 64.92 | 88.00 |

Short Context

| Model | MMLU | ARC-C | Hellaswag | Winogrande | Avg |
|---|---|---|---|---|---|
| Mistral-7B-Instruct-v0.2 | 59.15 | 59.26 | 83.2 | 78.4 | 70.00 |
| Mistral-7B-LongPO-128K | 59.99 | 59.34 | 82.99 | 78.53 | 70.21 |
| Mistral-7B-LongPO-256K-EXP | 59.47 | 60.28 | 83.14 | 78.14 | 70.26 |
| Mistral-7B-LongPO-512K-EXP | 59.51 | 60.58 | 82.87 | 77.66 | 70.16 |
| Qwen2.5-7B-Instruct | 74.28 | 67.15 | 81.41 | 74.66 | 74.38 |
| Qwen2.5-7B-LongPO-128K | 73.64 | 65.70 | 80.82 | 74.98 | 73.79 |

Citation

If you find our project useful, we hope you will star our repo and cite our paper as follows:

@inproceedings{
    chen2025longpo,
    title={Long{PO}: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization},
    author={Guanzheng Chen and Xin Li and Michael Shieh and Lidong Bing},
    booktitle={The Thirteenth International Conference on Learning Representations},
    year={2025},
    url={https://openreview.net/forum?id=qTrEq31Shm}
}