LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization

This repo provides the checkpoint of Mistral-7B-LongPO-128K from our paper "LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization".
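As a quick start, below is a minimal, unofficial sketch of loading the checkpoint with Hugging Face `transformers`; the prompt text, generation length, and hardware settings are placeholders to adapt to your setup, and inference near the 128K-token limit requires substantial GPU memory.

```python
# Minimal usage sketch (assumptions: a recent transformers release and a BF16-capable GPU;
# the prompt and max_new_tokens are illustrative placeholders, not values from the paper).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "DAMO-NLP-SG/Mistral-7B-LongPO-128K"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # checkpoint weights are stored in BF16
    device_map="auto",
)

messages = [{"role": "user", "content": "Summarize the following document:\n..."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```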


Highlights of LongPO

  • Self-evolving long-context alignment without annotations from humans or superior LLMs.
  • Extends the context length while preserving alignment in a single training stage.
  • No degradation of short-context capabilities.

Models and Training Data

| Models | Base Model | Training Data | # Data Samples |
|---|---|---|---|
| Mistral-7B-LongPO-128K | Mistral-7B-Instruct-v0.2 | HF Link | 45K |
| Qwen2.5-7B-LongPO-128K | Qwen2.5-7B-Instruct | HF Link | 32K |
| Mistral-7B-LongPO-256K-EXP* | Mistral-7B-LongPO-128K | HF Link | 16K |
| Mistral-7B-LongPO-512K-EXP* | Mistral-7B-LongPO-128K | HF Link | 2.5K |

* indicates an experimental version (prepared for rebuttal purposes) that may not have been fully tuned or given sufficient data to reach convergence.

Evaluation

InfiniteBench

| Model | Train/Claimed Length | En.Sum | En.QA | En.MC | AVG. |
|---|---|---|---|---|---|
| GPT-4-128K | 128K | 14.73 | 22.44 | 67.25 | 34.81 |
| Qwen2-72B | 128K | 24.32ᵇ | 7.03ᵇ | 72.05ᵇ | 34.47ᵇ |
| LLaMA 3.1-70B | 128K | 33.55ᵇ | 36.08ᵇ | 69.00ᵇ | 46.21ᵇ |
| LLaMA 3.1-8B | 128K | 28.06ᵇ | 30.47ᵇ | 58.08ᵇ | 38.87ᵇ |
| GLM-4-9B | 128K | 14.84ᵇ | 9.51ᵇ | 67.25ᵇ | 30.53ᵇ |
| GLM-4-9B-1M | 1M | 28.3 | 9.7 | 68.6 | 35.53 |
| LWM-7B-1M | 1M | 4.33ᵇ | 0.0ᵇ | 3.06ᵇ | 2.46ᵇ |
| YaRN-Mistral-7B | 128K | 9.09 | 9.55 | 27.95 | 15.53 |
| Mistral-7B | 32K | 22.13 | 4.93 | 14.41 | 13.82 |
| - SFT | 128K | 23.44 | 13.45 | 53.21 | 30.03 |
| - DPO | 128K | 15.21 | 10.34 | 48.14 | 25.56 |
| - LongPO (iter1) | 128K | 27.05 | 23.51 | 67.25 | 39.27 |
| - LongPO (iter2) | 256K | 28.16 | 24.43 | 66.35 | 39.65 |
| - LongPO (iter3) | 512K | 29.10 | 27.85 | 66.67 | 41.21 |
| Qwen2.5-7B | 128K | 22.89 | 6.08 | 52.4 | 27.12 |
| - LongPO (iter1) | 128K | 32.06 | 17.32 | 72.05 | 40.48 |
  • Our results are evaluated with greedy decoding (see the sketch after these notes).
  • Baseline results marked with ᵇ are evaluated by us; unmarked baseline results are taken from the corresponding official reports.
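For clarity, here is a hypothetical illustration of what the greedy-decoding setting corresponds to in `transformers`; the `max_new_tokens` value is an assumption, and the actual prompts and truncation follow the respective benchmark harnesses.

```python
# Hypothetical sketch of the greedy-decoding setting (not the exact evaluation harness).
from transformers import GenerationConfig

greedy = GenerationConfig(
    do_sample=False,      # greedy decoding: always take the argmax token
    num_beams=1,          # no beam search
    max_new_tokens=1024,  # assumed cap; each benchmark harness sets its own limit
)
# output_ids = model.generate(input_ids, generation_config=greedy)
```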

RULER

| Model | NIAH | VT | AGG | QA | AVG (13 tasks) |
|---|---|---|---|---|---|
| Qwen2.5-7B-Instruct | 82.10 | 80.09 | 74.50 | 54.30 | 76.50 |
| Qwen2.5-7B-LongPO-128K | 95.82 | 89.71 | 78.67 | 59.40 | 87.11 |
| Mistral-7B-Instruct-v0.2 | 72.60 | 74.40 | 64.40 | 52.20 | 68.40 |
| Mistral-7B-LongPO-128K | 96.88 | 96.49 | 71.55 | 64.81 | 88.02 |
| Mistral-7B-LongPO-256K-EXP | 96.80 | 97.00 | 69.14 | 64.87 | 87.65 |
| Mistral-7B-LongPO-512K-EXP | 97.28 | 97.48 | 69.22 | 64.92 | 88.00 |

Short Context

| Model | MMLU | ARC-C | Hellaswag | Winogrande | Avg |
|---|---|---|---|---|---|
| Mistral-7B-Instruct-v0.2 | 59.15 | 59.26 | 83.2 | 78.4 | 70.00 |
| Mistral-7B-LongPO-128K | 59.99 | 59.34 | 82.99 | 78.53 | 70.21 |
| Mistral-7B-LongPO-256K-EXP | 59.47 | 60.28 | 83.14 | 78.14 | 70.26 |
| Mistral-7B-LongPO-512K-EXP | 59.51 | 60.58 | 82.87 | 77.66 | 70.16 |
| Qwen2.5-7B-Instruct | 74.28 | 67.15 | 81.41 | 74.66 | 74.38 |
| Qwen2.5-7B-LongPO-128K | 73.64 | 65.70 | 80.82 | 74.98 | 73.79 |

Citation

If you find our project useful, we hope you will star our repo and cite our paper as follows:

@inproceedings{
    chen2025longpo,
    title={Long{PO}: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization},
    author={Guanzheng Chen and Xin Li and Michael Shieh and Lidong Bing},
    booktitle={The Thirteenth International Conference on Learning Representations},
    year={2025},
    url={https://openreview.net/forum?id=qTrEq31Shm}
}