LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization
This repo provides the checkpoint of Mistral-7B-LongPO-512K in our paper "LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization".
(Note that it is an experimental an experimental version (for rebuttal purposes) that may have not been fully tuned or provided with sufficient data to achieve convergence.)
Highlights of LongPO
- Self-evolving long-context alignment without human/superior LLMs annotations.
- Extending context length while keeping aligned in one stage.
- No degradation on short-context capabilities.
Models and Training Data
Models | Base Model | Training Data | # Data Samples |
---|---|---|---|
Mistral-7B-LongPO-128K | Mistral-7B-Instruct-v0.2 | HF Link | 45K |
Qwen2.5-7B-LongPO-128K | Qwen2.5-7B-Instruct | HF Link | 32K |
Mistral-7B-LongPO-256K-EXP* | Mistral-7B-LongPO-128K | HF Link | 16K |
Mistral-7B-LongPO-512K-EXP* | Mistral-7B-LongPO-128K | HF Link | 2.5K |
* indicates an experimental version (for rebuttal purposes) that may have not been fully tuned or provided with sufficient data to achieve convergence.
Evaluation
InfiniteBench
Model | Train/Claimed Length | En.Sum | En.QA | En.MC | AVG. |
---|---|---|---|---|---|
GPT-4-128K | 128K | 14.73 | 22.44 | 67.25 | 34.81 |
Qwen2-72B | 128K | 24.32ᵇ | 7.03ᵇ | 72.05ᵇ | 34.47ᵇ |
LLaMA 3.1-70B | 128K | 33.55ᵇ | 36.08ᵇ | 69.00ᵇ | 46.21ᵇ |
LLaMA 3.1-8B | 128K | 28.06ᵇ | 30.47ᵇ | 58.08ᵇ | 38.87ᵇ |
GLM-4-9B | 128K | 14.84ᵇ | 9.51ᵇ | 67.25ᵇ | 30.53ᵇ |
GLM-4-9B-1M | 1M | 28.3 | 9.7 | 68.6 | 35.53 |
LWM-7B-1M | 1M | 4.33ᵇ | 0.0ᵇ | 3.06ᵇ | 2.46ᵇ |
YaRN-Mistral-7B | 128K | 9.09 | 9.55 | 27.95 | 15.53 |
Mistral-7B | 32K | 22.13 | 4.93 | 14.41 | 13.82 |
- SFT | 128K | 23.44 | 13.45 | 53.21 | 30.03 |
- DPO | 128K | 15.21 | 10.34 | 48.14 | 25.56 |
- LongPO (iter1) | 128K | 27.05 | 23.51 | 67.25 | 39.27 |
- LongPO (iter2) | 256K | 28.16 | 24.43 | 66.35 | 39.65 |
- LongPO (iter3) | 512K | 29.10 | 27.85 | 66.67 | 41.21 |
Qwen2.5-7B | 128K | 22.89 | 6.08 | 52.4 | 27.12 |
- LongPO (iter1) | 128K | 32.06 | 17.32 | 72.05 | 40.48 |
- Our results are evaluated with greedy decoding.
- Baseline results marked with ᵇ are evaluated by us, while unmarked baseline results are sourced from their official report.
RULER
Model | NIAH | VT | AGG | QA | AVG (13 tasks) |
---|---|---|---|---|---|
Qwen2.5-7B-Instruct | 82.10 | 80.09 | 74.50 | 54.30 | 76.50 |
Qwen2.5-7B-LongPO-128K | 95.82 | 89.71 | 78.67 | 59.40 | 87.11 |
Mistral-7B-Instruct-v0.2 | 72.60 | 74.40 | 64.40 | 52.20 | 68.40 |
Mistral-7B-LongPO-128K | 96.88 | 96.49 | 71.55 | 64.81 | 88.02 |
Mistral-7B-LongPO-256K-EXP | 96.80 | 97.00 | 69.14 | 64.87 | 87.65 |
Mistral-7B-LongPO-512K-EXP | 97.28 | 97.48 | 69.22 | 64.92 | 88.00 |
Short Context
Model | MMLU | ARC-C | Hellaswag | Winogrande | Avg |
---|---|---|---|---|---|
Mistral-7B-Instruct-v0.2 | 59.15 | 59.26 | 83.2 | 78.4 | 70.00 |
Mistral-7B-LongPO-128K | 59.99 | 59.34 | 82.99 | 78.53 | 70.21 |
Mistral-7B-LongPO-256K-EXP | 59.47 | 60.28 | 83.14 | 78.14 | 70.26 |
Mistral-7B-LongPO-512K-EXP | 59.51 | 60.58 | 82.87 | 77.66 | 70.16 |
Qwen2.5-7B-Instruct | 74.28 | 67.15 | 81.41 | 74.66 | 74.38 |
Qwen2.5-7B-LongPO-128K | 73.64 | 65.70 | 80.82 | 74.98 | 73.79 |
Citation
If you find our project useful, hope you can star our repo and cite our paper as follows:
@inproceedings{
chen2025longpo,
title={Long{PO}: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization},
author={Guanzheng Chen and Xin Li and Michael Shieh and Lidong Bing},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=qTrEq31Shm}
}
- Downloads last month
- 12
Inference Providers
NEW
This model is not currently available via any of the supported Inference Providers.
Model tree for DAMO-NLP-SG/Mistral-7B-LongPO-512K-EXP
Base model
mistralai/Mistral-7B-Instruct-v0.2
Finetuned
DAMO-NLP-SG/Mistral-7B-LongPO-128K