ffgcc
commited on
Commit
·
e88547e
1
Parent(s):
8f1417d
readme
Browse files
README.md
ADDED
|
@@ -0,0 +1,179 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
## LongMagpie: A Self-synthesis Method for Generating Large-scale Long-context Instructions
|
| 2 |
+
|
| 3 |
+
This repository contains the code, models and datasets for our paper [LongMagpie: A Self-synthesis Method for Generating Large-scale Long-context Instructions].
|
| 4 |
+
|
| 5 |
+
|
| 6 |
+
## Quick Links
|
| 7 |
+
|
| 8 |
+
- [Overview](#overview)
|
| 9 |
+
- [LongMagpie Models](#LongMagpie-models)
|
| 10 |
+
- [LongMagpie Datasets](#LongMagpie-datasets)
|
| 11 |
+
- [Datasets list](#datasets-list)
|
| 12 |
+
- [Train Llama-3-8B-LongMagpie-512K-Instruct](#train-LongMagpie512K)
|
| 13 |
+
- [Requirements](#requirements)
|
| 14 |
+
- [Evaluation](#evaluation)
|
| 15 |
+
- [Build your long-context instruction data](#build-long-data)
|
| 16 |
+
- [Bugs or Questions?](#bugs-or-questions)
|
| 17 |
+
|
| 18 |
+
|
| 19 |
+
<a id="overview"></a>
|
| 20 |
+
|
| 21 |
+
## Overview
|
| 22 |
+
|
| 23 |
+
|
| 24 |
+
High-quality long-context instruction data is essential for aligning long-context large language models (LLMs). Despite the public release of models like Qwen and Llama, their long-context instruction data remains proprietary. Human annotation is costly and challenging, while template-based synthesis methods limit scale, diversity, and quality. We introduce LongMagpie, a self-synthesis framework that automatically generates large-scale long-context instruction data. Our key insight is that aligned long-context LLMs, when presented with a document followed by special tokens preceding a user turn, auto-regressively generate contextually relevant queries. By harvesting these document-query pairs and the model's responses, LongMagpie produces high-quality instructions without human effort. Experiments on HELMET, RULER, and Longbench v2 demonstrate that LongMagpie achieves leading performance on long-context tasks while maintaining competitive performance on short-context tasks, establishing it as a simple and effective approach for open, diverse, and scalable long-context instruction data synthesis.
|
| 25 |
+
|
| 26 |
+
<div style="text-align: center;">
|
| 27 |
+
<img src="figure/LongMagpie.png" width="700" height="350">
|
| 28 |
+
</div>
|
| 29 |
+
|
| 30 |
+
<a id="LongMagpie-models"></a>
|
| 31 |
+
|
| 32 |
+
## LongMagpie Models
|
| 33 |
+
|
| 34 |
+
Our released models are listed as follows. You can import these models by using [HuggingFace's Transformers](https://github.com/huggingface/transformers). All models are trained on long-context instruction data synthesized by [fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) and [Qwen/Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) model. In the following comparision, we choose [Llama-3-8B-NExtLong-512K-Instruct](https://huggingface.co/caskcsg/Llama-3-8B-NExtLong-512K-Instruct) as a baseline model, which is trained with [Magpie instruction data](https://huggingface.co/datasets/Magpie-Align/Magpie-Llama-3.3-Pro-1M-v0.1). In addition, to maintain short-text performance, we propose a p-mix strategy that combines LongMagpie and [UltraChat](https://huggingface.co/datasets/stingning/ultrachat) datasets, resulting in a performance-balanced model [Llama-3-8B-LongMagpie-p-mix-512K-Instruct](https://huggingface.co/caskcsg/Llama-3-8B-LongMagpie-512K-Instruct).
|
| 35 |
+
|
| 36 |
+
|
| 37 |
+
#### The performance on [HELMET](https://github.com/princeton-nlp/HELMET) and [RULER](https://github.com/NVIDIA/RULER)
|
| 38 |
+
|
| 39 |
+
| Model | RULER Avg. | HELMET Avg. | HELMET Recall | HELMET RAG | HELMET ICL | HELMET Re-rank | HELMET LongQA |
|
| 40 |
+
|:-------------------------------|:-------:|:-------:|:------:|:-----:|:-----:|:-------:|:------:|
|
| 41 |
+
| [Llama-3-8B-NExtLong-512K-Instruct](https://huggingface.co/caskcsg/Llama-3-8B-NExtLong-512K-Instruct) | 88.00 | 59.92 | **98.63** | 62.70 | 81.00 | 26.41 | 30.89 |
|
| 42 |
+
| [Llama-3-8B-LongMagpie-512K-Instruct](https://huggingface.co/caskcsg/Llama-3-8B-LongMagpie-512K-Instruct) | **91.17** | 62.10 | 97.53 | 63.37 | **85.84** | 28.60 | 35.16 |
|
| 43 |
+
| [Llama-3-8B-LongMagpie-p-mix-512K-Instruct](https://huggingface.co/caskcsg/Llama-3-8B-LongMagpie-p-mix-512K-Instruct) | 89.70 | **62.11** | 95.96 | **64.17** | 85.12 | **29.61** | **35.71** |
|
| 44 |
+
|
| 45 |
+
|
| 46 |
+
|
| 47 |
+
#### The performance on [Longbench V2](https://github.com/THUDM/LongBench)
|
| 48 |
+
|
| 49 |
+
| Model | Overall (%) | Easy (%) | Hard (%) | Short (%) | Medium (%) | Long (%) |
|
| 50 |
+
|--------------------------------------------|-------------|----------|----------|-----------|------------|----------|
|
| 51 |
+
| [Llama-3-8B-NExtLong-512K-Instruct](https://huggingface.co/caskcsg/Llama-3-8B-NExtLong-512K-Instruct) | 30.8 | 33.9 | 28.9 | 37.8 | 27.4 | **25.9** |
|
| 52 |
+
| [Llama-3-8B-LongMagpie-512K-Instruct](https://huggingface.co/caskcsg/Llama-3-8B-LongMagpie-512K-Instruct) | **34.4**| **38.5** |**31.8**| **41.7** |33 |25 |
|
| 53 |
+
| [Llama-3-8B-LongMagpie-p-mix-512K-Instruct](https://huggingface.co/caskcsg/Llama-3-8B-LongMagpie-p-mix-512K-Instruct) | 33 | 35.9 |31.2 |37.2 |**34.9**| 22.2 |
|
| 54 |
+
|
| 55 |
+
|
| 56 |
+
|
| 57 |
+
|
| 58 |
+
|
| 59 |
+
|
| 60 |
+
|
| 61 |
+
#### The performance on Short-context Benchmarks
|
| 62 |
+
|
| 63 |
+
|
| 64 |
+
|
| 65 |
+
| Model | Avg. | Hel. | Lam. | AR-C. | AR-E. | PIQA | Win. | Logiqa | MMLU |
|
| 66 |
+
|----------------------------|-------|-----------|----------------|---------------|----------|-------|------------|--------|-------|
|
| 67 |
+
| [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | 0.6332 | 0.5773 | 0.7171 | 0.5316 | 0.8165 | 0.7889 | 0.7198 | 0.2765 | 0.6376 |
|
| 68 |
+
| [Llama-3-8B-NExtLong-512K-Instruct](https://huggingface.co/caskcsg/Llama-3-8B-NExtLong-512K-Instruct) | **0.6410** | **0.5953** | 0.7242 | 0.5188 | 0.8224 | **0.8079** | 0.7324 | **0.3041** | 0.6232 |
|
| 69 |
+
| [Llama-3-8B-LongMagpie-512K-Instruct](https://huggingface.co/caskcsg/Llama-3-8B-LongMagpie-512K-Instruct) | 0.6237 |0.5803 |0.7025 |0.4804| 0.8047| 0.7938 |0.7293| 0.278 |0.6209 |
|
| 70 |
+
| [Llama-3-8B-LongMagpie-p-mix-512K-Instruct](https://huggingface.co/caskcsg/Llama-3-8B-LongMagpie-p-mix-512K-Instruct) | **0.6410** | 0.5893 | **0.7355**| **0.5282**| **0.8279**| 0.8052| **0.734**| 0.2842| **0.6236** |
|
| 71 |
+
|
| 72 |
+
|
| 73 |
+
|
| 74 |
+
|
| 75 |
+
<a id="LongMagpie-datasets"></a>
|
| 76 |
+
|
| 77 |
+
## LongMagpie Datasets
|
| 78 |
+
|
| 79 |
+
<a id="datasets-list"></a>
|
| 80 |
+
|
| 81 |
+
### Datasets list
|
| 82 |
+
|
| 83 |
+
Our released datasets are listed as follows. All datasets are synthesized from the short-text datasets [fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu).
|
| 84 |
+
|
| 85 |
+
|
| 86 |
+
|
| 87 |
+
| Dataset | Description |
|
| 88 |
+
|:-------------------------------|:--------|
|
| 89 |
+
| [LongMagpie_singledoc_longcontext_dataset](https://huggingface.co/datasets/caskcsg/LongMagpie_singledoc_longcontext_dataset) | Our synthesized 450k raw text files(refer to [infer_demo.py](https://github.com/caskcsg/longcontext/tree/main/LongMagpie/longmagpie/infer_demo.py)). Each line of data contains context extracted from [fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu), query generated by LongMapgie and answer. |
|
| 90 |
+
| [LongMagpie_multidoc_longcontext_dataset](https://huggingface.co/datasets/caskcsg/LongMagpie_multidoc_longcontext_dataset) | Based on [LongMagpie_singledoc_longcontext_dataset](https://huggingface.co/datasets/caskcsg/LongMagpie_singledoc_longcontext_dataset), we used the MultiDoc method (refer to [multidoc_format.py](https://github.com/caskcsg/longcontext/tree/main/LongMagpie/longmagpie/multidoc_format.py)) to extend the context length and transformed it into SFT dialogue format. |
|
| 91 |
+
| [LongMagpie_64k_dataset](https://huggingface.co/datasets/caskcsg/LongMagpie_64k_dataset) | We tokenized [LongMagpie_multidoc_longcontext_dataset](https://huggingface.co/datasets/caskcsg/LongMagpie_multidoc_longcontext_dataset) and concatenated it to a length of 64k (refer to [concat script](https://github.com/caskcsg/longcontext/tree/main/LongMagpie/longmagpie/build_sft_data.py)), making it convenient to train using Document Mask technology. This dataset can be used to achieve the best long-text performance. |
|
| 92 |
+
| [LongMagpie_p-mix_64k_dataset](https://huggingface.co/datasets/caskcsg/LongMagpie_p-mix_64k_dataset) | To maintain short-text performance, we tokenized [LongMagpie_multidoc_longcontext_dataset](https://huggingface.co/datasets/caskcsg/LongMagpie_multidoc_longcontext_dataset) and mixed it with [UltraChat](https://huggingface.co/datasets/stingning/ultrachat) using the p-mix strategy, concatenating to a length of 64k (refer to [p-mix.py](https://github.com/caskcsg/longcontext/tree/main/LongMagpie/longmagpie/build_sft_data_p_mix.py)). This dataset can be used to achieve balanced long and short text performance. |
|
| 93 |
+
|
| 94 |
+
|
| 95 |
+
<a id="train-LongMagpie512K"></a>
|
| 96 |
+
|
| 97 |
+
## Train Llama-3-8B-LongMagpie-512K-Instruct
|
| 98 |
+
|
| 99 |
+
<a id="requirements"></a>
|
| 100 |
+
|
| 101 |
+
### Requirements
|
| 102 |
+
|
| 103 |
+
Run the following script to install the remaining dependencies and train the model.
|
| 104 |
+
|
| 105 |
+
```bash
|
| 106 |
+
pip install -r requirements.txt
|
| 107 |
+
```
|
| 108 |
+
|
| 109 |
+
### Train
|
| 110 |
+
|
| 111 |
+
```bash
|
| 112 |
+
bash train_sft.sh
|
| 113 |
+
```
|
| 114 |
+
|
| 115 |
+
|
| 116 |
+
<a id="evaluation"></a>
|
| 117 |
+
|
| 118 |
+
## Evaluation
|
| 119 |
+
|
| 120 |
+
Refer to the [HELMET](https://github.com/princeton-nlp/HELMET), [RULER](https://github.com/NVIDIA/RULER), and [Longbench V2](https://github.com/THUDM/LongBench) to evaluate the Instruct model.
|
| 121 |
+
|
| 122 |
+
|
| 123 |
+
<a id="build-long-data"></a>
|
| 124 |
+
|
| 125 |
+
## Build your long-context instruction data
|
| 126 |
+
|
| 127 |
+
|
| 128 |
+
### 1. Synthesizing Single-Document Q&A Data
|
| 129 |
+
|
| 130 |
+
Refer to [infer_demo.py](https://github.com/caskcsg/longcontext/tree/main/LongMagpie/longmagpie/infer_demo.py). Each line of data contains context extracted from [fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu), query generated by LongMapgie and answer.
|
| 131 |
+
|
| 132 |
+
|
| 133 |
+
```bash
|
| 134 |
+
python longmagpie/infer_demo.py
|
| 135 |
+
```
|
| 136 |
+
|
| 137 |
+
### 2. Synthesizing Multi-Document Q&A Data
|
| 138 |
+
|
| 139 |
+
Based on [LongMagpie_singledoc_longcontext_dataset](https://huggingface.co/datasets/caskcsg/LongMagpie_singledoc_longcontext_dataset), we used the MultiDoc method (refer to [multidoc_format.py](https://github.com/caskcsg/longcontext/tree/main/LongMagpie/longmagpie/multidoc_format.py)) to extend the context length and transformed it into SFT dialogue format.
|
| 140 |
+
|
| 141 |
+
```bash
|
| 142 |
+
python longmagpie/multidoc_format.py
|
| 143 |
+
```
|
| 144 |
+
|
| 145 |
+
|
| 146 |
+
### 3. Dataset Concatenation
|
| 147 |
+
|
| 148 |
+
Following [ProLong](https://github.com/princeton-nlp/ProLong), we concatenate the datasets to a fixed 64k context length and train using Document Mask technology.
|
| 149 |
+
|
| 150 |
+
#### 3.1 Concatenating Document Q&A Datasets Only
|
| 151 |
+
|
| 152 |
+
We tokenized [LongMagpie_multidoc_longcontext_dataset](https://huggingface.co/datasets/caskcsg/LongMagpie_multidoc_longcontext_dataset) and concatenated it to a length of 64k (refer to [build_sft_data.py](https://github.com/caskcsg/longcontext/tree/main/LongMagpie/longmagpie/build_sft_data.py)), making it convenient to train using Document Mask technology. This dataset can be used to achieve the best long-text performance.
|
| 153 |
+
|
| 154 |
+
```bash
|
| 155 |
+
python longmagpie/build_sft_data.py
|
| 156 |
+
```
|
| 157 |
+
|
| 158 |
+
#### 3.2 Using p-mix Strategy
|
| 159 |
+
|
| 160 |
+
To balance these capabilities, we introduce \textit{p}-Mix, a novel instruction data hybridization strategy. The core idea is twofold. First, to emulate the typical non-contextual start of general tasks, we sample a short-context instruction at the beginning of each training sequence. Second, we append subsequent data segments probabilistically to construct a mixed-context sequence up to length $L_{max}$. With probability $P_L$, a long-context instruction (generated by LongMagpie) is chosen; otherwise, with probability $1-P_L$, another short-context sample is chosen. This process repeats until approaching the target sequence length, ensuring each instance starts with a short, context-free instruction followed by a dynamically mixed sequence of long and short segments.
|
| 161 |
+
|
| 162 |
+
```bash
|
| 163 |
+
python longmagpie/build_sft_data_p_mix.py
|
| 164 |
+
```
|
| 165 |
+
|
| 166 |
+
|
| 167 |
+
<a id="bugs-or-questions"></a>
|
| 168 |
+
|
| 169 |
+
## Bugs or questions?
|
| 170 |
+
|
| 171 |
+
If you have any questions related to the code or the paper, feel free to email Chaochen (`[email protected]`) and XingWu (`[email protected]`). If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please try to specify the problem with details so we can help you better and quicker!
|
| 172 |
+
|
| 173 |
+
<!-- ## Citation
|
| 174 |
+
|
| 175 |
+
Please cite our paper if you use LongMagpie in your work:
|
| 176 |
+
|
| 177 |
+
```bibtex
|
| 178 |
+
|
| 179 |
+
``` -->
|