Update README.md
README.md
CHANGED
@@ -107,22 +107,35 @@ print(tokenizer.decode(decoded_ids[0], skip_special_tokens=True))

## Data

-<!-- For SAS-baseline, we finetuned Flan-T5 model with the Scientific Abstract-Significance (SAS) corpus.
-
-| Scientific Abstract-Significance | # Training/Dev/Test Samples | # Training Tokens | # Validation Tokens | # Test Tokens | Automated Readability Index (std.) |
-|----------------------------------|-----------------------------|-------------------|---------------------|---------------|------------------------------------|
-| Abstract                         | 3030/200/200                | 707,071           | 45,697              | 46,985        | 18.68 (2.85)                       |
-| Significance                     | 3030/200/200                | 375,433           | 24,901              | 24,426        | 17.89 (3.05)                       |
--->

+| Corpus                           | # Training/Dev/Test Samples | # Training Tokens (source / target) | # Validation Tokens (source / target) | # Test Tokens (source / target) | Note |
+|----------------------------------|-----------------------------|-------------------------------------|---------------------------------------|---------------------------------|------|
+| Scientific Abstract-Significance | 3,030/200/200               | 707,071 / 375,433                   | 45,697 / 24,901                       | 46,985 / 24,426                 |      |
+| Editor Abstract                  | 732/91/92                   | 154,808 / 194,721                   | 19,675 / 24,421                       | 19,539 / 24,332                 |      |
+| Wiki Auto                        | 28,364/1,000/1,000          | 18,239,990 / 12,547,272             | 643,157 / 444,034                     | 642,549 / 444,883               | We used the ACL version, adopted from Hugging Face Datasets. The validation and test samples are split from the corpus and kept frozen. |
+| CNN/DailyMail                    | 287,113/13,368/11,490       | -                                   | -                                     | -                               | We used the 2.0 version, adopted from Hugging Face Datasets. |
+
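The two public corpora in the table can be pulled from Hugging Face Datasets roughly as follows. This is a minimal sketch: the `wiki_auto`/`auto_acl` and `cnn_dailymail`/`2.0.0` dataset IDs, the split names, the seed, and the 1,000/1,000 held-out carve-out are assumptions based on the notes above, not the repository's actual data pipeline, and depending on your `datasets` version `trust_remote_code` may be unneeded or unsupported.

```python
# Sketch: fetch the two public corpora referenced in the table above.
from datasets import load_dataset

# Wiki Auto, ACL version. It ships as a single split, so frozen validation
# and test sets (1,000 samples each here) have to be carved out once.
wiki = load_dataset("wiki_auto", "auto_acl", trust_remote_code=True)["full"]
splits = wiki.train_test_split(test_size=2_000, seed=42)
held_out = splits["test"].train_test_split(test_size=1_000, seed=42)
wiki_train, wiki_dev, wiki_test = splits["train"], held_out["train"], held_out["test"]

# CNN/DailyMail 2.0 already provides train/validation/test splits.
cnn_dm = load_dataset("cnn_dailymail", "2.0.0")
print(len(wiki_train), len(cnn_dm["train"]), len(cnn_dm["validation"]), len(cnn_dm["test"]))
```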


## Setup

+We finetuned the base model (flan-t5-large) on multiple relevant tasks with the standard language modeling loss; the tuning process has two steps, described below. During training, the source text of each task is prepended with a task-specific instruction and mapped to the corresponding target text. For example, "simplify: " is prepended to a Wiki Auto source text, and the whole sequence is fed into the model, which is trained to produce the corresponding simple-wiki target.
+
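For illustration, here is a minimal preprocessing sketch of that instruction-prepending step, assuming the Hugging Face `transformers` tokenizer for `google/flan-t5-large`; the helper name, length limits, and example texts are ours, not the repository's training code.

```python
# Sketch: prepend a task-specific instruction to the source text and build
# seq2seq features for the standard language-modeling (cross-entropy) loss.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")

def build_features(source: str, target: str, instruction: str,
                   max_source_len: int = 1024, max_target_len: int = 512) -> dict:
    """Map one (source, target) pair to model inputs; `instruction` is e.g. "simplify: "."""
    features = tokenizer(instruction + source, max_length=max_source_len, truncation=True)
    labels = tokenizer(text_target=target, max_length=max_target_len, truncation=True)
    features["labels"] = labels["input_ids"]
    return features

# Example: a sentence lined up with its simplified counterpart.
example = build_features(
    source="The committee convened to deliberate on the proposed amendments.",
    target="The committee met to talk about the suggested changes.",
    instruction="simplify: ",
)
```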
+| Task                               | Corpus                           | Instruction                                | Optimal samples |
+|------------------------------------|----------------------------------|--------------------------------------------|-----------------|
+| Scientific Abstract Simplification | Scientific Abstract-Significance | "summarize, simplify, and contextualize: " | 39,200          |
+| Recontextualization                | Editor Abstract                  | "contextualize: "                          | 2,200           |
+| Simplification                     | Wiki Auto                        | "simplify: "                               | 57,000          |
+| Summarization                      | CNN/DailyMail                    | "summarize: "                              | 165,000         |
+| Total                              | Challenge-proportional Mixture   | n/a                                        | 263,400         |
+
+
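The instruction strings in the table map onto a small lookup that the preprocessing sketch above could use per task; the short task keys are illustrative.

```python
# Task-specific instruction prefixes, taken verbatim from the table above.
INSTRUCTIONS = {
    "sas": "summarize, simplify, and contextualize: ",  # Scientific Abstract Simplification
    "editor": "contextualize: ",                        # Recontextualization
    "wiki_auto": "simplify: ",                          # Simplification
    "cnn_dm": "summarize: ",                            # Summarization
}
```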
+- Multi-instruction tuning: In this stage, we first created a task mixture with the "challenge-proportional mixing" method (a sketch follows this list). In a separate pilot study, we finetuned the base model on each task alone and recorded the number of training samples consumed when the validation loss started to rise; this is the "Optimal samples" count in the table above. We mixed the samples of each task in proportion to its optimal count, and a corpus is exhausted before upsampling if its total size is smaller than its optimal count. We then finetuned on the resulting mixture (263,400 samples) with the aforementioned instruction templates.
+
+- Retuning: In this stage, we continued finetuning the checkpoint solely on the Scientific Abstract-Significance corpus until the lowest validation loss was observed.
+
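A minimal sketch of the challenge-proportional mixing described in the first step, under the assumption that each corpus is available as a list of preprocessed samples; the function name, task keys, and seed are illustrative.

```python
import random

# Optimal sample counts from the pilot runs (see the table above).
OPTIMAL_SAMPLES = {"sas": 39_200, "editor": 2_200, "wiki_auto": 57_000, "cnn_dm": 165_000}

def challenge_proportional_mixture(corpora: dict, seed: int = 42) -> list:
    """Mix tasks in proportion to their optimal sample counts.

    A corpus is exhausted (used in full, without upsampling) whenever it
    holds fewer samples than its optimal count.
    """
    rng = random.Random(seed)
    mixture = []
    for task, samples in corpora.items():
        budget = OPTIMAL_SAMPLES[task]
        if len(samples) <= budget:
            chosen = list(samples)                # exhaust the smaller corpus
        else:
            chosen = rng.sample(samples, budget)  # subsample down to the budget
        mixture.extend(chosen)
    rng.shuffle(mixture)
    return mixture                                # roughly 263,400 samples in total
```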
+The multi-instruction tuning and the retuning took roughly 63 hours and 8 hours, respectively, on two NVIDIA RTX A5000 GPUs (24 GB memory each). We saved the checkpoint with the lowest validation loss for inference. We used the AdamW optimizer with a learning rate of 3e-5 and a fully sharded data parallel (FSDP) strategy in both training stages.
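The description above maps onto Hugging Face `Seq2SeqTrainingArguments` roughly as in the sketch below. Only the AdamW optimizer, the 3e-5 learning rate, FSDP, and best-checkpoint selection come from the text; the output path, step counts, and evaluation cadence are placeholders, and some argument names (e.g. `eval_strategy` vs. the older `evaluation_strategy`) differ across `transformers` versions.

```python
# Sketch: training configuration implied by the description above.
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")

args = Seq2SeqTrainingArguments(
    output_dir="checkpoints/multi-instruction-tuning",  # placeholder path
    learning_rate=3e-5,                  # from the README
    optim="adamw_torch",                 # AdamW optimizer
    fsdp="full_shard auto_wrap",         # fully sharded data parallel across the two GPUs
    eval_strategy="steps",               # named `evaluation_strategy` in older transformers
    eval_steps=500,
    save_steps=500,
    save_total_limit=2,
    load_best_model_at_end=True,         # keep the lowest-validation-loss checkpoint
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

# trainer = Seq2SeqTrainer(model=model, args=args,
#                          train_dataset=train_mixture, eval_dataset=dev_set)
# trainer.train()
```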


# Evaluation