Update README.md
README.md
CHANGED
@@ -107,22 +107,35 @@ print(tokenizer.decode(decoded_ids[0], skip_special_tokens=True))

## Data

-<!-- For SAS-baseline, we finetuned Flan-T5 model with the Scientific Abstract-Significance (SAS) corpus.
-
-| Scientific Abstract-Significance | # Training/Dev/Test Samples | # Training Tokens | # Validation Tokens | # Test Tokens | Automated Readability Index (std.) |
-|----------------------------------|-----------------------------|-------------------|---------------------|---------------|------------------------------------|
-| Abstract                         | 3030/200/200                | 707,071           | 45,697              | 46,985        | 18.68 (2.85)                       |
-| Significance                     | 3030/200/200                | 375,433           | 24,901              | 24,426        | 17.89 (3.05)                       |
--->

+| Corpus                           | # Training/Dev/Test Samples | # Training Tokens (source / target) | # Validation Tokens (source / target) | # Test Tokens (source / target) | Note |
+|----------------------------------|-----------------------------|-------------------------------------|---------------------------------------|---------------------------------|------|
+| Scientific Abstract-Significance | 3,030/200/200               | 707,071 / 375,433                   | 45,697 / 24,901                       | 46,985 / 24,426                 |      |
+| Editor Abstract                  | 732/91/92                   | 154,808 / 194,721                   | 19,675 / 24,421                       | 19,539 / 24,332                 |      |
+| Wiki Auto                        | 28,364/1,000/1,000          | 18,239,990 / 12,547,272             | 643,157 / 444,034                     | 642,549 / 444,883               | We used the ACL version, adopted from Hugging Face Datasets. The validation and test samples are split from the corpus and kept frozen. |
+| CNN/DailyMail                    | 287,113/13,368/11,490       | -                                   | -                                     | -                               | We used the 2.0 version, adopted from Hugging Face Datasets. |
+
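The two public corpora in the table can be pulled from Hugging Face Datasets roughly as follows. This is a minimal sketch: the `wiki_auto`/`auto_acl` and `cnn_dailymail`/`2.0.0` dataset IDs, the split names, the seed, and the 1,000/1,000 held-out carve-out are assumptions based on the notes above, not the repository's actual data pipeline, and depending on your `datasets` version `trust_remote_code` may be unneeded or unsupported.

```python
# Sketch: fetch the two public corpora referenced in the table above.
from datasets import load_dataset

# Wiki Auto, ACL version. It ships as a single split, so frozen validation
# and test sets (1,000 samples each here) have to be carved out once.
wiki = load_dataset("wiki_auto", "auto_acl", trust_remote_code=True)["full"]
splits = wiki.train_test_split(test_size=2_000, seed=42)
held_out = splits["test"].train_test_split(test_size=1_000, seed=42)
wiki_train, wiki_dev, wiki_test = splits["train"], held_out["train"], held_out["test"]

# CNN/DailyMail 2.0 already provides train/validation/test splits.
cnn_dm = load_dataset("cnn_dailymail", "2.0.0")
print(len(wiki_train), len(cnn_dm["train"]), len(cnn_dm["validation"]), len(cnn_dm["test"]))
```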


## Setup

+We finetuned the base model (flan-t5-large) on multiple relevant tasks with the standard language modeling loss; the tuning process has two steps, described below. During training, the source text of each task is prepended with a task-specific instruction and mapped to the corresponding target text. For example, "simplify: " is prepended to a Wiki Auto source text, and the whole sequence is fed into the model, which is trained to produce the corresponding simple-wiki target.
+
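For illustration, here is a minimal preprocessing sketch of that instruction-prepending step, assuming the Hugging Face `transformers` tokenizer for `google/flan-t5-large`; the helper name, length limits, and example texts are ours, not the repository's training code.

```python
# Sketch: prepend a task-specific instruction to the source text and build
# seq2seq features for the standard language-modeling (cross-entropy) loss.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")

def build_features(source: str, target: str, instruction: str,
                   max_source_len: int = 1024, max_target_len: int = 512) -> dict:
    """Map one (source, target) pair to model inputs; `instruction` is e.g. "simplify: "."""
    features = tokenizer(instruction + source, max_length=max_source_len, truncation=True)
    labels = tokenizer(text_target=target, max_length=max_target_len, truncation=True)
    features["labels"] = labels["input_ids"]
    return features

# Example: a sentence lined up with its simplified counterpart.
example = build_features(
    source="The committee convened to deliberate on the proposed amendments.",
    target="The committee met to talk about the suggested changes.",
    instruction="simplify: ",
)
```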
+| Task                               | Corpus                           | Instruction                                | Optimal samples |
+|------------------------------------|----------------------------------|--------------------------------------------|-----------------|
+| Scientific Abstract Simplification | Scientific Abstract-Significance | "summarize, simplify, and contextualize: " | 39,200          |
+| Recontextualization                | Editor Abstract                  | "contextualize: "                          | 2,200           |
+| Simplification                     | Wiki Auto                        | "simplify: "                               | 57,000          |
+| Summarization                      | CNN/DailyMail                    | "summarize: "                              | 165,000         |
+| Total                              | Challenge-proportional Mixture   | n/a                                        | 263,400         |
+
+
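The instruction strings in the table map onto a small lookup that the preprocessing sketch above could use per task; the short task keys are illustrative.

```python
# Task-specific instruction prefixes, taken verbatim from the table above.
INSTRUCTIONS = {
    "sas": "summarize, simplify, and contextualize: ",  # Scientific Abstract Simplification
    "editor": "contextualize: ",                        # Recontextualization
    "wiki_auto": "simplify: ",                          # Simplification
    "cnn_dm": "summarize: ",                            # Summarization
}
```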
+- Multi-instruction tuning: In this stage, we first created a task mixture with the "challenge-proportional mixing" method (a sketch follows this list). In a separate pilot study, we finetuned the base model on each task alone and recorded the number of training samples consumed when the validation loss started to rise; this is the "Optimal samples" count in the table above. We mixed the samples of each task in proportion to its optimal count, and a corpus is exhausted before upsampling if its total size is smaller than its optimal count. We then finetuned on the resulting mixture (263,400 samples) with the aforementioned instruction templates.
+
+- Retuning: In this stage, we continued finetuning the checkpoint solely on the Scientific Abstract-Significance corpus until the lowest validation loss was observed.
+
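A minimal sketch of the challenge-proportional mixing described in the first step, under the assumption that each corpus is available as a list of preprocessed samples; the function name, task keys, and seed are illustrative.

```python
import random

# Optimal sample counts from the pilot runs (see the table above).
OPTIMAL_SAMPLES = {"sas": 39_200, "editor": 2_200, "wiki_auto": 57_000, "cnn_dm": 165_000}

def challenge_proportional_mixture(corpora: dict, seed: int = 42) -> list:
    """Mix tasks in proportion to their optimal sample counts.

    A corpus is exhausted (used in full, without upsampling) whenever it
    holds fewer samples than its optimal count.
    """
    rng = random.Random(seed)
    mixture = []
    for task, samples in corpora.items():
        budget = OPTIMAL_SAMPLES[task]
        if len(samples) <= budget:
            chosen = list(samples)                # exhaust the smaller corpus
        else:
            chosen = rng.sample(samples, budget)  # subsample down to the budget
        mixture.extend(chosen)
    rng.shuffle(mixture)
    return mixture                                # roughly 263,400 samples in total
```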
+The multi-instruction tuning and the retuning took roughly 63 hours and 8 hours, respectively, on two NVIDIA RTX A5000 GPUs (24 GB memory each). We saved the checkpoint with the lowest validation loss for inference. We used the AdamW optimizer with a learning rate of 3e-5 and a fully sharded data parallel (FSDP) strategy in both training stages.
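The description above maps onto Hugging Face `Seq2SeqTrainingArguments` roughly as in the sketch below. Only the AdamW optimizer, the 3e-5 learning rate, FSDP, and best-checkpoint selection come from the text; the output path, step counts, and evaluation cadence are placeholders, and some argument names (e.g. `eval_strategy` vs. the older `evaluation_strategy`) differ across `transformers` versions.

```python
# Sketch: training configuration implied by the description above.
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")

args = Seq2SeqTrainingArguments(
    output_dir="checkpoints/multi-instruction-tuning",  # placeholder path
    learning_rate=3e-5,                  # from the README
    optim="adamw_torch",                 # AdamW optimizer
    fsdp="full_shard auto_wrap",         # fully sharded data parallel across the two GPUs
    eval_strategy="steps",               # named `evaluation_strategy` in older transformers
    eval_steps=500,
    save_steps=500,
    save_total_limit=2,
    load_best_model_at_end=True,         # keep the lowest-validation-loss checkpoint
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

# trainer = Seq2SeqTrainer(model=model, args=args,
#                          train_dataset=train_mixture, eval_dataset=dev_set)
# trainer.train()
```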


# Evaluation