Commit
·
3ef9b51
1
Parent(s):
bf8d6f5
Update README.md and warn the user that I'm not the author.
README.md
CHANGED
|
@@ -1,3 +1,115 @@
|
|
| 1 |
-
---
|
| 2 |
-
license: apache-2.0
|
| 3 |
-
---
|
| 1 |
+
---
|
| 2 |
+
license: apache-2.0
|
| 3 |
+
---
|
| 4 |
+
|
| 5 |
+
<img src="https://cdn-uploads.huggingface.co/production/uploads/63ea0de943d976de6e4e54fb/-zXQ3G2iKCCAq6x8gPGm7.png" width="300" class="left"><img src="https://cdn-uploads.huggingface.co/production/uploads/63ea0de943d976de6e4e54fb/r1vY_i4DmL5shXAm_CMs9.png" width="400" class="center">
|
| 6 |
+
|
| 7 |
+
This is the Repo for the paper: [BARTScore: Evaluating Generated Text as Text Generation](https://arxiv.org/abs/2106.11520)
|
| 8 |
+
|
| 9 |
+
## Updates
|
| 10 |
+
- 2021.09.29 Paper gets accepted to NeurIPS 2021 :tada:
|
| 11 |
+
- 2021.08.18 Release code
|
| 12 |
+
- 2021.06.28 Release online evaluation [Demo](http://bartscore.sh/)
|
| 13 |
+
- 2021.06.25 Release online Explainable Leaderboard for [Meta-evaluation](http://explainaboard.nlpedia.ai/leaderboard/task-meval/index.php)
|
| 14 |
+
- 2021.06.22 Code will be released soon
|
| 15 |
+
|
| 16 |
+
## Background
|
| 17 |
+
There is a recent trend that leverages neural models for automated evaluation in different ways, as shown in Fig.1.
|
| 18 |
+
|
| 19 |
+
<img src="https://cdn-uploads.huggingface.co/production/uploads/63ea0de943d976de6e4e54fb/jfRv5wmLud1uYivH4ZG6c.png" width=650 class="left">
|
| 20 |
+
|
| 21 |
+
(a) **Evaluation as matching task.** Unsupervised matching metrics aim to measure the semantic equivalence between the reference and hypothesis by using token-level matching functions in distributed representation space (e.g. BERT) or discrete string space (e.g. ROUGE).
|
| 22 |
+
|
| 23 |
+
(b) **Evaluation as regression task.** Regression-based metrics (e.g. BLEURT) introduce a parameterized regression layer, which is learned in a supervised fashion to accurately predict human judgments.
|
| 24 |
+
|
| 25 |
+
(c) **Evaluation as ranking task.** Ranking-based metrics (e.g. COMET) aim to learn a scoring function that assigns a higher score to better hypotheses than to worse ones.
|
| 26 |
+
|
| 27 |
+
(d) **Evaluation as generation task.** In this work, we formulate evaluating generated text as a text generation task from pre-trained language models.
|
| 28 |
+
|
| 29 |
+
## Our Work
|
| 30 |
+
Basic requirements for all the libraries are in `requirements.txt`.
|
| 31 |
+
|
| 32 |
+
### Direct use
|
| 33 |
+
Our trained BARTScore (on ParaBank2) can be downloaded [here](https://drive.google.com/file/d/1_7JfF7KOInb7ZrxKHIigTMR4ChVET01m/view?usp=sharing). Example usage is shown below.
|
| 34 |
+
|
| 35 |
+
```python
|
| 36 |
+
# To use the CNNDM version BARTScore
|
| 37 |
+
>>> from bart_score import BARTScorer
|
| 38 |
+
>>> bart_scorer = BARTScorer(device='cuda:0', checkpoint='facebook/bart-large-cnn')
|
| 39 |
+
>>> bart_scorer.score(['This is interesting.'], ['This is fun.'], batch_size=4) # generation scores from the first list of texts to the second list of texts.
|
| 40 |
+
[out]
|
| 41 |
+
[-2.510652780532837]
|
| 42 |
+
|
| 43 |
+
# To use our trained ParaBank version BARTScore
|
| 44 |
+
>>> from bart_score import BARTScorer
|
| 45 |
+
>>> bart_scorer = BARTScorer(device='cuda:0', checkpoint='facebook/bart-large-cnn')
|
| 46 |
+
>>> bart_scorer.load(path='bart.pth')
|
| 47 |
+
>>> bart_scorer.score(['This is interesting.'], ['This is fun.'], batch_size=4)
|
| 48 |
+
[out]
|
| 49 |
+
[-2.336203098297119]
|
| 50 |
+
```
|
| 51 |
+
|
| 52 |
+
We also provide multi-reference support. Please make sure you have the same number of references for each test sample. The usage is shown below.
|
| 53 |
+
```python
|
| 54 |
+
>>> from bart_score import BARTScorer
|
| 55 |
+
>>> bart_scorer = BARTScorer(device='cuda:0', checkpoint='facebook/bart-large-cnn')
|
| 56 |
+
>>> srcs = ["I'm super happy today.", "This is a good idea."]
|
| 57 |
+
>>> tgts = [["I feel good today.", "I feel sad today."], ["Not bad.", "Sounds like a good idea."]] # List[List of references for each test sample]
|
| 58 |
+
>>> bart_scorer.multi_ref_score(srcs, tgts, agg="max", batch_size=4) # agg means aggregation, can be mean or max
|
| 59 |
+
[out]
|
| 60 |
+
[-2.5008113384246826, -1.626236081123352]
|
| 61 |
+
```
|
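The `agg` argument can be pictured as a per-sample reduction over one score per reference. The snippet below is only a minimal sketch of that idea, not the code inside `multi_ref_score`: the helper name `aggregate_scores` and the input numbers are hypothetical, and it assumes each test sample has already been scored against each of its references.

```python
from statistics import mean


def aggregate_scores(per_reference_scores, agg="max"):
    """Reduce one score per reference to one score per test sample.

    Hypothetical helper for illustration; not part of bart_score.
    per_reference_scores: List[List[float]], one inner list per test sample.
    """
    if agg not in ("mean", "max"):
        raise ValueError("agg must be 'mean' or 'max'")

    # Every test sample must come with the same number of references.
    counts = {len(scores) for scores in per_reference_scores}
    if len(counts) > 1:
        raise ValueError("each test sample needs the same number of references")

    reduce_fn = max if agg == "max" else mean
    return [reduce_fn(scores) for scores in per_reference_scores]


# Made-up log-likelihood scores: two test samples, two references each.
example_scores = [[-2.5, -3.0], [-1.6, -2.0]]
best_per_sample = aggregate_scores(example_scores, agg="max")
mean_per_sample = aggregate_scores(example_scores, agg="mean")
```

With `agg="max"` each sample keeps the score of its best-matching reference, while `agg="mean"` averages over all references, so a single poor reference pulls the score down.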
| 62 |
+
|
| 63 |
+
|
| 64 |
+
### Reproduce
|
| 65 |
+
To reproduce the results for each task, please see the `README.md` in each folder: `D2T` (data-to-text), `SUM` (summarization), `WMT` (machine translation). Once you get the scored pickle file in the right path (in each dataset folder), you can use them to conduct analysis.
|
| 66 |
+
|
| 67 |
+
For analysis, we provide `SUMStat`, `D2TStat` and `WMTStat` in `analysis.py`, which make it convenient to run analyses. An example of using `SUMStat` is shown below. For detailed usage, refer to `analysis.ipynb`.
|
| 68 |
+
|
| 69 |
+
```python
|
| 70 |
+
>>> from analysis import SUMStat
|
| 71 |
+
>>> stat = SUMStat('SUM/REALSumm/final_p.pkl')
|
| 72 |
+
>>> stat.evaluate_summary('litepyramid_recall')
|
| 73 |
+
|
| 74 |
+
[out]
|
| 75 |
+
Human metric: litepyramid_recall
|
| 76 |
+
metric spearman kendalltau
|
| 77 |
+
------------------------------------------------- ---------- ------------
|
| 78 |
+
rouge1_r 0.497526 0.407974
|
| 79 |
+
bart_score_cnn_hypo_ref_de_id est 0.49539 0.392728
|
| 80 |
+
bart_score_cnn_hypo_ref_de_Videlicet 0.491011 0.388237
|
| 81 |
+
...
|
| 82 |
+
```
|
| 83 |
+
|
| 84 |
+
### Train your custom BARTScore
|
| 85 |
+
If you want to train your custom BARTScore with paired data, we provide the scripts and detailed instructions in the `train` folder. Once you have your trained model (for example, in a `my_bartscore` folder), you can use your custom BARTScore as shown below.
|
| 86 |
+
|
| 87 |
+
```python
|
| 88 |
+
>>> from bart_score import BARTScorer
|
| 89 |
+
>>> bart_scorer = BARTScorer(device='cuda:0', checkpoint='my_bartscore')
|
| 90 |
+
>>> bart_scorer.score(['This is interesting.'], ['This is fun.'])
|
| 91 |
+
```
|
| 92 |
+
|
| 93 |
+
|
| 94 |
+
### Notes on use
|
| 95 |
+
Because we use the average log-likelihood of the target tokens, the calculated scores are negative (each token probability lies between 0 and 1, so its log is at most 0). A higher log-likelihood corresponds to a higher probability, so scores closer to 0 are better.
|
| 96 |
+
|
| 97 |
+
To give an example, if SummaryA gets a score of -1 while SummaryB gets a score of -100, this means that the model thinks SummaryA is better than SummaryB.
|
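The sign of the score follows directly from averaging log-probabilities. The snippet below is a small sketch of that arithmetic only: `average_log_likelihood` is a hypothetical helper, and the token probabilities are made up rather than produced by BART.

```python
import math


def average_log_likelihood(token_probs):
    """Mean of log p(token) over the target tokens.

    Hypothetical helper for illustration; probabilities must lie in (0, 1].
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    return sum(math.log(p) for p in token_probs) / len(token_probs)


# Made-up per-token probabilities, for illustration only.
likely_tokens = [0.9, 0.8, 0.95]
unlikely_tokens = [0.2, 0.1, 0.3]

score_a = average_log_likelihood(likely_tokens)    # closer to 0
score_b = average_log_likelihood(unlikely_tokens)  # more negative
```

Both values are negative, and the text whose tokens are more probable gets the score closer to 0, which is why a less negative score indicates the preferred text.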
| 98 |
+
## Bib
|
| 99 |
+
Please cite our work if you find it useful.
|
| 100 |
+
```
|
| 101 |
+
@inproceedings{NEURIPS2021_e4d2b6e6,
|
| 102 |
+
author = {Yuan, Weizhe and Neubig, Graham and Liu, Pengfei},
|
| 103 |
+
booktitle = {Advances in Neural Information Processing Systems},
|
| 104 |
+
editor = {M. Ranzato and A. Beygelzimer and Y. Dauphin and P.S. Liang and J. Wortman Vaughan},
|
| 105 |
+
pages = {27263--27277},
|
| 106 |
+
publisher = {Curran Associates, Inc.},
|
| 107 |
+
title = {BARTScore: Evaluating Generated Text as Text Generation},
|
| 108 |
+
url = {https://proceedings.neurips.cc/paper/2021/file/e4d2b6e6fdeca3e60e0f1a62fee3d9dd-Paper.pdf},
|
| 109 |
+
volume = {34},
|
| 110 |
+
year = {2021}
|
| 111 |
+
}
|
| 112 |
+
```
|
| 113 |
+
|
| 114 |
+
WARNING: This isn't the original owner's repository
|
| 115 |
+
[The original repository](https://github.com/neulab/BARTScore)
|