The model was trained to predict whether an English sentence is formal or informal.
Base model: roberta-base
Datasets: GYAFC (Rao and Tetreault, 2018) and the online formality corpus (Pavlick and Tetreault, 2016).
Data augmentation: converting texts to upper or lower case, removing all punctuation, and adding a period at the end of a sentence. Without it, the model relies too heavily on punctuation and capitalization and pays too little attention to other features.
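The augmentation described above can be sketched as follows. This is a minimal illustration, not the authors' training code: the helper name `augment`, the use of `string.punctuation` as the set of punctuation marks, and the choice to return every variant at once are assumptions.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def augment(text: str) -> list[str]:
    """Return surface-form variants of `text` (hypothetical helper).

    Variants: upper case, lower case, punctuation removed,
    and a period appended when the text has no final punctuation.
    """
    stripped = text.strip()
    variants = [
        stripped.upper(),
        stripped.lower(),
        stripped.translate(_PUNCT_TABLE),
    ]
    if stripped and stripped[-1] not in ".!?":
        variants.append(stripped + ".")

    # Drop duplicates and variants identical to the input, keeping order.
    seen = {stripped}
    result = []
    for variant in variants:
        if variant not in seen:
            seen.add(variant)
            result.append(variant)
    return result


if __name__ == "__main__":
    for variant in augment("Hey, what's up"):
        print(variant)
```

Each variant keeps the formality label of the original sentence, so the classifier sees the same label with and without capitalization and punctuation cues.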
Loss: binary classification (on GYAFC), in-batch ranking (on P&T data).
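The card does not give the exact form of either loss, so the following is only a sketch under stated assumptions: binary cross-entropy on a single logit for the classification part, and a pairwise logistic loss over all pairs in a batch for the ranking part. The function names and the pairwise formulation are hypothetical and may differ from what was used in training.

```python
import math


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def binary_cross_entropy(logits, labels):
    """Mean binary cross-entropy on raw logits (labels are 0 or 1)."""
    total = 0.0
    for z, y in zip(logits, labels):
        # -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))] = softplus(z) - y*z
        total += _softplus(z) - y * z
    return total / len(logits)


def in_batch_ranking_loss(scores, targets):
    """Assumed pairwise logistic ranking loss within one batch.

    For every pair (i, j) whose target formality satisfies
    targets[i] > targets[j], penalize the model when scores[i]
    is not larger than scores[j].
    """
    total, n_pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if targets[i] > targets[j]:
                total += _softplus(-(scores[i] - scores[j]))
                n_pairs += 1
    return total / n_pairs if n_pairs else 0.0


if __name__ == "__main__":
    targets = [1.5, 0.1, -1.2]  # made-up continuous formality ratings
    print(in_batch_ranking_loss([2.0, 0.0, -2.0], targets))  # well ordered
    print(in_batch_ranking_loss([-2.0, 0.0, 2.0], targets))  # reversed
```

A ranking objective of this kind only uses the relative order of the annotations inside a batch, which suits continuous formality ratings better than a hard formal/informal threshold.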
Performance metrics on the test data:
dataset | ROC AUC | precision | recall | fscore | accuracy | Spearman |
---|---|---|---|---|---|---|
GYAFC | 0.9779 | 0.90 | 0.91 | 0.90 | 0.9087 | 0.8233 |
GYAFC normalized (lowercase + remove punct.) | 0.9234 | 0.85 | 0.81 | 0.82 | 0.8218 | 0.7294 |
P&T subset | Spearman R |
---|---|
news | 0.4003 |
answers | 0.7500 |
blog | 0.7334 |
email | 0.7606 |
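For readers who want to compute comparable numbers on their own data, here is a minimal pure-Python sketch of two of the reported metrics, Spearman correlation and ROC AUC. It is an illustration and not the evaluation script behind the tables above; the function names are assumptions, and in practice a library such as SciPy or scikit-learn would normally be used.

```python
import math


def average_ranks(values):
    """1-based ranks of `values`; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def roc_auc(scores, labels):
    """ROC AUC via the rank-sum (Mann-Whitney U) formulation; labels are 0/1."""
    ranks = average_ranks(scores)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


if __name__ == "__main__":
    # Made-up model scores and labels, for illustration only.
    print(spearman([0.1, 0.4, 0.9], [-1.0, 0.2, 1.3]))
    print(roc_auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
```

Spearman correlation depends only on how the model orders sentences, so it can be computed against continuous formality ratings without choosing a decision threshold.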
Citation
If you are using the model in your research, please cite the following paper where it was introduced:
@InProceedings{10.1007/978-3-031-35320-8_4,
author="Babakov, Nikolay
and Dale, David
and Gusev, Ilya
and Krotova, Irina
and Panchenko, Alexander",
editor="M{\'e}tais, Elisabeth
and Meziane, Farid
and Sugumaran, Vijayan
and Manning, Warren
and Reiff-Marganiec, Stephan",
title="Don't Lose the Message While Paraphrasing: A Study on Content Preserving Style Transfer",
booktitle="Natural Language Processing and Information Systems",
year="2023",
publisher="Springer Nature Switzerland",
address="Cham",
pages="47--61",
isbn="978-3-031-35320-8"
}
Licensing Information
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.