ParaCotta: Synthetic Multilingual Paraphrase Corpora from the Most Diverse Translation Sample Pair
Abstract
A synthetic parallel paraphrase corpus across 17 languages is generated using monolingual data and neural machine translation, resulting in semantically similar and lexically diverse paraphrase pairs.
We release our synthetic parallel paraphrase corpus across 17 languages: Arabic, Catalan, Czech, German, English, Spanish, Estonian, French, Hindi, Indonesian, Italian, Dutch, Romanian, Russian, Swedish, Vietnamese, and Chinese. Our method relies only on monolingual data and a neural machine translation system to generate paraphrases, hence simple to apply. We generate multiple translation samples using beam search and choose the most lexically diverse pair according to their sentence BLEU. We compare our generated corpus with the ParaBank2. According to our evaluation, our synthetic paraphrase pairs are semantically similar and lexically diverse.
Models citing this paper 0
No model linking this paper
Datasets citing this paper 1
Spaces citing this paper 1
Collections including this paper 0
No Collection including this paper