ParaCotta: Synthetic Multilingual Paraphrase Corpora from the Most Diverse Translation Sample Pair
Alham Fikri Aji
Tirana Noor Fatyanosa
Radityo Eko Prasojo
Philip Arthur
Suci Fitriany
Salma Qonitah
Nadhifatuz Zulfa
Tomi Santoso
Mahendra Data

Abstract
We release a synthetic parallel paraphrase corpus covering 17 languages: Arabic, Catalan, Czech, German, English, Spanish, Estonian, French, Hindi, Indonesian, Italian, Dutch, Romanian, Russian, Swedish, Vietnamese, and Chinese. Our method relies only on monolingual data and a neural machine translation system to generate paraphrases, which makes it simple to apply. We generate multiple translation samples using beam search and choose the most lexically diverse pair according to their sentence BLEU. We compare our generated corpus with ParaBank2. According to our evaluation, our synthetic paraphrase pairs are semantically similar and lexically diverse.
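The pair-selection step can be illustrated with a minimal sketch. It assumes that "most lexically diverse" means the pair of translation samples with the lowest sentence BLEU against each other. The add-one-smoothed BLEU below is a simplified stand-in rather than the metric implementation used in the paper, and the names `sentence_bleu` and `most_diverse_pair`, the whitespace tokenization, and the averaging of both BLEU directions are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_n=4):
    """Simplified sentence BLEU with add-one smoothing (illustrative only)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp or not ref:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipped n-gram matches: Counter intersection keeps the minimum count.
        overlap = sum((hyp_counts & ref_counts).values())
        total = sum(hyp_counts.values())
        log_precision += math.log((overlap + 1) / (total + 1)) / max_n
    # Brevity penalty for hypotheses shorter than the reference.
    brevity = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return brevity * math.exp(log_precision)


def most_diverse_pair(samples):
    """Return the two samples with the lowest mutual (symmetrized) BLEU."""
    def symmetric_bleu(pair):
        a, b = pair
        return (sentence_bleu(a, b) + sentence_bleu(b, a)) / 2

    return min(combinations(samples, 2), key=symmetric_bleu)


# Hypothetical beam-search outputs for one source sentence.
samples = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "a feline was sitting on the rug",
]
first, second = most_diverse_pair(samples)
```

Averaging BLEU in both directions is a design choice in this sketch so that the score does not depend on which sample is treated as the reference.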