WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia

10 July 2019

Francisco Guzmán

Papers citing "WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia"

25 / 225 papers shown

Title
Announcing CzEng 2.0 Parallel Corpus with over 2 Gigawords | Tom Kocmi, Martin Popel, Ondrej Bojar | 11 38 0 | 06 Jul 2020
TICO-19: the Translation Initiative for Covid-19 | Antonios Anastasopoulos, A. Cattelan, Zi-Yi Dou, Marcello Federico, C. Federman, ..., Mengmeng Niu, A. Oktem, Eric Paquin, G. Tang, Sylwia Tur | 24 90 0 | 03 Jul 2020
Unsupervised Quality Estimation for Neural Machine Translation | M. Fomicheva, Shuo Sun, Lisa Yankovskaya, Frédéric Blain, Francisco Guzmán, Mark Fishel, Nikolaos Aletras, Vishrav Chaudhary, Lucia Specia | UQLM 20 184 0 | 21 May 2020
Parallel Corpus Filtering via Pre-trained Language Models | Boliang Zhang, Ajay Nagesh, Kevin Knight | 30 31 0 | 13 May 2020
Tailoring and Evaluating the Wikipedia for in-Domain Comparable Corpora Extraction | C. España-Bonet, Alberto Barrón-Cedeño, Lluís Marquez | 11 9 0 | 03 May 2020
Predicting Performance for Natural Language Processing Tasks | Mengzhou Xia, Antonios Anastasopoulos, Ruochen Xu, Yiming Yang, Graham Neubig | 25 59 0 | 02 May 2020
MUSS: Multilingual Unsupervised Sentence Simplification by Mining Paraphrases | Louis Martin, Angela Fan, Eric Villemonte de la Clergerie, Antoine Bordes, Benoît Sagot | 28 36 0 | 01 May 2020
A Call for More Rigor in Unsupervised Cross-lingual Learning | Mikel Artetxe, Sebastian Ruder, Dani Yogatama, Gorka Labaka, Eneko Agirre | 18 72 0 | 30 Apr 2020
Automatic Machine Translation Evaluation in Many Languages via Zero-Shot Paraphrasing | Brian Thompson, Matt Post | LRM 19 188 0 | 30 Apr 2020
Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation | Nils Reimers, Iryna Gurevych | 42 1,000 0 | 21 Apr 2020
SimAlign: High Quality Word Alignments without Parallel Training Data using Static and Contextualized Embeddings | Masoud Jalili Sabet, Philipp Dufter, François Yvon, Hinrich Schütze | 23 228 0 | 18 Apr 2020
Translation Artifacts in Cross-lingual Transfer Learning | Mikel Artetxe, Gorka Labaka, Eneko Agirre | 27 115 0 | 09 Apr 2020
Self-Induced Curriculum Learning in Self-Supervised Neural Machine Translation | Dana Ruiter, Josef van Genabith, C. España-Bonet | SSL 26 3 0 | 07 Apr 2020
Detecting and Understanding Generalization Barriers for Neural Machine Translation | Guanlin Li, Lemao Liu, Conghui Zhu, Tiejun Zhao, Shuming Shi | 28 0 0 | 05 Apr 2020
Machine Translation Pre-training for Data-to-Text Generation -- A Case Study in Czech | Mihir Kale, Scott Roy | 14 14 0 | 05 Apr 2020
PMIndia -- A Collection of Parallel Corpora of Languages of India | Barry Haddow, Faheem Kirefu | 19 102 0 | 27 Jan 2020
A Comprehensive Survey of Multilingual Neural Machine Translation | Raj Dabre, Chenhui Chu, Anoop Kunchukuttan | LRM 36 33 0 | 04 Jan 2020
Automatic Spanish Translation of the SQuAD Dataset for Multilingual Question Answering | C. Carrino, Marta R. Costa-jussá, José A. R. Fonollosa | 6 88 0 | 11 Dec 2019
GeBioToolkit: Automatic Extraction of Gender-Balanced Multilingual Corpus of Wikipedia Biographies | Marta R. Costa-jussá, P. Lin, C. España-Bonet | SyDa 31 24 0 | 10 Dec 2019
JParaCrawl: A Large Scale Web-Based English-Japanese Parallel Corpus | Makoto Morishita, Jun Suzuki, Masaaki Nagata | LRM 38 64 0 | 25 Nov 2019
CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB | Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave, Armand Joulin | 33 256 0 | 10 Nov 2019
CCAligned: A Massive Collection of Cross-Lingual Web-Document Pairs | Ahmed El-Kishky, Vishrav Chaudhary, Francisco Guzman, Philipp Koehn | 28 198 0 | 10 Nov 2019
Should All Cross-Lingual Embeddings Speak English? | Antonios Anastasopoulos, Graham Neubig | 19 31 0 | 08 Nov 2019
LibriVoxDeEn: A Corpus for German-to-English Speech Translation and German Speech Recognition | Benjamin Beilharz, Xin Sun, Sariya Karimova, Stefan Riezler | 8 28 0 | 17 Oct 2019
MaSS: A Large and Clean Multilingual Corpus of Sentence-aligned Spoken Utterances Extracted from the Bible | Marcely Zanon Boito, William N. Havard, Mahault Garnerin, Éric Le Ferrand, Laurent Besacier | 32 47 0 | 30 Jul 2019