1911.06154
CCAligned: A Massive Collection of Cross-Lingual Web-Document Pairs
10 November 2019
Ahmed El-Kishky
Vishrav Chaudhary
Francisco Guzmán
Philipp Koehn
Papers citing
"CCAligned: A Massive Collection of Cross-Lingual Web-Document Pairs"
40 / 40 papers shown
Title
A kinetic-based regularization method for data science applications
Abhisek Ganguly
Alessandro Gabbana
Vybhav Rao
Sauro Succi
Santosh Ansumali
57
0
0
06 Mar 2025
NusaAksara: A Multimodal and Multilingual Benchmark for Preserving Indonesian Indigenous Scripts
Muhammad Farid Adilazuarda
M. Wijanarko
Lucky Susanto
Khumaisa Nur'aini
Derry Wijaya
Alham Fikri Aji
57
0
0
25 Feb 2025
Multilingual Machine Translation with Open Large Language Models at Practical Scale: An Empirical Study
Menglong Cui
Pengzhi Gao
Wei Liu
Jian Luan
Bin Wang
LRM
45
2
0
04 Feb 2025
How to Learn in a Noisy World? Self-Correcting the Real-World Data Noise in Machine Translation
Yan Meng
Di Wu
Christof Monz
36
1
0
02 Jul 2024
Charles Translator: A Machine Translation System between Ukrainian and Czech
Martin Popel
Lucie Poláková
Michal Novák
Jindřich Helcl
Jindřich Libovický
Pavel Straňák
Tomáš Krabač
Jaroslava Hlaváčová
Mariia Anisimova
Tereza Chlaňová
27
0
0
10 Apr 2024
GATE X-E: A Challenge Set for Gender-Fair Translations from Weakly-Gendered Languages
Spencer Rarrick
Ranjita Naik
Sundar Poudel
Vishal Chowdhary
39
1
0
22 Feb 2024
Stolen Subwords: Importance of Vocabularies for Machine Translation Model Stealing
Vilém Zouhar
AAML
40
0
0
29 Jan 2024
Mitigating Data Imbalance and Representation Degeneration in Multilingual Machine Translation
Wen Lai
Alexandra Chronopoulou
Alexander Fraser
40
5
0
22 May 2023
Train Global, Tailor Local: Minimalist Multilingual Translation into Endangered Languages
Zhong Zhou
Jan Niehues
Alexander Waibel
37
0
0
05 May 2023
Improving Cross-lingual Information Retrieval on Low-Resource Languages via Optimal Transport Distillation
Zhiqi Huang
Puxuan Yu
James Allan
VLM
38
26
0
29 Jan 2023
Beyond Triplet: Leveraging the Most Data for Multimodal Machine Translation
Yaoming Zhu
Zewei Sun
Shanbo Cheng
Yuyang Huang
Liwei Wu
Mingxuan Wang
28
10
0
20 Dec 2022
GanLM: Encoder-Decoder Pre-training with an Auxiliary Discriminator
Jian Yang
Shuming Ma
Li Dong
Shaohan Huang
Haoyang Huang
Yuwei Yin
Dongdong Zhang
Liqun Yang
Furu Wei
Zhoujun Li
SyDa
AI4CE
34
25
0
20 Dec 2022
Advancing Multilingual Pre-training: TRIP Triangular Document-level Pre-training for Multilingual Language Models
Hongyuan Lu
Haoyang Huang
Shuming Ma
Dongdong Zhang
W. Lam
Furu Wei
32
4
0
15 Dec 2022
Frustratingly Easy Label Projection for Cross-lingual Transfer
Yang Chen
Chao Jiang
Alan Ritter
Wei Xu
27
31
0
28 Nov 2022
Learning an Artificial Language for Knowledge-Sharing in Multilingual Translation
Danni Liu
Jan Niehues
21
5
0
02 Nov 2022
SMaLL-100: Introducing Shallow Multilingual Machine Translation Model for Low-Resource Languages
Alireza Mohammadshahi
Vassilina Nikoulina
Alexandre Berard
Caroline Brun
James Henderson
Laurent Besacier
VLM
MoE
LRM
29
20
0
20 Oct 2022
Separating Grains from the Chaff: Using Data Filtering to Improve Multilingual Translation for Low-Resourced African Languages
Idris Abdulmumin
Michael Beukman
Jesujoba Oluwadara Alabi
Chris C. Emezue
Everlyn Asiko
...
Shamsuddeen Hassan Muhammad
Mofetoluwa Adeyemi
Oreen Yousuf
Sahib Singh
T. Gwadabe
34
8
0
19 Oct 2022
MTet: Multi-domain Translation for English and Vietnamese
C. Ngo
Trieu H. Trinh
Long Phan
H. Tran
Tai Dang
Hieu Duy Nguyen
Minh Le Nguyen
Minh-Thang Luong
VLM
42
8
0
11 Oct 2022
Language Varieties of Italy: Technology Challenges and Opportunities
Alan Ramponi
27
7
0
20 Sep 2022
esCorpius: A Massive Spanish Crawling Corpus
Asier Gutiérrez-Fandiño
David Pérez-Fernández
Jordi Armengol-Estapé
D. Griol
Z. Callejas
51
2
0
30 Jun 2022
What Do Compressed Multilingual Machine Translation Models Forget?
Alireza Mohammadshahi
Vassilina Nikoulina
Alexandre Berard
Caroline Brun
James Henderson
Laurent Besacier
AI4CE
44
9
0
22 May 2022
Building Machine Translation Systems for the Next Thousand Languages
Ankur Bapna
Isaac Caswell
Julia Kreutzer
Orhan Firat
D. Esch
...
Apurva Shah
Yanping Huang
Zhehuai Chen
Yonghui Wu
Macduff Hughes
56
98
0
09 May 2022
Pre-Trained Multilingual Sequence-to-Sequence Models: A Hope for Low-Resource Language Translation?
E. Lee
Sarubi Thillainathan
Shravan Nayak
Surangika Ranathunga
David Ifeoluwa Adelani
Ruisi Su
Arya D. McCarthy
VLM
21
43
0
16 Mar 2022
DeepNet: Scaling Transformers to 1,000 Layers
Hongyu Wang
Shuming Ma
Li Dong
Shaohan Huang
Dongdong Zhang
Furu Wei
MoE
AI4CE
30
157
0
01 Mar 2022
Data Scaling Laws in NMT: The Effect of Noise and Architecture
Yamini Bansal
Behrooz Ghorbani
Ankush Garg
Biao Zhang
M. Krikun
Colin Cherry
Behnam Neyshabur
Orhan Firat
42
47
0
04 Feb 2022
Towards a Cleaner Document-Oriented Multilingual Crawled Corpus
Julien Abadji
Pedro Ortiz Suarez
Laurent Romary
Benoît Sagot
CLL
45
153
0
17 Jan 2022
Improving Large-scale Language Models and Resources for Filipino
Jan Christian Blaise Cruz
C. Cheng
AI4CE
29
27
0
11 Nov 2021
PhoMT: A High-Quality and Large-Scale Benchmark Dataset for Vietnamese-English Machine Translation
Long Doan
L. T. Nguyen
Nguyen Luong Tran
T. Hoang
Dat Quoc Nguyen
33
22
0
23 Oct 2021
We Need to Talk About Data: The Importance of Data Readiness in Natural Language Processing
Fredrik Olsson
Magnus Sahlgren
26
1
0
11 Oct 2021
Improving Arabic Diacritization by Learning to Diacritize and Translate
Brian Thompson
A. Alshehri
42
10
0
29 Sep 2021
Multilingual Document-Level Translation Enables Zero-Shot Transfer From Sentences to Documents
Biao Zhang
Ankur Bapna
Melvin Johnson
A. Dabirmoghaddam
N. Arivazhagan
Orhan Firat
34
12
0
21 Sep 2021
Facebook AI WMT21 News Translation Task Submission
C. Tran
Shruti Bhosale
James Cross
Philipp Koehn
Sergey Edunov
Angela Fan
VLM
134
81
0
06 Aug 2021
PARADISE: Exploiting Parallel Data for Multilingual Sequence-to-Sequence Pretraining
Machel Reid
Mikel Artetxe
VLM
50
26
0
04 Aug 2021
The USYD-JD Speech Translation System for IWSLT 2021
Liang Ding
Di Wu
Dacheng Tao
37
16
0
24 Jul 2021
A Survey on Low-Resource Neural Machine Translation
Rui Wang
Xu Tan
Renqian Luo
Tao Qin
Tie-Yan Liu
3DV
40
58
0
09 Jul 2021
Machine Translation into Low-resource Language Varieties
Sachin Kumar
Antonios Anastasopoulos
S. Wintner
Yulia Tsvetkov
11
29
0
12 Jun 2021
The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation
Naman Goyal
Cynthia Gao
Vishrav Chaudhary
Peng-Jen Chen
Guillaume Wenzek
Da Ju
Sanjan Krishnan
Marc'Aurelio Ranzato
Francisco Guzmán
Angela Fan
15
559
0
06 Jun 2021
Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages
Gowtham Ramesh
Sumanth Doddapaneni
Aravinth Bheemaraj
Mayank Jobanputra
AK Raghavan
...
K. Deepak
Vivek Raghavan
Anoop Kunchukuttan
Pratyush Kumar
Mitesh Khapra
LRM
37
231
0
12 Apr 2021
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
Julia Kreutzer
Isaac Caswell
Lisa Wang
Ahsan Wahab
D. Esch
...
Duygu Ataman
Orevaoghene Ahia
Oghenefego Ahia
Sweta Agrawal
Mofetoluwa Adeyemi
20
269
0
22 Mar 2021
WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia
Holger Schwenk
Vishrav Chaudhary
Shuo Sun
Hongyu Gong
Francisco Guzmán
CVBM
29
401
0
10 Jul 2019