A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism

11 January 2024

Mehak Preet Dhaliwal

Marcello Federico

Papers citing "A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism"

Adapters for Altering LLM Vocabularies: What Languages Benefit the Most? HyoJung Han Akiko Eriguchi Haoran Xu Hieu T. Hoang Marine Carpuat Huda Khayrallah VLM 69 3 0 12 Oct 2024
Cross-lingual Human-Preference Alignment for Neural Machine Translation with Direct Quality Optimization Kaden Uhlig Joern Wuebker Raphael Reinauer John DeNero 80 0 0 26 Sep 2024
Dubbing in Practice: A Large Scale Study of Human Localization With Insights for Automatic Dubbing William Brannon Yogesh Virkar Brian Thompson 52 22 0 23 Dec 2022
What Language Model to Train if You Have One Million GPU Hours? Teven Le Scao Thomas Wang Daniel Hesslow Lucile Saulnier Stas Bekman ... Lintang Sutawika Jaesung Tae Zheng-Xin Yong Julien Launay Iz Beltagy MoE AI4CE 265 107 0 27 Oct 2022
No Language Left Behind: Scaling Human-Centered Machine Translation NLLB Team Marta R. Costa-jussà James Cross Onur Çelebi Maha Elbayad ... Alexandre Mourachko C. Ropers Safiyyah Saleem Holger Schwenk Jeff Wang MoE 215 1,258 0 11 Jul 2022
PaLM: Scaling Language Modeling with Pathways Aakanksha Chowdhery Sharan Narang Jacob Devlin Maarten Bosma Gaurav Mishra ... Kathy Meier-Hellstern Douglas Eck J. Dean Slav Petrov Noah Fiedel PILM LRM 459 6,231 0 05 Apr 2022
Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus Jesse Dodge Maarten Sap Ana Marasović William Agnew Gabriel Ilharco Dirk Groeneveld Margaret Mitchell Matt Gardner AILaw 110 446 0 18 Apr 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling Leo Gao Stella Biderman Sid Black Laurence Golding Travis Hoppe ... Horace He Anish Thite Noa Nabeshima Shawn Presser Connor Leahy AIMat 436 2,091 0 31 Dec 2020
Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus Isaac Caswell Theresa Breiner D. Esch Ankur Bapna 65 89 0 27 Oct 2020
Language-agnostic BERT Sentence Embedding Fangxiaoyu Feng Yinfei Yang Daniel Cer N. Arivazhagan Wei Wang 159 904 0 03 Jul 2020
Automatic Machine Translation Evaluation in Many Languages via Zero-Shot Paraphrasing Brian Thompson Matt Post LRM 53 190 0 30 Apr 2020
CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB Holger Schwenk Guillaume Wenzek Sergey Edunov Edouard Grave Armand Joulin 79 260 0 10 Nov 2019
Low-Resource Corpus Filtering using Multilingual Sentence Embeddings Vishrav Chaudhary Y. Tang Francisco Guzmán Holger Schwenk Philipp Koehn 62 79 0 20 Jun 2019
Dual Conditional Cross-Entropy Filtering of Noisy Parallel Corpora Marcin Junczys-Dowmunt 45 135 0 01 Sep 2018
Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Machine Translation Antonio Toral Sheila Castilho Ke Hu Andy Way 48 190 0 30 Aug 2018
Has Machine Translation Achieved Human Parity? A Case for Document-level Evaluation Samuel Läubli Rico Sennrich M. Volk 41 258 0 21 Aug 2018
Billion-scale similarity search with GPUs Jeff Johnson Matthijs Douze Hervé Jégou 257 3,720 0 28 Feb 2017