1906.08885
Cited By
Low-Resource Corpus Filtering using Multilingual Sentence Embeddings
20 June 2019
Vishrav Chaudhary
Y. Tang
Francisco Guzmán
Holger Schwenk
Philipp Koehn
Papers citing
"Low-Resource Corpus Filtering using Multilingual Sentence Embeddings"
40 / 40 papers shown
Title
End-to-End Speech Translation for Low-Resource Languages Using Weakly Labeled Data
Aishwarya Pothula
Bhavana Akkiraju
Srihari Bandarupalli
Charan D
Santosh Kesiraju
Anil Kumar Vuppala
5
0
0
19 Jun 2025
Catch Me if You Search: When Contextual Web Search Results Affect the Detection of Hallucinations
Mahjabin Nahar
Eun-Ju Lee
Jin Won Park
Dongwon Lee
HILM
152
0
0
01 Apr 2025
Improving the quality of Web-mined Parallel Corpora of Low-Resource Languages using Debiasing Heuristics
Aloka Fernando
Surangika Ranathunga
Nisansa de Silva
135
0
0
26 Feb 2025
A comparison of data filtering techniques for English-Polish LLM-based machine translation in the biomedical domain
Jorge del Pozo Lérida
Kamil Kojs
János Máté
Mikołaj Antoni Barański
Christian Hardmeier
109
0
0
27 Jan 2025
Adapters for Altering LLM Vocabularies: What Languages Benefit the Most?
HyoJung Han
Akiko Eriguchi
Haoran Xu
Hieu T. Hoang
Marine Carpuat
Huda Khayrallah
VLM
87
3
0
12 Oct 2024
How to Learn in a Noisy World? Self-Correcting the Real-World Data Noise in Machine Translation
Yan Meng
Di Wu
Christof Monz
101
1
0
02 Jul 2024
Why Not Transform Chat Large Language Models to Non-English?
Xiang Geng
Ming Zhu
Jiahuan Li
Zhejian Lai
Wei Zou
...
Xinglin Lyu
Min Zhang
Jiajun Chen
Hao Yang
Shujian Huang
68
2
0
22 May 2024
LLMs Are Few-Shot In-Context Low-Resource Language Learners
Samuel Cahyawijaya
Holy Lovenia
Pascale Fung
98
49
0
25 Mar 2024
A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism
Brian Thompson
Mehak Preet Dhaliwal
Peter Frisch
Tobias Domhan
Marcello Federico
88
17
0
11 Jan 2024
Translation Aligned Sentence Embeddings for Turkish Language
Eren Unlu
Unver Ciftci
56
0
0
16 Nov 2023
Separating the Wheat from the Chaff with BREAD: An open-source benchmark and metrics to detect redundancy in text
Isaac Caswell
Lisa Wang
Isabel Papadimitriou
70
0
0
11 Nov 2023
There's no Data Like Better Data: Using QE Metrics for MT Data Filtering
Jan-Thorsten Peter
David Vilar
Daniel Deutsch
Mara Finkelstein
Juraj Juraska
Markus Freitag
53
18
0
09 Nov 2023
Leveraging Multi-lingual Positive Instances in Contrastive Learning to Improve Sentence Embedding
Kaiyan Zhao
Qiyu Wu
Xin-Qiang Cai
Yoshimasa Tsuruoka
32
8
0
16 Sep 2023
Noisy Parallel Data Alignment
Ruoyu Xie
Antonios Anastasopoulos
56
3
0
23 Jan 2023
A Commonsense-Infused Language-Agnostic Learning Framework for Enhancing Prediction of Political Polarity in Multilingual News Headlines
Swati Swati
Adrian Mladenic Grobelnik
Dunja Mladenić
M. Grobelnik
70
3
0
01 Dec 2022
Data Selection Curriculum for Neural Machine Translation
Tasnim Mohiuddin
Philipp Koehn
Vishrav Chaudhary
James Cross
Shruti Bhosale
Shafiq Joty
83
13
0
25 Mar 2022
Improve Sentence Alignment by Divide-and-conquer
Wu Zhang
26
0
0
18 Jan 2022
BitextEdit: Automatic Bitext Editing for Improved Low-Resource Machine Translation
Eleftheria Briakou
Sida Wang
Luke Zettlemoyer
Marjan Ghazvininejad
36
5
0
12 Nov 2021
Self-Supervised Knowledge Assimilation for Expert-Layman Text Style Transfer
Wenda Xu
Michael Stephen Saxon
Misha Sra
Wenjie Wang
MedIm
74
13
0
06 Oct 2021
Integrating Unsupervised Data Generation into Self-Supervised Neural Machine Translation for Low-Resource Languages
Dana Ruiter
Dietrich Klakow
Josef van Genabith
C. España-Bonet
72
9
0
19 Jul 2021
Exploiting Parallel Corpora to Improve Multilingual Embedding based Document and Sentence Alignment
Dilan Sachintha
Lakmali Piyarathna
Charith Rajitha
Surangika Ranathunga
56
3
0
12 Jun 2021
LAWDR: Language-Agnostic Weighted Document Representations from Pre-trained Models
Hongyu Gong
Vishrav Chaudhary
Yuqing Tang
Francisco Guzmán
37
3
0
07 Jun 2021
Learning Feature Weights using Reward Modeling for Denoising Parallel Corpora
G. Kumar
Philipp Koehn
Sanjeev Khudanpur
116
1
0
11 Mar 2021
Score Combination for Improved Parallel Corpus Filtering for Low Resource Conditions
Muhammad N. ElNokrashy
Amr Hendy
M. Abdelghaffar
Mohamed Afify
Ahmed Tawfik
Hany Awadalla
41
3
0
16 Nov 2020
Detecting Hallucinated Content in Conditional Neural Sequence Generation
Chunting Zhou
Graham Neubig
Jiatao Gu
Mona T. Diab
P. Guzmán
Luke Zettlemoyer
Marjan Ghazvininejad
HILM
133
200
0
05 Nov 2020
A Targeted Attack on Black-Box Neural Machine Translation with Parallel Data Poisoning
Chang Xu
Jun Wang
Yuqing Tang
Francisco Guzman
Benjamin I. P. Rubinstein
Trevor Cohn
AAML
79
7
0
02 Nov 2020
Beyond English-Centric Multilingual Machine Translation
Angela Fan
Shruti Bhosale
Holger Schwenk
Zhiyi Ma
Ahmed El-Kishky
...
Vitaliy Liptchinsky
Sergey Edunov
Edouard Grave
Michael Auli
Armand Joulin
LRM
96
859
0
21 Oct 2020
Detecting Fine-Grained Cross-Lingual Semantic Divergences without Supervision by Learning to Rank
Eleftheria Briakou
Marine Carpuat
57
26
0
07 Oct 2020
Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New Datasets for Bengali-English Machine Translation
Tahmid Hasan
Abhik Bhattacharjee
Kazi Samin Mubasshir
Masum Hasan
Madhusudan Basak
M. Rahman
Rifat Shahriyar
VLM
77
77
0
20 Sep 2020
scb-mt-en-th-2020: A Large English-Thai Parallel Corpus
Lalita Lowphansirikul
Charin Polpanumas
Attapol T. Rutherford
Sarana Nutanong
LRM
39
23
0
07 Jul 2020
Cross-lingual Retrieval for Iterative Self-Supervised Training
C. Tran
Y. Tang
Xian Li
Jiatao Gu
RALM
70
75
0
16 Jun 2020
Parallel Corpus Filtering via Pre-trained Language Models
Boliang Zhang
Ajay Nagesh
Kevin Knight
73
31
0
13 May 2020
Automatic Machine Translation Evaluation in Many Languages via Zero-Shot Paraphrasing
Brian Thompson
Matt Post
LRM
65
190
0
30 Apr 2020
Exploiting Sentence Order in Document Alignment
Brian Thompson
Philipp Koehn
46
19
0
30 Apr 2020
Self-Induced Curriculum Learning in Self-Supervised Neural Machine Translation
Dana Ruiter
Josef van Genabith
C. España-Bonet
SSL
51
3
0
07 Apr 2020
Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation
Haiyue Song
Raj Dabre
Atsushi Fujita
Sadao Kurohashi
81
4
0
26 Dec 2019
CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB
Holger Schwenk
Guillaume Wenzek
Sergey Edunov
Edouard Grave
Armand Joulin
96
262
0
10 Nov 2019
CCAligned: A Massive Collection of Cross-Lingual Web-Document Pairs
Ahmed El-Kishky
Vishrav Chaudhary
Francisco Guzman
Philipp Koehn
103
199
0
10 Nov 2019
Facebook AI's WAT19 Myanmar-English Translation Task Submission
Peng-Jen Chen
Jiajun Shen
Matt Le
Vishrav Chaudhary
Ahmed El-Kishky
Guillaume Wenzek
Myle Ott
Marc'Aurelio Ranzato
38
29
0
15 Oct 2019
WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia
Holger Schwenk
Vishrav Chaudhary
Shuo Sun
Hongyu Gong
Francisco Guzmán
CVBM
118
407
0
10 Jul 2019