Low-Resource Corpus Filtering using Multilingual Sentence Embeddings

20 June 2019
Vishrav Chaudhary
Y. Tang
Francisco Guzmán
Holger Schwenk
Philipp Koehn

Papers citing "Low-Resource Corpus Filtering using Multilingual Sentence Embeddings"

40 / 40 papers shown
End-to-End Speech Translation for Low-Resource Languages Using Weakly Labeled Data
Aishwarya Pothula
Bhavana Akkiraju
Srihari Bandarupalli
Charan D
Santosh Kesiraju
Anil Kumar Vuppala
5
0
0
19 Jun 2025
Catch Me if You Search: When Contextual Web Search Results Affect the Detection of Hallucinations
Mahjabin Nahar
Eun-Ju Lee
Jin Won Park
Dongwon Lee
HILM
152
0
0
01 Apr 2025
Improving the quality of Web-mined Parallel Corpora of Low-Resource Languages using Debiasing Heuristics
Aloka Fernando
Surangika Ranathunga
Nisansa de Silva
135
0
0
26 Feb 2025
A comparison of data filtering techniques for English-Polish LLM-based machine translation in the biomedical domain
Jorge del Pozo Lérida
Kamil Kojs
János Máté
Mikołaj Antoni Barański
Christian Hardmeier
109
0
0
27 Jan 2025
Adapters for Altering LLM Vocabularies: What Languages Benefit the Most?
HyoJung Han
Akiko Eriguchi
Haoran Xu
Hieu T. Hoang
Marine Carpuat
Huda Khayrallah
VLM
87
3
0
12 Oct 2024
How to Learn in a Noisy World? Self-Correcting the Real-World Data Noise in Machine Translation
Yan Meng
Di Wu
Christof Monz
101
1
0
02 Jul 2024
Why Not Transform Chat Large Language Models to Non-English?
Xiang Geng
Ming Zhu
Jiahuan Li
Zhejian Lai
Wei Zou
...
Xinglin Lyu
Min Zhang
Jiajun Chen
Hao Yang
Shujian Huang
68
2
0
22 May 2024
LLMs Are Few-Shot In-Context Low-Resource Language Learners
Samuel Cahyawijaya
Holy Lovenia
Pascale Fung
98
49
0
25 Mar 2024
A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism
Brian Thompson
Mehak Preet Dhaliwal
Peter Frisch
Tobias Domhan
Marcello Federico
88
17
0
11 Jan 2024
Translation Aligned Sentence Embeddings for Turkish Language
Eren Unlu
Unver Ciftci
56
0
0
16 Nov 2023
Separating the Wheat from the Chaff with BREAD: An open-source benchmark and metrics to detect redundancy in text
Isaac Caswell
Lisa Wang
Isabel Papadimitriou
70
0
0
11 Nov 2023
There's no Data Like Better Data: Using QE Metrics for MT Data Filtering
Jan-Thorsten Peter
David Vilar
Daniel Deutsch
Mara Finkelstein
Juraj Juraska
Markus Freitag
53
18
0
09 Nov 2023
Leveraging Multi-lingual Positive Instances in Contrastive Learning to Improve Sentence Embedding
Kaiyan Zhao
Qiyu Wu
Xin-Qiang Cai
Yoshimasa Tsuruoka
32
8
0
16 Sep 2023
Noisy Parallel Data Alignment
Ruoyu Xie
Antonios Anastasopoulos
56
3
0
23 Jan 2023
A Commonsense-Infused Language-Agnostic Learning Framework for Enhancing Prediction of Political Polarity in Multilingual News Headlines
Swati Swati
Adrian Mladenic Grobelnik
Dunja Mladenić
M. Grobelnik
70
3
0
01 Dec 2022
Data Selection Curriculum for Neural Machine Translation
Tasnim Mohiuddin
Philipp Koehn
Vishrav Chaudhary
James Cross
Shruti Bhosale
Shafiq Joty
83
13
0
25 Mar 2022
Improve Sentence Alignment by Divide-and-conquer
Wu Zhang
26
0
0
18 Jan 2022
BitextEdit: Automatic Bitext Editing for Improved Low-Resource Machine Translation
Eleftheria Briakou
Sida Wang
Luke Zettlemoyer
Marjan Ghazvininejad
36
5
0
12 Nov 2021
Self-Supervised Knowledge Assimilation for Expert-Layman Text Style Transfer
Wenda Xu
Michael Stephen Saxon
Misha Sra
Wenjie Wang
MedIm
74
13
0
06 Oct 2021
Integrating Unsupervised Data Generation into Self-Supervised Neural Machine Translation for Low-Resource Languages
Dana Ruiter
Dietrich Klakow
Josef van Genabith
C. España-Bonet
72
9
0
19 Jul 2021
Exploiting Parallel Corpora to Improve Multilingual Embedding based Document and Sentence Alignment
Dilan Sachintha
Lakmali Piyarathna
Charith Rajitha
Surangika Ranathunga
56
3
0
12 Jun 2021
LAWDR: Language-Agnostic Weighted Document Representations from Pre-trained Models
Hongyu Gong
Vishrav Chaudhary
Yuqing Tang
Francisco Guzmán
37
3
0
07 Jun 2021
Learning Feature Weights using Reward Modeling for Denoising Parallel Corpora
G. Kumar
Philipp Koehn
Sanjeev Khudanpur
116
1
0
11 Mar 2021
Score Combination for Improved Parallel Corpus Filtering for Low Resource Conditions
Muhammad N. ElNokrashy
Amr Hendy
M. Abdelghaffar
Mohamed Afify
Ahmed Tawfik
Hany Awadalla
41
3
0
16 Nov 2020
Detecting Hallucinated Content in Conditional Neural Sequence Generation
Chunting Zhou
Graham Neubig
Jiatao Gu
Mona T. Diab
P. Guzmán
Luke Zettlemoyer
Marjan Ghazvininejad
HILM
133
200
0
05 Nov 2020
A Targeted Attack on Black-Box Neural Machine Translation with Parallel Data Poisoning
Chang Xu
Jun Wang
Yuqing Tang
Francisco Guzman
Benjamin I. P. Rubinstein
Trevor Cohn
AAML
79
7
0
02 Nov 2020
Beyond English-Centric Multilingual Machine Translation
Angela Fan
Shruti Bhosale
Holger Schwenk
Zhiyi Ma
Ahmed El-Kishky
...
Vitaliy Liptchinsky
Sergey Edunov
Edouard Grave
Michael Auli
Armand Joulin
LRM
96
859
0
21 Oct 2020
Detecting Fine-Grained Cross-Lingual Semantic Divergences without Supervision by Learning to Rank
Eleftheria Briakou
Marine Carpuat
57
26
0
07 Oct 2020
Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New Datasets for Bengali-English Machine Translation
Tahmid Hasan
Abhik Bhattacharjee
Kazi Samin Mubasshir
Masum Hasan
Madhusudan Basak
M. Rahman
Rifat Shahriyar
VLM
77
77
0
20 Sep 2020
scb-mt-en-th-2020: A Large English-Thai Parallel Corpus
Lalita Lowphansirikul
Charin Polpanumas
Attapol T. Rutherford
Sarana Nutanong
LRM
39
23
0
07 Jul 2020
Cross-lingual Retrieval for Iterative Self-Supervised Training
C. Tran
Y. Tang
Xian Li
Jiatao Gu
RALM
70
75
0
16 Jun 2020
Parallel Corpus Filtering via Pre-trained Language Models
Boliang Zhang
Ajay Nagesh
Kevin Knight
73
31
0
13 May 2020
Automatic Machine Translation Evaluation in Many Languages via Zero-Shot Paraphrasing
Brian Thompson
Matt Post
LRM
65
190
0
30 Apr 2020
Exploiting Sentence Order in Document Alignment
Brian Thompson
Philipp Koehn
46
19
0
30 Apr 2020
Self-Induced Curriculum Learning in Self-Supervised Neural Machine Translation
Dana Ruiter
Josef van Genabith
C. España-Bonet
SSL
51
3
0
07 Apr 2020
Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation
Haiyue Song
Raj Dabre
Atsushi Fujita
Sadao Kurohashi
81
4
0
26 Dec 2019
CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB
Holger Schwenk
Guillaume Wenzek
Sergey Edunov
Edouard Grave
Armand Joulin
96
262
0
10 Nov 2019
CCAligned: A Massive Collection of Cross-Lingual Web-Document Pairs
Ahmed El-Kishky
Vishrav Chaudhary
Francisco Guzman
Philipp Koehn
103
199
0
10 Nov 2019
Facebook AI's WAT19 Myanmar-English Translation Task Submission
Peng-Jen Chen
Jiajun Shen
Matt Le
Vishrav Chaudhary
Ahmed El-Kishky
Guillaume Wenzek
Myle Ott
Marc'Aurelio Ranzato
38
29
0
15 Oct 2019
WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia
Holger Schwenk
Vishrav Chaudhary
Shuo Sun
Hongyu Gong
Francisco Guzmán
CVBM
118
407
0
10 Jul 2019