CCAligned: A Massive Collection of Cross-Lingual Web-Document Pairs

10 November 2019
Ahmed El-Kishky
Vishrav Chaudhary
Francisco Guzman
Philipp Koehn

Papers citing "CCAligned: A Massive Collection of Cross-Lingual Web-Document Pairs"

50 / 121 papers shown
Title
In-Domain African Languages Translation Using LLMs and Multi-armed Bandits
Pratik Rakesh Singh
Kritarth Prasad
Mohammadi Zaki
Pankaj Wasnik
12
0
0
21 May 2025
A kinetic-based regularization method for data science applications
Abhisek Ganguly
Alessandro Gabbana
Vybhav Rao
Sauro Succi
Santosh Ansumali
57
0
0
06 Mar 2025
Anything Goes? A Crosslinguistic Study of (Im)possible Language Learning in LMs
Xiulin Yang
Tatsuya Aoyama
Yuekun Yao
Ethan Wilcox
52
1
0
26 Feb 2025
Improving the quality of Web-mined Parallel Corpora of Low-Resource Languages using Debiasing Heuristics
Aloka Fernando
Surangika Ranathunga
Nisansa de Silva
48
0
0
26 Feb 2025
NusaAksara: A Multimodal and Multilingual Benchmark for Preserving Indonesian Indigenous Scripts
Muhammad Farid Adilazuarda
M. Wijanarko
Lucky Susanto
Khumaisa Nur'aini
Derry Wijaya
Alham Fikri Aji
57
0
0
25 Feb 2025
Beyond Literal Token Overlap: Token Alignability for Multilinguality
Katharina Hämmerl
Tomasz Limisiewicz
Jindrich Libovický
Alexander Fraser
51
0
0
10 Feb 2025
Multilingual Machine Translation with Open Large Language Models at Practical Scale: An Empirical Study
Menglong Cui
Pengzhi Gao
Wei Liu
Jian Luan
Bin Wang
LRM
45
3
0
04 Feb 2025
Sinhala Transliteration: A Comparative Analysis Between Rule-based and Seq2Seq Approaches
Yomal De Mel
Kasun Wickramasinghe
Nisansa de Silva
Surangika Ranathunga
46
1
0
03 Jan 2025
MuLan: Adapting Multilingual Diffusion Models for Hundreds of Languages with Negligible Cost
Sen Xing
Muyan Zhong
Zeqiang Lai
Liangchen Li
Jing Liu
Yaohui Wang
Jifeng Dai
Wenhai Wang
90
1
0
02 Dec 2024
Code-Switching Curriculum Learning for Multilingual Transfer in LLMs
Haneul Yoo
Cheonbok Park
Sangdoo Yun
Alice Oh
Hwaran Lee
37
3
0
04 Nov 2024
Responsible Multilingual Large Language Models: A Survey of Development, Applications, and Societal Impact
Junhua Liu
Bin Fu
LRM
37
1
0
23 Oct 2024
Bridging the Language Gaps in Large Language Models with Inference-Time Cross-Lingual Intervention
Weixuan Wang
Minghao Wu
Barry Haddow
Alexandra Birch
LRM
24
4
0
16 Oct 2024
State of NLP in Kenya: A Survey
Cynthia Jayne Amol
Everlyn Asiko Chimoto
Rose Delilah Gesicho
Antony M. Gitau
Naome A. Etori
...
Catherine Gitau
Antony Ndolo
Lilian D. A. Wanzare
Albert Njoroge Kahira
Ronald Tombe
34
1
0
13 Oct 2024
Cross-lingual Human-Preference Alignment for Neural Machine Translation with Direct Quality Optimization
Kaden Uhlig
Joern Wuebker
Raphael Reinauer
John DeNero
43
0
0
26 Sep 2024
Pula: Training Large Language Models for Setswana
Nathan Brown
Vukosi Marivate
OSLM
45
0
0
05 Aug 2024
Reuse, Don't Retrain: A Recipe for Continued Pretraining of Language Models
Jupinder Parmar
Sanjev Satheesh
M. Patwary
M. Shoeybi
Bryan Catanzaro
56
29
0
09 Jul 2024
Data, Data Everywhere: A Guide for Pretraining Dataset Construction
Jupinder Parmar
Shrimai Prabhumoye
Joseph Jennings
Bo Liu
Aastha Jhunjhunwala
Zhilin Wang
M. Patwary
M. Shoeybi
Bryan Catanzaro
53
6
0
08 Jul 2024
How to Learn in a Noisy World? Self-Correcting the Real-World Data Noise in Machine Translation
Yan Meng
Di Wu
Christof Monz
36
1
0
02 Jul 2024
Fairness and Bias in Multimodal AI: A Survey
Tosin Adewumi
Lama Alkhaled
Namrata Gurung
G. V. Boven
Irene Pagliai
58
9
0
27 Jun 2024
Leveraging Large Language Models to Measure Gender Bias in Gendered Languages
Erik Derner
Sara Sansalvador de la Fuente
Yoan Gutiérrez
Paloma Moreda
Nuria Oliver
32
1
0
19 Jun 2024
Feriji: A French-Zarma Parallel Corpus, Glossary & Translator
Mamadou K. Keita
Elysabhete Amadou Ibrahim
Habibatou Abdoulaye Alfari
Christopher Homan
26
1
0
09 Jun 2024
Recovering document annotations for sentence-level bitext
R. Wicks
Matt Post
Philipp Koehn
39
4
0
06 Jun 2024
Smart Bilingual Focused Crawling of Parallel Documents
Cristian García-Romero
Miquel Espla-Gomis
Felipe Sánchez-Martínez
24
0
0
23 May 2024
A Japanese-Chinese Parallel Corpus Using Crowdsourcing for Web Mining
Masaaki Nagata
Makoto Morishita
Katsuki Chousa
Norihito Yasuda
29
2
0
15 May 2024
Kreyòl-MT: Building MT for Latin American, Caribbean and Colonial African Creole Languages
Nathaniel R. Robinson
Raj Dabre
Ammon Shurtz
Rasul Dent
Onenamiyi Onesi
...
Matthew Dean Stutzman
Bismarck Odoom
Sanjeev Khudanpur
Stephen D. Richardson
Kenton Murray
MoE
49
6
0
08 May 2024
Charles Translator: A Machine Translation System between Ukrainian and Czech
Martin Popel
Lucie Poláková
Michal Novák
Jindřich Helcl
Jindrich Libovický
Pavel Stranák
Tomás Krabac
Jaroslava Hlavácová
Mariia Anisimova
Tereza Chlanová
27
0
0
10 Apr 2024
Sailor: Open Language Models for South-East Asia
Longxu Dou
Qian Liu
Guangtao Zeng
Jia Guo
Jiahui Zhou
Wei Lu
Min Lin
LRM
40
8
0
04 Apr 2024
Backdoor Attack on Multilingual Machine Translation
Jun Wang
Qiongkai Xu
Xuanli He
Benjamin I. P. Rubinstein
Trevor Cohn
26
5
0
03 Apr 2024
Improving Vietnamese-English Medical Machine Translation
Nhu Vo
Dat Quoc Nguyen
Dung D. Le
Massimo Piccardi
Wray Buntine
LM&MA
40
0
0
28 Mar 2024
LLMs Are Few-Shot In-Context Low-Resource Language Learners
Samuel Cahyawijaya
Holy Lovenia
Pascale Fung
48
37
0
25 Mar 2024
Tower: An Open Multilingual Large Language Model for Translation-Related Tasks
Duarte M. Alves
José P. Pombal
Nuno M. Guerreiro
Pedro H. Martins
Joao Alves
...
Patrick Fernandes
Sweta Agrawal
Pierre Colombo
José G. C. de Souza
André F.T. Martins
LRM
57
132
0
27 Feb 2024
GATE X-E : A Challenge Set for Gender-Fair Translations from Weakly-Gendered Languages
Spencer Rarrick
Ranjita Naik
Sundar Poudel
Vishal Chowdhary
39
1
0
22 Feb 2024
Quality Does Matter: A Detailed Look at the Quality and Utility of Web-Mined Parallel Corpora
Surangika Ranathunga
Nisansa de Silva
Menan Velayuthan
Aloka Fernando
Charitha Rathnayake
39
12
0
12 Feb 2024
Stolen Subwords: Importance of Vocabularies for Machine Translation Model Stealing
Vilém Zouhar
AAML
40
0
0
29 Jan 2024
Leveraging Closed-Access Multilingual Embedding for Automatic Sentence Alignment in Low Resource Languages
Idris Abdulmumin
Auwal Abubakar Khalid
Shamsuddeen Hassan Muhammad
I. Ahmad
L. Aliyu
Babangida Sani
B.M. Abduljalil
Sani Ahmad Hassan
34
0
0
20 Nov 2023
Sinhala-English Word Embedding Alignment: Introducing Datasets and Benchmark for a Low Resource Language
Kasun Wickramasinghe
Nisansa de Silva
33
0
0
17 Nov 2023
Investigating Multi-Pivot Ensembling with Massively Multilingual Machine Translation Models
Alireza Mohammadshahi
Jannis Vamvas
Rico Sennrich
LRM
32
0
0
13 Nov 2023
EMMA-X: An EM-like Multilingual Pre-training Algorithm for Cross-lingual Representation Learning
Ping Guo
Xiangpeng Wei
Yue Hu
Baosong Yang
Dayiheng Liu
Fei Huang
Jun Xie
26
2
0
26 Oct 2023
Tokenization and the Noiseless Channel
Vilém Zouhar
Clara Meister
Juan Luis Gastaldi
Li Du
Mrinmaya Sachan
Ryan Cotterell
30
31
0
29 Jun 2023
A Formal Perspective on Byte-Pair Encoding
Vilém Zouhar
Clara Meister
Juan Luis Gastaldi
Li Du
Tim Vieira
Mrinmaya Sachan
Ryan Cotterell
26
26
0
29 Jun 2023
Learning Multilingual Sentence Representations with Cross-lingual Consistency Regularization
Pengzhi Gao
Liwen Zhang
Zhongjun He
Hua Wu
Haifeng Wang
35
6
0
12 Jun 2023
Leveraging Auxiliary Domain Parallel Data in Intermediate Task Fine-tuning for Low-resource Translation
Shravan Nayak
Surangika Ranathunga
Sarubi Thillainathan
Rikki Hung
Anthony Rinaldi
Yining Wang
Jonah Mackey
Andrew Ho
E. Lee
26
5
0
02 Jun 2023
Eliciting the Translation Ability of Large Language Models via Multilingual Finetuning with Translation Instructions
Jiahuan Li
Hao Zhou
Shujian Huang
Shan Chen
Jiajun Chen
LRM
41
55
0
24 May 2023
Mitigating Data Imbalance and Representation Degeneration in Multilingual Machine Translation
Wen Lai
Alexandra Chronopoulou
Alexander Fraser
40
5
0
22 May 2023
Soft Prompt Decoding for Multilingual Dense Retrieval
Zhiqi Huang
Hansi Zeng
Hamed Zamani
James Allan
RALM
63
13
0
15 May 2023
Train Global, Tailor Local: Minimalist Multilingual Translation into Endangered Languages
Zhong Zhou
Jan Niehues
Alexander Waibel
37
0
0
05 May 2023
$\varepsilon$ KÚ <MASK>: Integrating Yorùbá cultural greetings into machine translation
Idris Akinade
Jesujoba Oluwadara Alabi
David Ifeoluwa Adelani
Clement Odoje
Dietrich Klakow
25
9
0
31 Mar 2023
PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing
Xiaozhe Ren
Pingyi Zhou
Xinfan Meng
Xinjing Huang
Yadao Wang
...
Jiansheng Wei
Xin Jiang
Teng Su
Qun Liu
Jun Yao
ALM
MoE
75
61
0
20 Mar 2023
The ROOTS Search Tool: Data Transparency for LLMs
Aleksandra Piktus
Christopher Akiki
Paulo Villegas
Hugo Laurençon
Gérard Dupont
A. Luccioni
Yacine Jernite
Anna Rogers
VLM
41
29
0
27 Feb 2023
Improving Cross-lingual Information Retrieval on Low-Resource Languages via Optimal Transport Distillation
Zhiqi Huang
Puxuan Yu
James Allan
VLM
40
26
0
29 Jan 2023