1911.00359
CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data
1 November 2019
Guillaume Wenzek
Marie-Anne Lachaux
Alexis Conneau
Vishrav Chaudhary
Francisco Guzmán
Armand Joulin
Edouard Grave
Papers citing
"CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data"
50 / 171 papers shown
DTrOCR: Decoder-only Transformer for Optical Character Recognition
Masato Fujitake
64
35
0
30 Aug 2023
Diffusion Language Models Can Perform Many Tasks with Scaling and Instruction-Finetuning
Jiasheng Ye
Zaixiang Zheng
Yu Bao
Lihua Qian
Quanquan Gu
DiffM
57
15
0
23 Aug 2023
Pre-training with Large Language Model-based Document Expansion for Dense Passage Retrieval
Guangyuan Ma
Xing Wu
Peng Wang
Zijia Lin
Songlin Hu
RALM
45
10
0
16 Aug 2023
Towards General Text Embeddings with Multi-stage Contrastive Learning
Zehan Li
Xin Zhang
Yanzhao Zhang
Dingkun Long
Pengjun Xie
Meishan Zhang
71
352
0
07 Aug 2023
Data-Driven Approach for Formality-Sensitive Machine Translation: Language-Specific Handling and Synthetic Data Generation
Seugnjun Lee
Hyeonseok Moon
Chanjun Park
Heu-Jeoung Lim
32
0
0
26 Jun 2023
GIO: Gradient Information Optimization for Training Dataset Selection
Dante Everaert
Christopher Potts
23
3
0
20 Jun 2023
Large-scale Language Model Rescoring on Long-form Data
Tongzhou Chen
Cyril Allauzen
Yinghui Huang
Daniel S. Park
David Rybach
...
Rodrigo Cabrera
Kartik Audhkhasi
Bhuvana Ramabhadran
Pedro J. Moreno
Michael Riley
38
14
0
13 Jun 2023
Unsupervised Paraphrasing of Multiword Expressions
Takashi Wada
Yuji Matsumoto
Timothy Baldwin
Jey Han Lau
34
0
0
02 Jun 2023
Sentence Simplification Using Paraphrase Corpus for Initialization
Kaiyu Liu
Jipeng Qiang
16
0
0
31 May 2023
Parameter-Efficient Fine-Tuning without Introducing New Latency
Baohao Liao
Yan Meng
Christof Monz
24
49
0
26 May 2023
Revisiting non-English Text Simplification: A Unified Multilingual Benchmark
Michael Joseph Ryan
Tarek Naous
Wei Xu
31
25
0
25 May 2023
Towards A Unified View of Sparse Feed-Forward Network in Pretraining Large Language Model
Leo Liu
Tim Dettmers
Xi Lin
Ves Stoyanov
Xian Li
MoE
26
9
0
23 May 2023
Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models
Orevaoghene Ahia
Sachin Kumar
Hila Gonen
Jungo Kasai
David R. Mortensen
Noah A. Smith
Yulia Tsvetkov
53
82
0
23 May 2023
Polyglot or Not? Measuring Multilingual Encyclopedic Knowledge in Foundation Models
Tim Schott
Daniel Furman
Shreshta Bhat
ELM
37
4
0
23 May 2023
GPT-SW3: An Autoregressive Language Model for the Nordic Languages
Ariel Ekgren
Amaru Cuba Gyllensten
Felix Stollenwerk
Joey Öhman
T. Isbister
Evangelia Gogoulou
F. Carlsson
Alice Heiman
Judit Casademont
Magnus Sahlgren
29
13
0
22 May 2023
Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages
Ayyoob Imani
Peiqin Lin
Amir Hossein Kargaran
Silvia Severini
Masoud Jalili Sabet
...
Chunlan Ma
Helmut Schmid
André F. T. Martins
François Yvon
Hinrich Schütze
ALM
LRM
49
96
0
20 May 2023
BERM: Training the Balanced and Extractable Representation for Matching to Improve Generalization Ability of Dense Retrieval
Shicheng Xu
Liang Pang
Huawei Shen
Xueqi Cheng
18
12
0
18 May 2023
How Good are Commercial Large Language Models on African Languages?
Jessica Ojo
Kelechi Ogueji
31
5
0
11 May 2023
MAUPQA: Massive Automatically-created Polish Question Answering Dataset
Piotr Rybak
28
12
0
09 May 2023
LatinCy: Synthetic Trained Pipelines for Latin NLP
Patrick J. Burns
17
10
0
07 May 2023
A Survey of Corpora for Germanic Low-Resource Languages and Dialects
Verena Blaschke
Hinrich Schütze
Barbara Plank
27
13
0
19 Apr 2023
DINOv2: Learning Robust Visual Features without Supervision
Maxime Oquab
Timothée Darcet
Théo Moutakanni
Huy Q. Vo
Marc Szafraniec
...
Hervé Jégou
Julien Mairal
Patrick Labatut
Armand Joulin
Piotr Bojanowski
VLM
CLIP
SSL
154
3,070
0
14 Apr 2023
PWESuite: Phonetic Word Embeddings and Tasks They Facilitate
Vilém Zouhar
Kalvin Chang
Chenxuan Cui
Nathaniel Carlson
Nathaniel R. Robinson
Mrinmaya Sachan
David R. Mortensen
34
2
0
05 Apr 2023
PEACH: Pre-Training Sequence-to-Sequence Multilingual Models for Translation with Semi-Supervised Pseudo-Parallel Document Generation
Alireza Salemi
Amirhossein Abaskohi
Sara Tavakoli
Yadollah Yaghoobzadeh
A. Shakery
AIMat
32
0
0
03 Apr 2023
GreekBART: The First Pretrained Greek Sequence-to-Sequence Model
Iakovos Evdaimon
Hadi Abdine
Christos Xypolopoulos
Stamatis Outsios
Michalis Vazirgiannis
Giorgos Stamou
VLM
36
7
0
03 Apr 2023
Can a Frozen Pretrained Language Model be used for Zero-shot Neural Retrieval on Entity-centric Questions?
Yasuto Hoshi
Daisuke Miyashita
Yasuhiro Morioka
Youyang Ng
Osamu Torii
J. Deguchi
26
0
0
09 Mar 2023
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron
Thibaut Lavril
Gautier Izacard
Xavier Martinet
Marie-Anne Lachaux
...
Faisal Azhar
Aurelien Rodriguez
Armand Joulin
Edouard Grave
Guillaume Lample
ALM
PILM
73
12,418
0
27 Feb 2023
Sentence Simplification via Large Language Models
Yutao Feng
Jipeng Qiang
Yun Li
Yunhao Yuan
Yi Zhu
28
17
0
23 Feb 2023
Poisoning Web-Scale Training Datasets is Practical
Nicholas Carlini
Matthew Jagielski
Christopher A. Choquette-Choo
Daniel Paleka
Will Pearce
Hyrum S. Anderson
Andreas Terzis
Kurt Thomas
Florian Tramèr
SILM
31
182
0
20 Feb 2023
Toolformer: Language Models Can Teach Themselves to Use Tools
Timo Schick
Jane Dwivedi-Yu
Roberto Dessì
Roberta Raileanu
Maria Lomeli
Luke Zettlemoyer
Nicola Cancedda
Thomas Scialom
SyDa
RALM
43
1,608
0
09 Feb 2023
Modeling Sequential Sentence Relation to Improve Cross-lingual Dense Retrieval
Shunyu Zhang
Yaobo Liang
Ming Gong
Daxin Jiang
Nan Duan
27
4
0
03 Feb 2023
Sample-Efficient Unsupervised Domain Adaptation of Speech Recognition Systems: A case study for Modern Greek
Georgios Paraskevopoulos
Theodoros Kouzelis
Georgios Rouvalis
Athanasios Katsamanis
Vassilis Katsouros
Alexandros Potamianos
VLM
30
7
0
31 Dec 2022
Synthetic Pre-Training Tasks for Neural Machine Translation
Zexue He
Graeme W. Blackwood
Yikang Shen
Julian McAuley
Rogerio Feris
29
3
0
19 Dec 2022
ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages
Yekun Chai
Shuohuan Wang
Chao Pang
Yu Sun
Hao Tian
Hua Wu
38
36
0
13 Dec 2022
Efficient Transformers with Dynamic Token Pooling
Piotr Nawrot
J. Chorowski
Adrian Łańcucki
Edoardo Ponti
22
42
0
17 Nov 2022
mOKB6: A Multilingual Open Knowledge Base Completion Benchmark
Shubham Mittal
Keshav Kolluru
Soumen Chakrabarti
Mausam
35
4
0
13 Nov 2022
Addressing Segmentation Ambiguity in Neural Linguistic Steganography
Jumon Nozaki
Yugo Murawaki
15
5
0
12 Nov 2022
ERNIE-UniX2: A Unified Cross-lingual Cross-modal Framework for Understanding and Generation
Bin Shan
Yaqian Han
Weichong Yin
Shuohuan Wang
Yu Sun
Hao Tian
Hua Wu
Haifeng Wang
MLLM
VLM
24
7
0
09 Nov 2022
AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages
Bonaventure F. P. Dossou
A. Tonja
Oreen Yousuf
Salomey Osei
Abigail Oppong
Iyanuoluwa Shode
Oluwabusayo Olufunke Awoyomi
Chris C. Emezue
32
51
0
07 Nov 2022
COCO-DR: Combating Distribution Shifts in Zero-Shot Dense Retrieval with Contrastive and Distributionally Robust Learning
Yue Yu
Chenyan Xiong
Si Sun
Chao Zhang
Arnold Overwijk
VLM
OOD
52
22
0
27 Oct 2022
Beyond English-Centric Bitexts for Better Multilingual Language Representation Learning
Barun Patra
Saksham Singhal
Shaohan Huang
Zewen Chi
Li Dong
Furu Wei
Vishrav Chaudhary
Xia Song
71
23
0
26 Oct 2022
MTet: Multi-domain Translation for English and Vietnamese
C. Ngo
Trieu H. Trinh
Long Phan
H. Tran
Tai Dang
Hieu Duy Nguyen
Minh Le Nguyen
Minh-Thang Luong
VLM
42
8
0
11 Oct 2022
Language Varieties of Italy: Technology Challenges and Opportunities
Alan Ramponi
27
7
0
20 Sep 2022
PEER: A Collaborative Language Model
Timo Schick
Jane Dwivedi-Yu
Zhengbao Jiang
Fabio Petroni
Patrick Lewis
Gautier Izacard
Qingfei You
Christoforos Nalmpantis
Edouard Grave
Sebastian Riedel
ALM
54
93
0
24 Aug 2022
BERTifying Sinhala -- A Comprehensive Analysis of Pre-trained Language Models for Sinhala Text Classification
Vinura Dhananjaya
Piyumal Demotte
Surangika Ranathunga
Sanath Jayasena
27
14
0
16 Aug 2022
Masked Autoencoders As The Unified Learners For Pre-Trained Sentence Representation
Alexander H. Liu
Samuel J. Yang
34
5
0
30 Jul 2022
BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling
Javier de la Rosa
E. G. Ponferrada
Paulo Villegas
Pablo González de Prado Salas
Manu Romero
María Grandury
35
95
0
14 Jul 2022
Re2G: Retrieve, Rerank, Generate
Michael R. Glass
Gaetano Rossiello
Md. Faisal Mahbub Chowdhury
Ankita Rajaram Naik
Pengshan Cai
A. Gliozzo
RALM
35
84
0
13 Jul 2022
Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset
Peter Henderson
M. Krass
Lucia Zheng
Neel Guha
Christopher D. Manning
Dan Jurafsky
Daniel E. Ho
AILaw
ELM
138
97
0
01 Jul 2022
esCorpius: A Massive Spanish Crawling Corpus
Asier Gutiérrez-Fandiño
David Pérez-Fernández
Jordi Armengol-Estapé
D. Griol
Z. Callejas
51
2
0
30 Jun 2022