Byte Pair Encoding is Suboptimal for Language Model Pretraining
Kaj Bostrom, Greg Durrett
arXiv:2004.03720 · 7 April 2020
Papers citing "Byte Pair Encoding is Suboptimal for Language Model Pretraining" (21 of 121 papers shown)
Perceiver IO: A General Architecture for Structured Inputs & Outputs
Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, ..., Olivier J. Hénaff, M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira
30 Jul 2021
Adapt-and-Distill: Developing Small, Fast and Effective Pretrained Language Models for Domains
Yunzhi Yao, Shaohan Huang, Wenhui Wang, Li Dong, Furu Wei
25 Jun 2021
Charformer: Fast Character Transformers via Gradient-based Subword Tokenization
Yi Tay, Vinh Q. Tran, Sebastian Ruder, Jai Gupta, Hyung Won Chung, Dara Bahri, Zhen Qin, Simon Baumgartner, Cong Yu, Donald Metzler
23 Jun 2021
Evaluating Various Tokenizers for Arabic Text Classification
Zaid Alyafeai, Maged S. Al-Shaibani, Mustafa Ghaleb, Irfan Ahmad
14 Jun 2021
Empirical Evaluation of Pre-trained Transformers for Human-Level NLP: The Role of Sample Size and Dimensionality
Adithya V Ganesan, Matthew Matero, Aravind Reddy Ravula, Huy-Hien Vu, H. Andrew Schwartz
07 May 2021
How (Non-)Optimal is the Lexicon?
Tiago Pimentel, Irene Nikkarinen, Kyle Mahowald, Ryan Cotterell, Damián E. Blasi
29 Apr 2021
Multi-view Subword Regularization
Xinyi Wang, Sebastian Ruder, Graham Neubig
15 Mar 2021
CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation
J. Clark, Dan Garrette, Iulia Turc, John Wieting
11 Mar 2021
Superbizarre Is Not Superb: Derivational Morphology Improves BERT's Interpretation of Complex Words
Valentin Hofmann, J. Pierrehumbert, Hinrich Schütze
02 Jan 2021
Morphology Matters: A Multilingual Language Modeling Analysis
Hyunji Hayley Park, Katherine J. Zhang, Coleman Haley, K. Steimel, Han Liu, Lane Schwartz
11 Dec 2020
Pre-training Protein Language Models with Label-Agnostic Binding Pairs Enhances Performance in Downstream Tasks
Modestas Filipavicius, Matteo Manica, Joris Cadow, María Rodríguez Martínez
05 Dec 2020
Indic-Transformers: An Analysis of Transformer Language Models for Indian Languages
Kushal Kumar Jain, Adwait Deshpande, Kumar Shridhar, F. Laumann, Ayushman Dash
04 Nov 2020
Char2Subword: Extending the Subword Embedding Space Using Robust Character Compositionality
Gustavo Aguilar, Bryan McCann, Tong Niu, Nazneen Rajani, N. Keskar, Thamar Solorio
24 Oct 2020
Dynamic Contextualized Word Embeddings
Valentin Hofmann, J. Pierrehumbert, Hinrich Schütze
23 Oct 2020
UniCase -- Rethinking Casing in Language Models
Rafal Powalski, Tomasz Stanislawek
22 Oct 2020
An Empirical Study of Tokenization Strategies for Various Korean NLP Tasks
Kyubyong Park, Joohong Lee, Seongbo Jang, Dawoon Jung
06 Oct 2020
Will it Unblend?
Yuval Pinter, Cassandra L. Jacobs, Jacob Eisenstein
18 Sep 2020
Automated Source Code Generation and Auto-completion Using Deep Learning: Comparing and Discussing Current Language-Model-Related Approaches
Juan Cruz-Benito, Sanjay Vishwakarma, Francisco Martín-Fernández, Ismael Faro (IBM Quantum)
16 Sep 2020
A systematic comparison of grapheme-based vs. phoneme-based label units for encoder-decoder-attention models
Mohammad Zeineldeen, Albert Zeyer, Wei Zhou, T. Ng, Ralf Schlüter, Hermann Ney
19 May 2020
DagoBERT: Generating Derivational Morphology with a Pretrained Language Model
Valentin Hofmann, J. Pierrehumbert, Hinrich Schütze
02 May 2020
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Yonghui Wu, M. Schuster, Z. Chen, Quoc V. Le, Mohammad Norouzi, ..., Alex Rudnick, Oriol Vinyals, G. Corrado, Macduff Hughes, J. Dean
26 Sep 2016