2004.03720
Byte Pair Encoding is Suboptimal for Language Model Pretraining
7 April 2020
Kaj Bostrom
Greg Durrett
Papers citing
"Byte Pair Encoding is Suboptimal for Language Model Pretraining"
50 / 121 papers shown
When repeats drive the vocabulary: a Byte-Pair Encoding analysis of T2T primate genomes
Marina Popova
Iaroslav Chelombitko
Aleksey Komissarov
25
0
0
13 May 2025
Boosting Performance on ARC is a Matter of Perspective
Daniel Franzen
Jan Disselhoff
David Hartmann
RALM
LRM
52
0
0
08 May 2025
Kuwain 1.5B: An Arabic SLM via Language Injection
Khalil Hennara
Sara Chrouf
Mohamed Motaism Hamed
Zeina Aldallal
Omar Hadid
Safwan AlModhayan
37
1
0
21 Apr 2025
Overcoming Vocabulary Constraints with Pixel-level Fallback
Jonas F. Lotz
Hendra Setiawan
Stephan Peitz
Yova Kementchedjhieva
43
0
0
02 Apr 2025
From Smør-re-brød to Subwords: Training LLMs on Danish, One Morpheme at a Time
Mikkel Wildner Kildeberg
Emil Allerslev Schledermann
Nicolaj Larsen
Rob van der Goot
35
0
0
02 Apr 2025
KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications
M. Bommarito
Daniel Martin Katz
Jillian Bommarito
42
1
0
21 Mar 2025
UniNet: A Unified Multi-granular Traffic Modeling Framework for Network Security
Binghui Wu
D. Divakaran
M. Gurusamy
57
0
0
06 Mar 2025
Vision-LSTM: xLSTM as Generic Vision Backbone
Benedikt Alkin
M. Beck
Korbinian Poppel
Sepp Hochreiter
Johannes Brandstetter
VLM
64
43
0
24 Feb 2025
Deterministic Reversible Data Augmentation for Neural Machine Translation
Jiashu Yao
Heyan Huang
Zeming Liu
Yuhang Guo
51
0
0
21 Feb 2025
LLM Embeddings for Deep Learning on Tabular Data
Boshko Koloski
Andrei Margeloiu
Xiangjian Jiang
Blaž Škrlj
Nikola Simidjievski
M. Jamnik
LMTD
81
0
0
17 Feb 2025
Efficient Language Modeling for Low-Resource Settings with Hybrid RNN-Transformer Architectures
Gabriel Lindenmaier
Sean Papay
Sebastian Padó
65
0
0
02 Feb 2025
Iconicity in Large Language Models
Anna Marklová
Jiří Milička
Leonid Ryvkin
Ľudmila Lacková Bennet
Libuše Kormaníková
46
0
0
10 Jan 2025
Model Decides How to Tokenize: Adaptive DNA Sequence Tokenization with MxDNA
Lifeng Qiao
Peng Ye
Yuchen Ren
Weiqiang Bai
Chaoqi Liang
Xinzhu Ma
Nanqing Dong
W. Ouyang
86
2
0
18 Dec 2024
Xmodel-1.5: An 1B-scale Multilingual LLM
Wang Qun
Liu Yang
Lin Qingquan
Jiang Ling
LRM
44
0
0
15 Nov 2024
Morphological Typology in BPE Subword Productivity and Language Modeling
Iñigo Parra
36
0
0
31 Oct 2024
Evaluating Morphological Compositional Generalization in Large Language Models
Mete Ismayilzada
Yuan Chiang
Jonne Sälevä
Hale Sirin
Abdullatif Köksal
Bhuwan Dhingra
Antoine Bosselut
Lonneke van der Plas
Duygu Ataman
33
2
0
16 Oct 2024
From Tokens to Words: On the Inner Lexicon of LLMs
Guy Kaplan
Matanel Oren
Yuval Reif
Roy Schwartz
52
12
0
08 Oct 2024
Morphological evaluation of subwords vocabulary used by BETO language model
Óscar García-Sierra
Ana Fernández-Pampillón Cesteros
Miguel Ortega-Martín
41
0
0
03 Oct 2024
Exploring Language Model Generalization in Low-Resource Extractive QA
Saptarshi Sengupta
Wenpeng Yin
Preslav Nakov
Shreya Ghosh
Suhang Wang
27
0
0
27 Sep 2024
Development and bilingual evaluation of Japanese medical large language model within reasonably low computational resources
Issey Sukeda
ELM
47
1
0
18 Sep 2024
TeXBLEU: Automatic Metric for Evaluate LaTeX Format
Kyudan Jung
N. Kim
Hyongon Ryu
Sieun Hyeon
Seung-jun Lee
Hyeok-jae Lee
37
0
0
10 Sep 2024
BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training
Pavel Chizhov
Catherine Arnett
Elizaveta Korotkova
Ivan P. Yamshchikov
48
2
0
06 Sep 2024
Batching BPE Tokenization Merges
Alexander P. Morgan
32
0
0
05 Aug 2024
VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation
Kun Qian
Shunji Wan
Claudia Tang
Youzhi Wang
Xuanming Zhang
Maximillian Chen
Zhou Yu
AAML
45
8
0
25 Jun 2024
Unsupervised Morphological Tree Tokenizer
Qingyang Zhu
Xiang Hu
Pengyu Ji
Wei Wu
Kewei Tu
39
0
0
21 Jun 2024
Infusing clinical knowledge into tokenisers for language models
Abul Hasan
Jinge Wu
Quang Ngoc Nguyen
Salomé Andres
Imane Guellil
Huayu Zhang
Arlene Casey
Beatrice Alex
Bruce Guthrie
Honghan Wu
46
1
0
20 Jun 2024
Precision Empowers, Excess Distracts: Visual Question Answering With Dynamically Infused Knowledge In Language Models
Manas Jhalani
Annervaz K M
Pushpak Bhattacharyya
29
0
0
14 Jun 2024
Entropy-Reinforced Planning with Large Language Models for Drug Discovery
Xuefeng Liu
Chih-chan Tien
Peng Ding
Songhao Jiang
Rick L. Stevens
45
4
0
11 Jun 2024
Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models
Sander Land
Max Bartolo
49
21
0
08 May 2024
Investigating Automatic Scoring and Feedback using Large Language Models
G. Katuka
Alexander Gain
Yen-Yun Yu
AI4Ed
ALM
25
3
0
01 May 2024
Nyonic Technical Report
Junfeng Tian
Rui-cang Wang
Cong Li
Yudong Zhou
Jun Liu
Jun Wang
38
0
0
24 Apr 2024
Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge
Khuyagbaatar Batsuren
Ekaterina Vylomova
Verna Dankers
Tsetsuukhei Delgerbaatar
Omri Uzan
Yuval Pinter
Gábor Bella
35
9
0
20 Apr 2024
Training LLMs over Neurally Compressed Text
Brian Lester
Jaehoon Lee
A. Alemi
Jeffrey Pennington
Adam Roberts
Jascha Narain Sohl-Dickstein
Noah Constant
40
6
0
04 Apr 2024
Revisiting subword tokenization: A case study on affixal negation in large language models
Thinh Hung Truong
Yulia Otmakhova
Karin Verspoor
Trevor Cohn
Timothy Baldwin
47
2
0
03 Apr 2024
Forklift: An Extensible Neural Lifter
Jordi Armengol-Estapé
Rodrigo C. O. Rocha
Jackson Woodruff
Pasquale Minervini
Michael F. P. O'Boyle
35
0
0
01 Apr 2024
LLMs are Good Sign Language Translators
Jia Gong
Lin Geng Foo
Yixuan He
Hossein Rahmani
Jun Liu
SLR
76
25
0
01 Apr 2024
Exploring Tokenization Strategies and Vocabulary Sizes for Enhanced Arabic Language Models
M. Alrefaie
Nour Eldin Morsy
Nada Samir
25
6
0
17 Mar 2024
Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance
Omer Goldman
Avi Caciularu
Matan Eyal
Kris Cao
Idan Szpektor
Reut Tsarfaty
51
22
0
10 Mar 2024
Greed is All You Need: An Evaluation of Tokenizer Inference Methods
Omri Uzan
Craig W. Schmidt
Chris Tanner
Yuval Pinter
43
14
0
02 Mar 2024
Tokenization Is More Than Compression
Craig W. Schmidt
Varshini Reddy
Haoran Zhang
Alec Alameddine
Omri Uzan
Yuval Pinter
Chris Tanner
61
28
0
28 Feb 2024
Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs
Aaditya K. Singh
DJ Strouse
43
46
0
22 Feb 2024
The Impact of Word Splitting on the Semantic Content of Contextualized Word Representations
Aina Garí Soler
Matthieu Labeau
Chloé Clavel
VLM
42
2
0
22 Feb 2024
DrBenchmark: A Large Language Understanding Evaluation Benchmark for French Biomedical Domain
Yanis Labrak
Adrien Bazoge
Oumaima El Khettari
Mickael Rouvier
Pacome Constant dit Beaufils
...
B. Daille
Solen Quiniou
Emmanuel Morin
P. Gourraud
Richard Dufour
LM&MA
34
6
0
20 Feb 2024
An Empirical Study on Cross-lingual Vocabulary Adaptation for Efficient Language Model Inference
Atsuki Yamaguchi
Aline Villavicencio
Nikolaos Aletras
27
7
0
16 Feb 2024
Byte Pair Encoding Is All You Need For Automatic Bengali Speech Recognition
Ahnaf Mozib Samin
20
0
0
28 Jan 2024
Tokenization Matters: Navigating Data-Scarce Tokenization for Gender Inclusive Language Technologies
Anaelia Ovalle
Ninareh Mehrabi
Palash Goyal
Jwala Dhamala
Kai-Wei Chang
Richard Zemel
Aram Galstyan
Yuval Pinter
Rahul Gupta
38
10
0
19 Dec 2023
Impact of Tokenization on LLaMa Russian Adaptation
Mikhail Tikhomirov
D. Chernyshev
27
4
0
05 Dec 2023
Multimodal Large Language Models: A Survey
Jiayang Wu
Wensheng Gan
Zefeng Chen
Shicheng Wan
Philip S. Yu
36
169
0
22 Nov 2023
Multi-teacher Distillation for Multilingual Spelling Correction
Jingfen Zhang
Xuan Guo
S. Bodapati
Christopher Potts
KELM
27
3
0
20 Nov 2023
Spoken Word2Vec: Learning Skipgram Embeddings from Speech
Mohammad Amaan Sayeed
Hanan Aldarmaki
24
0
0
15 Nov 2023