2402.18376
Cited By
Tokenization Is More Than Compression
28 February 2024
Craig W. Schmidt
Varshini Reddy
Haoran Zhang
Alec Alameddine
Omri Uzan
Yuval Pinter
Chris Tanner
Papers citing
"Tokenization Is More Than Compression"
20 / 20 papers shown
Title
S-DAT: A Multilingual, GenAI-Driven Framework for Automated Divergent Thinking Assessment
J. Haase
P. Hanel
Sebastian Pokutta
LRM
21
0
0
14 May 2025
Position: Foundation Models Need Digital Twin Representations
Yiqing Shen
Hao Ding
Lalithkumar Seenivasan
Tianmin Shu
Mathias Unberath
AI4CE
40
0
0
01 May 2025
GReaTER: Generate Realistic Tabular data after data Enhancement and Reduction
Tung Sum Thomas Kwok
Chi-Hua Wang
Guang Cheng
LMTD
71
1
0
19 Mar 2025
Splintering Nonconcatenative Languages for Better Tokenization
Bar Gazit
Shaltiel Shmidman
Avi Shmidman
Yuval Pinter
57
0
0
18 Mar 2025
SuperBPE: Space Travel for Language Models
Alisa Liu
J. Hayase
Valentin Hofmann
Sewoong Oh
Noah A. Smith
Yejin Choi
48
3
0
17 Mar 2025
Anything Goes? A Crosslinguistic Study of (Im)possible Language Learning in LMs
Xiulin Yang
Tatsuya Aoyama
Yuekun Yao
Ethan Wilcox
50
1
0
26 Feb 2025
Tokenization is Sensitive to Language Variation
Anna Wegmann
Dong Nguyen
David Jurgens
84
1
0
24 Feb 2025
When Worse is Better: Navigating the compression-generation tradeoff in visual tokenization
Vivek Ramanujan
Kushal Tirumala
Armen Aghajanyan
Luke Zettlemoyer
Ali Farhadi
DiffM
74
2
0
20 Dec 2024
DateLogicQA: Benchmarking Temporal Biases in Large Language Models
Gagan Bhatia
MingZe Tang
Cristina Mahanta
Madiha Kazi
79
0
0
17 Dec 2024
An Enhanced Text Compression Approach Using Transformer-based Language Models
C. M. Rahman
Mahbub E Sobhani
Anika Tasnim Rodela
Swakkhar Shatabda
69
0
0
15 Dec 2024
From Tokens to Words: On the Inner Lexicon of LLMs
Guy Kaplan
Matanel Oren
Yuval Reif
Roy Schwartz
48
12
0
08 Oct 2024
Understanding and Mitigating Tokenization Bias in Language Models
Buu Phan
Marton Havasi
Matthew Muckley
Karen Ullrich
44
3
0
24 Jun 2024
Lexically Grounded Subword Segmentation
Jindřich Libovický
Jindřich Helcl
35
1
0
19 Jun 2024
Scaffold-BPE: Enhancing Byte Pair Encoding with Simple and Effective Scaffold Token Removal
Haoran Lian
Yizhe Xiong
Jianwei Niu
Shasha Mo
Zhenpeng Su
Zijia Lin
Peng Liu
Hui Chen
Guiguang Ding
34
1
0
27 Apr 2024
Training LLMs over Neurally Compressed Text
Brian Lester
Jaehoon Lee
A. Alemi
Jeffrey Pennington
Adam Roberts
Jascha Narain Sohl-Dickstein
Noah Constant
40
6
0
04 Apr 2024
Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance
Omer Goldman
Avi Caciularu
Matan Eyal
Kris Cao
Idan Szpektor
Reut Tsarfaty
51
22
0
10 Mar 2024
Greed is All You Need: An Evaluation of Tokenizer Inference Methods
Omri Uzan
Craig W. Schmidt
Chris Tanner
Yuval Pinter
38
14
0
02 Mar 2024
Subobject-level Image Tokenization
Delong Chen
Samuel Cahyawijaya
Jianfeng Liu
Baoyuan Wang
Pascale Fung
VLM
OCL
54
7
0
22 Feb 2024
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
279
1,996
0
31 Dec 2020
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Yonghui Wu
M. Schuster
Z. Chen
Quoc V. Le
Mohammad Norouzi
...
Alex Rudnick
Oriol Vinyals
G. Corrado
Macduff Hughes
J. Dean
AIMat
716
6,746
0
26 Sep 2016