Unpacking Tokenization: Evaluating Text Compression and its Correlation
with Model Performance

Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance

10 March 2024

Omer Goldman

Papers citing "Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance"

8 / 8 papers shown

Title
SuperBPE: Space Travel for Language Models Alisa Liu J. Hayase Valentin Hofmann Sewoong Oh Noah A. Smith Yejin Choi 43 3 0 17 Mar 2025
DateLogicQA: Benchmarking Temporal Biases in Large Language Models Gagan Bhatia MingZe Tang Cristina Mahanta Madiha Kazi 79 0 0 17 Dec 2024
Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles Buu Phan Brandon Amos Itai Gat Marton Havasi Matthew Muckley Karen Ullrich 47 1 0 11 Oct 2024
Smirk: An Atomically Complete Tokenizer for Molecular Foundation Models Alexius Wadell Anoushka Bhutani Venkatasubramanian Viswanathan 127 0 0 19 Sep 2024
Tokenization Is More Than Compression Craig W. Schmidt Varshini Reddy Haoran Zhang Alec Alameddine Omri Uzan Yuval Pinter Chris Tanner 40 28 0 28 Feb 2024
Subobject-level Image Tokenization Delong Chen Samuel Cahyawijaya Jianfeng Liu Baoyuan Wang Pascale Fung VLM OCL 51 7 0 22 Feb 2024
OLMo: Accelerating the Science of Language Models Dirk Groeneveld Iz Beltagy Pete Walsh Akshita Bhagia Rodney Michael Kinney ... Jesse Dodge Kyle Lo Luca Soldaini Noah A. Smith Hanna Hajishirzi OSLM 135 358 0 01 Feb 2024
How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models Phillip Rust Jonas Pfeiffer Ivan Vulić Sebastian Ruder Iryna Gurevych 80 235 0 31 Dec 2020