2505.24689
BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization
30 May 2025
Sander Land
Catherine Arnett
Papers citing "BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization"
13 / 13 papers shown
SuperBPE: Space Travel for Language Models
Alisa Liu
J. Hayase
Valentin Hofmann
Sewoong Oh
Noah A. Smith
Yejin Choi
79
6
0
17 Mar 2025
Tokenization is Sensitive to Language Variation
Anna Wegmann
Dong Nguyen
David Jurgens
117
2
0
24 Feb 2025
Egalitarian Language Representation in Language Models: It All Begins with Tokenizers
Menan Velayuthan
Kengatharaiyer Sarveswaran
55
6
0
17 Sep 2024
Goldfish: Monolingual Language Models for 350 Languages
Tyler A. Chang
Catherine Arnett
Zhuowen Tu
Benjamin Bergen
LRM
83
7
0
19 Aug 2024
MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling
Tomasz Limisiewicz
Terra Blevins
Hila Gonen
Orevaoghene Ahia
Luke Zettlemoyer
54
15
0
15 Mar 2024
A Bit of a Problem: Measurement Disparities in Dataset Sizes Across Languages
Catherine Arnett
Tyler A. Chang
Benjamin Bergen
45
4
0
01 Mar 2024
Tokenization Is More Than Compression
Craig W. Schmidt
Varshini Reddy
Haoran Zhang
Alec Alameddine
Omri Uzan
Yuval Pinter
Chris Tanner
72
31
0
28 Feb 2024
Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs
Aaditya K. Singh
DJ Strouse
58
53
0
22 Feb 2024
Getting the most out of your tokenizer for pre-training and domain adaptation
Gautier Dagan
Gabriele Synnaeve
Baptiste Rozière
66
21
0
01 Feb 2024
CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages
Thuat Nguyen
Chien Van Nguyen
Viet Dac Lai
Hieu Man
Nghia Trung Ngo
Franck Dernoncourt
Ryan Rossi
Thien Huu Nguyen
71
102
0
17 Sep 2023
Goat: Fine-tuned LLaMA Outperforms GPT-4 on Arithmetic Tasks
Tiedong Liu
K. H. Low
ALM
53
84
0
23 May 2023
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
Taku Kudo
John Richardson
142
3,490
0
19 Aug 2018
Neural Machine Translation of Rare Words with Subword Units
Rico Sennrich
Barry Haddow
Alexandra Birch
153
7,683
0
31 Aug 2015