Byte Pair Encoding is Suboptimal for Language Model Pretraining


7 April 2020
Kaj Bostrom
Greg Durrett

Papers citing "Byte Pair Encoding is Suboptimal for Language Model Pretraining"

50 / 121 papers shown
Improving Korean NLP Tasks with Linguistically Informed Subword Tokenization and Sub-character Decomposition
Tae-Hee Jeon
Bongseok Yang
ChangHwan Kim
Yoonseob Lim
21
0
0
07 Nov 2023
Counting the Bugs in ChatGPT's Wugs: A Multilingual Investigation into the Morphological Capabilities of a Large Language Model
Leonie Weissweiler
Valentin Hofmann
Anjali Kantharuban
Anna Cai
Ritam Dutt
...
Abhishek Vijayakumar
Haofei Yu
Hinrich Schütze
Kemal Oflazer
David R. Mortensen
36
10
0
23 Oct 2023
Analyzing Cognitive Plausibility of Subword Tokenization
Lisa Beinborn
Yuval Pinter
29
17
0
20 Oct 2023
Prediction of Arabic Legal Rulings using Large Language Models
Adel Ammar
Anis Koubaa
Bilel Benjdira
Omar Najar
Serry Sibaee
AILaw
ELM
25
7
0
16 Oct 2023
Tokenizer Choice For LLM Training: Negligible or Crucial?
Mehdi Ali
Michael Fromm
Klaudia Thellmann
Richard Rutmann
Max Lübbering
...
Malte Ostendorff
Samuel Weinbach
R. Sifa
Stefan Kesselheim
Nicolas Flores-Herr
23
47
0
12 Oct 2023
Exploring the Impact of Training Data Distribution and Subword Tokenization on Gender Bias in Machine Translation
Bar Iluz
Tomasz Limisiewicz
Gabriel Stanovsky
David Mareček
37
3
0
21 Sep 2023
Automated CVE Analysis for Threat Prioritization and Impact Prediction
Ehsan Aghaei
E. Al-Shaer
W. Shadid
Xi Niu
6
4
0
06 Sep 2023
Utilizing Semantic Textual Similarity for Clinical Survey Data Feature Selection
Benjamin C. Warner
Ziqi Xu
S. Haroutounian
Thomas Kannampallil
Chenyang Lu
22
2
0
19 Aug 2023
CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code
Nadezhda Chirkova
Sergey Troshin
21
8
0
01 Aug 2023
MorphPiece : A Linguistic Tokenizer for Large Language Models
Jeffrey Hsu
32
3
0
14 Jul 2023
A Formal Perspective on Byte-Pair Encoding
Vilém Zouhar
Clara Meister
Juan Luis Gastaldi
Li Du
Tim Vieira
Mrinmaya Sachan
Ryan Cotterell
26
26
0
29 Jun 2023
How do different tokenizers perform on downstream tasks in scriptio continua languages?: A case study in Japanese
T. Fujii
Koki Shibata
Atsuki Yamaguchi
Terufumi Morishita
Yasuhiro Sogawa
26
13
0
16 Jun 2023
Tokenization with Factorized Subword Encoding
David Samuel
Lilja Øvrelid
41
1
0
13 Jun 2023
Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models
Orevaoghene Ahia
Sachin Kumar
Hila Gonen
Jungo Kasai
David R. Mortensen
Noah A. Smith
Yulia Tsvetkov
53
82
0
23 May 2023
SLaDe: A Portable Small Language Model Decompiler for Optimized Assembly
Jordi Armengol-Estapé
Jackson Woodruff
Chris Cummins
Michael F. P. O'Boyle
48
16
0
21 May 2023
Effects of sub-word segmentation on performance of transformer language models
Jue Hou
Anisia Katinskaia
Anh Vu
R. Yangarber
21
4
0
09 May 2023
What is the best recipe for character-level encoder-only modelling?
Kris Cao
42
2
0
09 May 2023
KINLP at SemEval-2023 Task 12: Kinyarwanda Tweet Sentiment Analysis
Antoine Nzeyimana
20
3
0
25 Apr 2023
Tokenization Preference for Human and Machine Learning Model: An Annotation Study
Tatsuya Hiraoka
Tomoya Iwakura
32
1
0
21 Apr 2023
Downstream Task-Oriented Neural Tokenizer Optimization with Vocabulary Restriction as Post Processing
Tatsuya Hiraoka
Tomoya Iwakura
20
0
0
21 Apr 2023
EC^2: Emergent Communication for Embodied Control
Yao Mu
Shunyu Yao
Mingyu Ding
Ping Luo
Chuang Gan
LM&Ro
37
19
0
19 Apr 2023
BloombergGPT: A Large Language Model for Finance
Shijie Wu
Ozan Irsoy
Steven Lu
Vadim Dabravolski
Mark Dredze
Sebastian Gehrmann
P. Kambadur
David S. Rosenberg
Gideon Mann
AIFin
99
789
0
30 Mar 2023
SIGMORPHON 2023 Shared Task of Interlinear Glossing: Baseline Model
Michael Ginn
16
7
0
24 Mar 2023
An Overview on Language Models: Recent Developments and Outlook
Chengwei Wei
Yun Cheng Wang
Bin Wang
C.-C. Jay Kuo
33
42
0
10 Mar 2023
Byte Pair Encoding for Symbolic Music
Nathan Fradet
Nicolas Gutowski
F. Chhel
Jean-Pierre Briot
29
16
0
27 Jan 2023
A Measure-Theoretic Characterization of Tight Language Models
Li Du
Lucas Torroba Hennigen
Tiago Pimentel
Clara Meister
Jason Eisner
Ryan Cotterell
36
30
0
20 Dec 2022
Tokenization Consistency Matters for Generative Models on Extractive NLP Tasks
Kaiser Sun
Peng Qi
Yuhao Zhang
Lan Liu
William Yang Wang
Zhiheng Huang
29
7
0
19 Dec 2022
Inducing Character-level Structure in Subword-based Language Models with Type-level Interchange Intervention Training
Jing-ling Huang
Zhengxuan Wu
Kyle Mahowald
Christopher Potts
24
13
0
19 Dec 2022
MANTa: Efficient Gradient-Based Tokenization for Robust End-to-End Language Modeling
Nathan Godey
Roman Castagné
Eric Villemonte de la Clergerie
Benoît Sagot
21
3
0
14 Dec 2022
Efficient Transformers with Dynamic Token Pooling
Piotr Nawrot
J. Chorowski
Adrian Łańcucki
Edoardo Ponti
22
42
0
17 Nov 2022
Incorporating Context into Subword Vocabularies
Shaked Yehezkel
Yuval Pinter
47
8
0
13 Oct 2022
Are word boundaries useful for unsupervised language learning?
Tu Nguyen
Maureen de Seyssel
Robin Algayres
Patricia Roze
Ewan Dunbar
Emmanuel Dupoux
49
9
0
06 Oct 2022
MaxMatch-Dropout: Subword Regularization for WordPiece
Tatsuya Hiraoka
51
8
0
09 Sep 2022
Lost in Space Marking
Cassandra L. Jacobs
Yuval Pinter
14
1
0
02 Aug 2022
Benchmarking Azerbaijani Neural Machine Translation
Chih-Chen Chen
William Chen
21
0
0
29 Jul 2022
Language Modelling with Pixels
Phillip Rust
Jonas F. Lotz
Emanuele Bugliarello
Elizabeth Salesky
Miryam de Lhoneux
Desmond Elliott
VLM
38
46
0
14 Jul 2022
Reduce Indonesian Vocabularies with an Indonesian Sub-word Separator
Mukhlis Amien
Chong Feng
Heyan Huang
24
0
0
01 Jul 2022
Square One Bias in NLP: Towards a Multi-Dimensional Exploration of the Research Manifold
Sebastian Ruder
Ivan Vulić
Anders Søgaard
41
29
0
20 Jun 2022
Local Byte Fusion for Neural Machine Translation
Makesh Narsimhan Sreedhar
Xiangpeng Wan
Yu-Jie Cheng
Junjie Hu
34
4
0
23 May 2022
Quantifying Synthesis and Fusion and their Impact on Machine Translation
Arturo Oncevay
Duygu Ataman
N. V. Berkel
Barry Haddow
Alexandra Birch
Johannes Bjerva
20
3
0
06 May 2022
How Robust is Neural Machine Translation to Language Imbalance in Multilingual Tokenizer Training?
Shiyue Zhang
Vishrav Chaudhary
Naman Goyal
James Cross
Guillaume Wenzek
Joey Tianyi Zhou
Francisco Guzman
38
16
0
29 Apr 2022
How can NLP Help Revitalize Endangered Languages? A Case Study and Roadmap for the Cherokee Language
Shiyue Zhang
B. Frey
Joey Tianyi Zhou
19
36
0
25 Apr 2022
A Vocabulary-Free Multilingual Neural Tokenizer for End-to-End Task Learning
Md. Mofijul Islam
Gustavo Aguilar
Pragaash Ponnusamy
Clint Solomon Mathialagan
Chengyuan Ma
Chenlei Guo
VLM
11
10
0
22 Apr 2022
Impact of Tokenization on Language Models: An Analysis for Turkish
Cagri Toraman
E. Yilmaz
Furkan Şahinuç
Oguzhan Ozcelik
38
74
0
19 Apr 2022
Improving Tokenisation by Alternative Treatment of Spaces
Edward Gow-Smith
Harish Tayyar Madabushi
Carolina Scarton
Aline Villavicencio
37
20
0
08 Apr 2022
Morphology Without Borders: Clause-Level Morphology
Omer Goldman
Reut Tsarfaty
AILaw
44
3
0
25 Feb 2022
Fine-Tuning Transformers: Vocabulary Transfer
Vladislav D. Mosin
Igor Samenko
Alexey Tikhonov
Borislav M. Kozlovskii
Ivan P. Yamshchikov
25
19
0
29 Dec 2021
Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP
Sabrina J. Mielke
Zaid Alyafeai
Elizabeth Salesky
Colin Raffel
Manan Dey
...
Arun Raja
Chenglei Si
Wilson Y. Lee
Benoît Sagot
Samson Tan
32
143
0
20 Dec 2021
Wine is Not v i n. -- On the Compatibility of Tokenizations Across Languages
Antonis Maronikolakis
Philipp Dufter
Hinrich Schütze
19
17
0
13 Sep 2021
You should evaluate your language model on marginal likelihood over tokenisations
Kris Cao
Laura Rimell
33
23
0
06 Sep 2021