ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2405.17067
  4. Cited By
Tokenization Matters! Degrading Large Language Models through Challenging Their Tokenization
v1v2 (latest)

Tokenization Matters! Degrading Large Language Models through Challenging Their Tokenization

27 May 2024
Dixuan Wang
Yanda Li
Junyuan Jiang
Zepeng Ding
Ziqin Luo
Guochao Jiang
Jiaqing Liang
Deqing Yang
ArXiv (abs)PDFHTML

Papers citing "Tokenization Matters! Degrading Large Language Models through Challenging Their Tokenization"

50 / 52 papers shown
Title
Incorporating Domain Knowledge into Materials Tokenization
Incorporating Domain Knowledge into Materials Tokenization
Yerim Oh
Jun-Hyung Park
Junho Kim
SungHo Kim
S. Lee
20
0
0
09 Jun 2025
Causal Estimation of Tokenisation Bias
Causal Estimation of Tokenisation Bias
Pietro Lesci
Clara Meister
Thomas Hofmann
Andreas Vlachos
Tiago Pimentel
70
1
0
03 Jun 2025
Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese
Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese
Hanjia Lyu
Jiebo Luo
Jian Kang
Allison Koenecke
51
1
0
28 May 2025
Enhancing Text-to-Image Diffusion Transformer via Split-Text Conditioning
Enhancing Text-to-Image Diffusion Transformer via Split-Text Conditioning
Yu Zhang
Jialei Zhou
Xinchen Li
Qi Zhang
Zhongwei Wan
Tianyu Wang
Duoqian Miao
Changwei Wang
LongBing Cao
DiffM
60
2
0
25 May 2025
Aleph-Alpha-GermanWeb: Improving German-language LLM pre-training with model-based data curation and synthetic data generation
Aleph-Alpha-GermanWeb: Improving German-language LLM pre-training with model-based data curation and synthetic data generation
Thomas F Burns
Letitia Parcalabescu
Stephan Wäldchen
Michael Barlow
Gregor Ziegltrum
Volker Stampa
Bastian Harren
Björn Deiseroth
SyDa
138
0
0
24 Apr 2025
From Principles to Applications: A Comprehensive Survey of Discrete Tokenizers in Generation, Comprehension, Recommendation, and Information Retrieval
From Principles to Applications: A Comprehensive Survey of Discrete Tokenizers in Generation, Comprehension, Recommendation, and Information Retrieval
Jian Jia
Jingtong Gao
Ben Xue
Junhao Wang
Qingpeng Cai
Quan Chen
Xiangyu Zhao
Peng Jiang
Kun Gai
OffRL
145
2
0
18 Feb 2025
Qtok: A Comprehensive Framework for Evaluating Multilingual Tokenizer
  Quality in Large Language Models
Qtok: A Comprehensive Framework for Evaluating Multilingual Tokenizer Quality in Large Language Models
Iaroslav Chelombitko
Egor Safronov
Aleksey Komissarov
76
1
0
16 Oct 2024
BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer
  Training
BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training
Pavel Chizhov
Catherine Arnett
Elizaveta Korotkova
Ivan P. Yamshchikov
85
5
0
06 Sep 2024
GlitchProber: Advancing Effective Detection and Mitigation of Glitch
  Tokens in Large Language Models
GlitchProber: Advancing Effective Detection and Mitigation of Glitch Tokens in Large Language Models
Zhibo Zhang
Wuxia Bai
Yuxi Li
Max Meng
Kaidi Wang
Ling Shi
Li Li
Jun Wang
Haoyu Wang
71
4
0
09 Aug 2024
Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs
Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs
Sheridan Feucht
David Atkinson
Byron C. Wallace
David Bau
104
8
0
28 Jun 2024
Adaptive Reinforcement Learning Planning: Harnessing Large Language
  Models for Complex Information Extraction
Adaptive Reinforcement Learning Planning: Harnessing Large Language Models for Complex Information Extraction
Zepeng Ding
Ruiyang Ke
Wenhao Huang
Guochao Jiang
Yanda Li
Deqing Yang
Jiaqing Liang
89
1
0
17 Jun 2024
Fishing for Magikarp: Automatically Detecting Under-trained Tokens in
  Large Language Models
Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models
Sander Land
Max Bartolo
116
25
0
08 May 2024
Yi: Open Foundation Models by 01.AI
Yi: Open Foundation Models by 01.AI
01. AI
Alex Young
01.AI Alex Young
Bei Chen
Chao Li
...
Yue Wang
Yuxuan Cai
Zhenyu Gu
Zhiyuan Liu
Zonghong Dai
OSLMLRM
315
576
0
07 Mar 2024
Breaking Down the Defenses: A Comparative Survey of Attacks on Large
  Language Models
Breaking Down the Defenses: A Comparative Survey of Attacks on Large Language Models
Arijit Ghosh Chowdhury
Md. Mofijul Islam
Vaibhav Kumar
F. H. Shezan
Vaibhav Kumar
Vinija Jain
Aman Chadha
AAMLPILM
92
34
0
03 Mar 2024
A Comprehensive Survey of Attack Techniques, Implementation, and
  Mitigation Strategies in Large Language Models
A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language Models
Aysan Esmradi
Daniel Wankit Yip
C. Chan
AAML
83
14
0
18 Dec 2023
Qwen Technical Report
Qwen Technical Report
Jinze Bai
Shuai Bai
Yunfei Chu
Zeyu Cui
Kai Dang
...
Zhenru Zhang
Chang Zhou
Jingren Zhou
Xiaohuan Zhou
Tianhang Zhu
OSLM
342
1,922
0
28 Sep 2023
Baichuan 2: Open Large-scale Language Models
Baichuan 2: Open Large-scale Language Models
Ai Ming Yang
Bin Xiao
Bingning Wang
Borong Zhang
Ce Bian
...
Youxin Jiang
Yuchen Gao
Yupeng Zhang
Guosheng Dong
Zhiying Wu
ELMLRM
330
755
0
19 Sep 2023
Llama 2: Open Foundation and Fine-Tuned Chat Models
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron
Louis Martin
Kevin R. Stone
Peter Albert
Amjad Almahairi
...
Sharan Narang
Aurelien Rodriguez
Robert Stojnic
Sergey Edunov
Thomas Scialom
AI4MHALM
510
12,128
0
18 Jul 2023
Instructions as Backdoors: Backdoor Vulnerabilities of Instruction
  Tuning for Large Language Models
Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models
Lyne Tchapmi
Mingyu Derek Ma
Fei Wang
Chaowei Xiao
Muhao Chen
SILM
137
85
0
24 May 2023
GPT-4 Technical Report
GPT-4 Technical Report
OpenAI OpenAI
OpenAI Josh Achiam
Steven Adler
Sandhini Agarwal
Lama Ahmad
...
Shengjia Zhao
Tianhao Zheng
Juntang Zhuang
William Zhuk
Barret Zoph
LLMAGMLLM
1.6K
14,832
0
15 Mar 2023
ChatGPT Participates in a Computer Science Exam
ChatGPT Participates in a Computer Science Exam
Sebastian Bordt
U. V. Luxburg
ELM
90
41
0
08 Mar 2023
Does Synthetic Data Generation of LLMs Help Clinical Text Mining?
Does Synthetic Data Generation of LLMs Help Clinical Text Mining?
Ruixiang Tang
Xiaotian Han
Xiaoqian Jiang
Helen Zhou
LM&MAAI4MHSyDa
101
186
0
08 Mar 2023
A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on
  Reasoning, Hallucination, and Interactivity
A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity
Yejin Bang
Samuel Cahyawijaya
Nayeon Lee
Wenliang Dai
Dan Su
...
Tiezheng Yu
Willy Chung
Quyet V. Do
Yan Xu
Pascale Fung
ReLMLRM
164
1,400
0
08 Feb 2023
ChatGPT Makes Medicine Easy to Swallow: An Exploratory Case Study on
  Simplified Radiology Reports
ChatGPT Makes Medicine Easy to Swallow: An Exploratory Case Study on Simplified Radiology Reports
Katharina Jeblick
B. Schachtner
Jakob Dexl
Andreas Mittermeier
Anna Theresa Stüber
...
Tobias Weber
Philipp Wesp
B. Sabel
J. Ricke
Michael Ingrisch
LM&MAMedIm
176
403
0
30 Dec 2022
ChatGPT: The End of Online Exam Integrity?
ChatGPT: The End of Online Exam Integrity?
Teo Susnjak
DeLMOELM
83
351
0
19 Dec 2022
Legal Prompting: Teaching a Language Model to Think Like a Lawyer
Legal Prompting: Teaching a Language Model to Think Like a Lawyer
Fang Yu
Lee Quartey
Frank Schilder
ELMLRM
54
69
0
02 Dec 2022
Galactica: A Large Language Model for Science
Galactica: A Large Language Model for Science
Ross Taylor
Marcin Kardas
Guillem Cucurull
Thomas Scialom
Anthony Hartshorn
Elvis Saravia
Andrew Poulton
Viktor Kerkez
Robert Stojnic
ELMReLM
128
785
0
16 Nov 2022
GLM-130B: An Open Bilingual Pre-trained Model
GLM-130B: An Open Bilingual Pre-trained Model
Aohan Zeng
Xiao Liu
Zhengxiao Du
Zihan Wang
Hanyu Lai
...
Jidong Zhai
Wenguang Chen
Peng Zhang
Yuxiao Dong
Jie Tang
BDLLRM
386
1,101
0
05 Oct 2022
Law Informs Code: A Legal Informatics Approach to Aligning Artificial
  Intelligence with Humans
Law Informs Code: A Legal Informatics Approach to Aligning Artificial Intelligence with Humans
John J. Nay
ELMAILaw
190
29
0
14 Sep 2022
Prompt Injection: Parameterization of Fixed Inputs
Prompt Injection: Parameterization of Fixed Inputs
Eunbi Choi
Yongrae Jo
Joel Jang
Minjoon Seo
119
30
0
31 May 2022
Large Language Models are Zero-Shot Reasoners
Large Language Models are Zero-Shot Reasoners
Takeshi Kojima
S. Gu
Machel Reid
Yutaka Matsuo
Yusuke Iwasawa
ReLMLRM
596
4,540
0
24 May 2022
Least-to-Most Prompting Enables Complex Reasoning in Large Language
  Models
Least-to-Most Prompting Enables Complex Reasoning in Large Language Models
Denny Zhou
Nathanael Scharli
Le Hou
Jason W. Wei
Nathan Scales
...
Dale Schuurmans
Claire Cui
Olivier Bousquet
Quoc Le
Ed H. Chi
RALMLRMAI4CE
109
1,138
0
21 May 2022
PaLM: Scaling Language Modeling with Pathways
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery
Sharan Narang
Jacob Devlin
Maarten Bosma
Gaurav Mishra
...
Kathy Meier-Hellstern
Douglas Eck
J. Dean
Slav Petrov
Noah Fiedel
PILMLRM
570
6,320
0
05 Apr 2022
Training Compute-Optimal Large Language Models
Training Compute-Optimal Large Language Models
Jordan Hoffmann
Sebastian Borgeaud
A. Mensch
Elena Buchatskaya
Trevor Cai
...
Karen Simonyan
Erich Elsen
Jack W. Rae
Oriol Vinyals
Laurent Sifre
AI4TS
217
1,992
0
29 Mar 2022
CodeGen: An Open Large Language Model for Code with Multi-Turn Program
  Synthesis
CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis
Erik Nijkamp
Bo Pang
Hiroaki Hayashi
Lifu Tu
Haiquan Wang
Yingbo Zhou
Silvio Savarese
Caiming Xiong
ELM
181
1,054
0
25 Mar 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason W. Wei
Xuezhi Wang
Dale Schuurmans
Maarten Bosma
Brian Ichter
F. Xia
Ed H. Chi
Quoc Le
Denny Zhou
LM&RoLRMAI4CEReLM
999
9,796
0
28 Jan 2022
TruthfulQA: Measuring How Models Mimic Human Falsehoods
TruthfulQA: Measuring How Models Mimic Human Falsehoods
Stephanie C. Lin
Jacob Hilton
Owain Evans
HILM
151
1,953
0
08 Sep 2021
Knowledge Neurons in Pretrained Transformers
Knowledge Neurons in Pretrained Transformers
Damai Dai
Li Dong
Y. Hao
Zhifang Sui
Baobao Chang
Furu Wei
KELMMU
155
466
0
18 Apr 2021
Superbizarre Is Not Superb: Derivational Morphology Improves BERT's
  Interpretation of Complex Words
Superbizarre Is Not Superb: Derivational Morphology Improves BERT's Interpretation of Complex Words
Valentin Hofmann
J. Pierrehumbert
Hinrich Schütze
118
72
0
02 Jan 2021
Extracting Training Data from Large Language Models
Extracting Training Data from Large Language Models
Nicholas Carlini
Florian Tramèr
Eric Wallace
Matthew Jagielski
Ariel Herbert-Voss
...
Tom B. Brown
Basel Alomair
Ulfar Erlingsson
Alina Oprea
Colin Raffel
MLAUSILM
562
1,964
0
14 Dec 2020
Language Models are Few-Shot Learners
Language Models are Few-Shot Learners
Tom B. Brown
Benjamin Mann
Nick Ryder
Melanie Subbiah
Jared Kaplan
...
Christopher Berner
Sam McCandlish
Alec Radford
Ilya Sutskever
Dario Amodei
BDL
1.1K
42,651
0
28 May 2020
Language (Technology) is Power: A Critical Survey of "Bias" in NLP
Language (Technology) is Power: A Critical Survey of "Bias" in NLP
Su Lin Blodgett
Solon Barocas
Hal Daumé
Hanna M. Wallach
159
1,257
0
28 May 2020
Multilingual Denoising Pre-training for Neural Machine Translation
Multilingual Denoising Pre-training for Neural Machine Translation
Yinhan Liu
Jiatao Gu
Naman Goyal
Xian Li
Sergey Edunov
Marjan Ghazvininejad
M. Lewis
Luke Zettlemoyer
AI4CEAIMat
128
1,818
0
22 Jan 2020
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language
  Generation, Translation, and Comprehension
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
M. Lewis
Yinhan Liu
Naman Goyal
Marjan Ghazvininejad
Abdel-rahman Mohamed
Omer Levy
Veselin Stoyanov
Luke Zettlemoyer
AIMatVLM
268
10,913
0
29 Oct 2019
Thieves on Sesame Street! Model Extraction of BERT-based APIs
Thieves on Sesame Street! Model Extraction of BERT-based APIs
Kalpesh Krishna
Gaurav Singh Tomar
Ankur P. Parikh
Nicolas Papernot
Mohit Iyyer
MIACVMLAU
154
201
0
27 Oct 2019
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and
  lighter
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
Victor Sanh
Lysandre Debut
Julien Chaumond
Thomas Wolf
313
7,575
0
02 Oct 2019
ALBERT: A Lite BERT for Self-supervised Learning of Language
  Representations
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
Zhenzhong Lan
Mingda Chen
Sebastian Goodman
Kevin Gimpel
Piyush Sharma
Radu Soricut
SSLAIMat
504
6,482
0
26 Sep 2019
RoBERTa: A Robustly Optimized BERT Pretraining Approach
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu
Myle Ott
Naman Goyal
Jingfei Du
Mandar Joshi
Danqi Chen
Omer Levy
M. Lewis
Luke Zettlemoyer
Veselin Stoyanov
AIMat
794
24,615
0
26 Jul 2019
BERT: Pre-training of Deep Bidirectional Transformers for Language
  Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin
Ming-Wei Chang
Kenton Lee
Kristina Toutanova
VLMSSLSSeg
1.9K
95,554
0
11 Oct 2018
SentencePiece: A simple and language independent subword tokenizer and
  detokenizer for Neural Text Processing
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
Taku Kudo
John Richardson
283
3,537
0
19 Aug 2018
12
Next