ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2306.16842
  4. Cited By
Tokenization and the Noiseless Channel

Tokenization and the Noiseless Channel

29 June 2023
Vilém Zouhar
Clara Meister
Juan Luis Gastaldi
Li Du
Mrinmaya Sachan
Ryan Cotterell
ArXivPDFHTML

Papers citing "Tokenization and the Noiseless Channel"

24 / 24 papers shown
Title
Regional Tiny Stories: Using Small Models to Compare Language Learning and Tokenizer Performance
Regional Tiny Stories: Using Small Models to Compare Language Learning and Tokenizer Performance
Nirvan Patil
Malhar Abhay Inamdar
Agnivo Gosai
Guruprasad Pathak
Anish Joshi
Aryan Sagavekar
Anish Joshirao
Raj Abhijit Dandekar
Rajat Dandekar
Sreedath Panat
50
0
0
07 Apr 2025
Splintering Nonconcatenative Languages for Better Tokenization
Splintering Nonconcatenative Languages for Better Tokenization
Bar Gazit
Shaltiel Shmidman
Avi Shmidman
Yuval Pinter
70
0
0
18 Mar 2025
Tokenization is Sensitive to Language Variation
Tokenization is Sensitive to Language Variation
Anna Wegmann
Dong Nguyen
David Jurgens
99
1
0
24 Feb 2025
Team Ryu's Submission to SIGMORPHON 2024 Shared Task on Subword
  Tokenization
Team Ryu's Submission to SIGMORPHON 2024 Shared Task on Subword Tokenization
Zilong Li
37
0
0
19 Oct 2024
Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles
Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles
Buu Phan
Brandon Amos
Itai Gat
Marton Havasi
Matthew Muckley
Karen Ullrich
66
1
0
11 Oct 2024
From Tokens to Words: On the Inner Lexicon of LLMs
From Tokens to Words: On the Inner Lexicon of LLMs
Guy Kaplan
Matanel Oren
Yuval Reif
Roy Schwartz
64
14
0
08 Oct 2024
BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer
  Training
BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training
Pavel Chizhov
Catherine Arnett
Elizaveta Korotkova
Ivan P. Yamshchikov
53
3
0
06 Sep 2024
Distributional Properties of Subword Regularization
Distributional Properties of Subword Regularization
Marco Cognetta
Vilém Zouhar
Naoaki Okazaki
48
0
0
21 Aug 2024
Understanding and Mitigating Tokenization Bias in Language Models
Understanding and Mitigating Tokenization Bias in Language Models
Buu Phan
Marton Havasi
Matthew Muckley
Karen Ullrich
59
5
0
24 Jun 2024
How to Compute the Probability of a Word
How to Compute the Probability of a Word
Tiago Pimentel
Clara Meister
60
14
0
20 Jun 2024
Lexically Grounded Subword Segmentation
Lexically Grounded Subword Segmentation
Jindřich Libovický
Jindřich Helcl
91
1
0
19 Jun 2024
Evaluating Subword Tokenization: Alien Subword Composition and OOV
  Generalization Challenge
Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge
Khuyagbaatar Batsuren
Ekaterina Vylomova
Verna Dankers
Tsetsuukhei Delgerbaatar
Omri Uzan
Yuval Pinter
Gábor Bella
45
10
0
20 Apr 2024
An Analysis of BPE Vocabulary Trimming in Neural Machine Translation
An Analysis of BPE Vocabulary Trimming in Neural Machine Translation
Marco Cognetta
Tatsuya Hiraoka
Naoaki Okazaki
Rico Sennrich
Yuval Pinter
66
2
0
30 Mar 2024
Greed is All You Need: An Evaluation of Tokenizer Inference Methods
Greed is All You Need: An Evaluation of Tokenizer Inference Methods
Omri Uzan
Craig W. Schmidt
Chris Tanner
Yuval Pinter
51
14
0
02 Mar 2024
Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent
  on Language Models
Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models
Frederik Kunstner
Robin Yadav
Alan Milligan
Mark Schmidt
Alberto Bietti
51
26
0
29 Feb 2024
Tokenization Is More Than Compression
Tokenization Is More Than Compression
Craig W. Schmidt
Varshini Reddy
Haoran Zhang
Alec Alameddine
Omri Uzan
Yuval Pinter
Chris Tanner
61
30
0
28 Feb 2024
Subobject-level Image Tokenization
Subobject-level Image Tokenization
Delong Chen
Samuel Cahyawijaya
Jianfeng Liu
Baoyuan Wang
Pascale Fung
VLM
OCL
71
8
0
22 Feb 2024
Stolen Subwords: Importance of Vocabularies for Machine Translation
  Model Stealing
Stolen Subwords: Importance of Vocabularies for Machine Translation Model Stealing
Vilém Zouhar
AAML
45
0
0
29 Jan 2024
Task-Adaptive Tokenization: Enhancing Long-Form Text Generation Efficacy
  in Mental Health and Beyond
Task-Adaptive Tokenization: Enhancing Long-Form Text Generation Efficacy in Mental Health and Beyond
Siyang Liu
Naihao Deng
Sahand Sabour
Yilin Jia
Minlie Huang
Rada Mihalcea
43
19
0
09 Oct 2023
Scaling Experiments in Self-Supervised Cross-Table Representation
  Learning
Scaling Experiments in Self-Supervised Cross-Table Representation Learning
Maximilian Schambach
Dominique Paul
Wei Le
LMTD
38
2
0
29 Sep 2023
Headless Language Models: Learning without Predicting with Contrastive
  Weight Tying
Headless Language Models: Learning without Predicting with Contrastive Weight Tying
Nathan Godey
Eric Villemonte de la Clergerie
Benoît Sagot
47
3
0
15 Sep 2023
A Formal Perspective on Byte-Pair Encoding
A Formal Perspective on Byte-Pair Encoding
Vilém Zouhar
Clara Meister
Juan Luis Gastaldi
Li Du
Tim Vieira
Mrinmaya Sachan
Ryan Cotterell
36
27
0
29 Jun 2023
Tokenization Preference for Human and Machine Learning Model: An
  Annotation Study
Tokenization Preference for Human and Machine Learning Model: An Annotation Study
Tatsuya Hiraoka
Tomoya Iwakura
37
1
0
21 Apr 2023
Checks and Strategies for Enabling Code-Switched Machine Translation
Checks and Strategies for Enabling Code-Switched Machine Translation
Thamme Gowda
Mozhdeh Gheini
Jonathan May
43
3
0
11 Oct 2022
1