ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2012.15613
  4. Cited By
How Good is Your Tokenizer? On the Monolingual Performance of
  Multilingual Language Models

How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models

31 December 2020
Phillip Rust
Jonas Pfeiffer
Ivan Vulić
Sebastian Ruder
Iryna Gurevych
ArXivPDFHTML

Papers citing "How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models"

50 / 64 papers shown
Title
Token-free Models for Sarcasm Detection
Token-free Models for Sarcasm Detection
Sumit Mamtani
Maitreya Sonawane
Kanika Agarwal
Nishanth Sanjeev
36
0
0
02 May 2025
Modes of Sequence Models and Learning Coefficients
Modes of Sequence Models and Learning Coefficients
Zhongtian Chen
Daniel Murfet
84
1
0
25 Apr 2025
Large Language Models in Numberland: A Quick Test of Their Numerical Reasoning Abilities
Large Language Models in Numberland: A Quick Test of Their Numerical Reasoning Abilities
Roussel Rahman
ReLM
ELM
LRM
46
0
0
31 Mar 2025
Cross-Tokenizer Distillation via Approximate Likelihood Matching
Cross-Tokenizer Distillation via Approximate Likelihood Matching
Benjamin Minixhofer
Ivan Vulić
E. Ponti
145
0
0
25 Mar 2025
Do Multilingual LLMs Think In English?
Do Multilingual LLMs Think In English?
Lisa Schut
Y. Gal
Sebastian Farquhar
44
3
0
24 Feb 2025
Tokenization is Sensitive to Language Variation
Tokenization is Sensitive to Language Variation
Anna Wegmann
Dong Nguyen
David Jurgens
77
1
0
24 Feb 2025
Optimal word order for non-causal text generation with Large Language Models: the Spanish case
Optimal word order for non-causal text generation with Large Language Models: the Spanish case
Andrea Busto-Castiñeira
Silvia García-Méndez
Francisco de Arriba-Pérez
Francisco J. González Castaño
41
0
0
21 Feb 2025
Facilitating large language model Russian adaptation with Learned Embedding Propagation
Facilitating large language model Russian adaptation with Learned Embedding Propagation
Mikhail Tikhomirov
D. Chernyshev
38
1
0
31 Dec 2024
MoCE: Adaptive Mixture of Contextualization Experts for Byte-based Neural Machine Translation
MoCE: Adaptive Mixture of Contextualization Experts for Byte-based Neural Machine Translation
Langlin Huang
Mengyu Bu
Yang Feng
28
0
0
03 Nov 2024
DEPT: Decoupled Embeddings for Pre-training Language Models
DEPT: Decoupled Embeddings for Pre-training Language Models
Alex Iacob
Lorenzo Sani
Meghdad Kurmanji
William F. Shen
Xinchi Qiu
Dongqi Cai
Yan Gao
Nicholas D. Lane
VLM
139
0
0
07 Oct 2024
Smirk: An Atomically Complete Tokenizer for Molecular Foundation Models
Smirk: An Atomically Complete Tokenizer for Molecular Foundation Models
Alexius Wadell
Anoushka Bhutani
Venkatasubramanian Viswanathan
121
0
0
19 Sep 2024
A Principled Framework for Evaluating on Typologically Diverse Languages
A Principled Framework for Evaluating on Typologically Diverse Languages
Esther Ploeger
Wessel Poelman
Andreas Holck Høeg-Petersen
Anders Schlichtkrull
Miryam de Lhoneux
Johannes Bjerva
36
1
0
06 Jul 2024
A Systematic Survey and Critical Review on Evaluating Large Language
  Models: Challenges, Limitations, and Recommendations
A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations
Md Tahmid Rahman Laskar
Sawsan Alqahtani
M Saiful Bari
Mizanur Rahman
Mohammad Abdullah Matin Khan
...
Chee Wei Tan
Md. Rizwan Parvez
Enamul Hoque
Shafiq R. Joty
Jimmy Huang
ELM
ALM
27
27
0
04 Jul 2024
From Human Judgements to Predictive Models: Unravelling Acceptability in Code-Mixed Sentences
From Human Judgements to Predictive Models: Unravelling Acceptability in Code-Mixed Sentences
Prashant Kodali
Anmol Goel
Likhith Asapu
Vamshi Bonagiri
Anirudh Govil
Monojit Choudhury
Manish Shrivastava
Ponnurangam Kumaraguru
42
0
0
09 May 2024
Unknown Script: Impact of Script on Cross-Lingual Transfer
Unknown Script: Impact of Script on Cross-Lingual Transfer
Wondimagegnhue Tufa
Ilia Markov
Piek Vossen
37
0
0
29 Apr 2024
A Systematic Analysis of Subwords and Cross-Lingual Transfer in
  Multilingual Translation
A Systematic Analysis of Subwords and Cross-Lingual Transfer in Multilingual Translation
Francois Meyer
Jan Buys
29
1
0
29 Mar 2024
On the Challenges and Opportunities in Generative AI
On the Challenges and Opportunities in Generative AI
Laura Manduchi
Kushagra Pandey
Robert Bamler
Ryan Cotterell
Sina Daubener
...
F. Wenzel
Frank Wood
Stephan Mandt
Vincent Fortuin
Vincent Fortuin
56
17
0
28 Feb 2024
The Impact of Word Splitting on the Semantic Content of Contextualized
  Word Representations
The Impact of Word Splitting on the Semantic Content of Contextualized Word Representations
Aina Garí Soler
Matthieu Labeau
Chloé Clavel
VLM
34
2
0
22 Feb 2024
CroissantLLM: A Truly Bilingual French-English Language Model
CroissantLLM: A Truly Bilingual French-English Language Model
Manuel Faysse
Patrick Fernandes
Nuno M. Guerreiro
António Loison
Duarte M. Alves
...
François Yvon
André F.T. Martins
Gautier Viaud
C´eline Hudelot
Pierre Colombo
47
32
0
01 Feb 2024
When Is Multilinguality a Curse? Language Modeling for 250 High- and
  Low-Resource Languages
When Is Multilinguality a Curse? Language Modeling for 250 High- and Low-Resource Languages
Tyler A. Chang
Catherine Arnett
Zhuowen Tu
Benjamin Bergen
LRM
30
7
0
15 Nov 2023
MEGAVERSE: Benchmarking Large Language Models Across Languages,
  Modalities, Models and Tasks
MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks
Sanchit Ahuja
Divyanshu Aggarwal
Varun Gumma
Ishaan Watts
Ashutosh Sathe
...
Rishav Hada
Prachi Jain
Maxamed Axmed
Kalika Bali
Sunayana Sitaram
ELM
34
39
0
13 Nov 2023
Analyzing Cognitive Plausibility of Subword Tokenization
Analyzing Cognitive Plausibility of Subword Tokenization
Lisa Beinborn
Yuval Pinter
29
17
0
20 Oct 2023
A Systematic Study of Performance Disparities in Multilingual
  Task-Oriented Dialogue Systems
A Systematic Study of Performance Disparities in Multilingual Task-Oriented Dialogue Systems
Songbo Hu
Han Zhou
Moy Yuan
Milan Gritta
Guchun Zhang
Ignacio Iacobacci
Anna Korhonen
Ivan Vulić
30
3
0
19 Oct 2023
CodeBPE: Investigating Subtokenization Options for Large Language Model
  Pretraining on Source Code
CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code
Nadezhda Chirkova
Sergey Troshin
21
8
0
01 Aug 2023
Multi3WOZ: A Multilingual, Multi-Domain, Multi-Parallel Dataset for
  Training and Evaluating Culturally Adapted Task-Oriented Dialog Systems
Multi3WOZ: A Multilingual, Multi-Domain, Multi-Parallel Dataset for Training and Evaluating Culturally Adapted Task-Oriented Dialog Systems
Songbo Hu
Han Zhou
Mete Hergul
Milan Gritta
Guchun Zhang
Ignacio Iacobacci
Ivan Vulić
Anna Korhonen
28
10
0
26 Jul 2023
Combating the Curse of Multilinguality in Cross-Lingual WSD by Aligning
  Sparse Contextualized Word Representations
Combating the Curse of Multilinguality in Cross-Lingual WSD by Aligning Sparse Contextualized Word Representations
Gábor Berend
30
7
0
25 Jul 2023
Do All Languages Cost the Same? Tokenization in the Era of Commercial
  Language Models
Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models
Orevaoghene Ahia
Sachin Kumar
Hila Gonen
Jungo Kasai
David R. Mortensen
Noah A. Smith
Yulia Tsvetkov
40
80
0
23 May 2023
Cross-lingual Transfer Can Worsen Bias in Sentiment Analysis
Cross-lingual Transfer Can Worsen Bias in Sentiment Analysis
Seraphina Goldfarb-Tarrant
Bjorn Ross
Adam Lopez
33
7
0
22 May 2023
Comparing Biases and the Impact of Multilingual Training across Multiple
  Languages
Comparing Biases and the Impact of Multilingual Training across Multiple Languages
Sharon Levy
Neha Ann John
Ling Liu
Yogarshi Vyas
Jie Ma
Yoshinari Fujinuma
Miguel Ballesteros
Vittorio Castelli
Dan Roth
26
25
0
18 May 2023
Language Model Tokenizers Introduce Unfairness Between Languages
Language Model Tokenizers Introduce Unfairness Between Languages
Aleksandar Petrov
Emanuele La Malfa
Philip H. S. Torr
Adel Bibi
16
96
0
17 May 2023
SwissBERT: The Multilingual Language Model for Switzerland
SwissBERT: The Multilingual Language Model for Switzerland
Jannis Vamvas
Johannes Graen
Rico Sennrich
35
6
0
23 Mar 2023
Revealing Weaknesses of Vietnamese Language Models Through Unanswerable
  Questions in Machine Reading Comprehension
Revealing Weaknesses of Vietnamese Language Models Through Unanswerable Questions in Machine Reading Comprehension
Son Quoc Tran
Phong Nguyen-Thuan Do
Kiet Van Nguyen
N. Nguyen
37
0
0
16 Mar 2023
MultiCoder: Multi-Programming-Lingual Pre-Training for Low-Resource Code
  Completion
MultiCoder: Multi-Programming-Lingual Pre-Training for Low-Resource Code Completion
Zi Gong
Yinpeng Guo
Pingyi Zhou
Cuiyun Gao
Yasheng Wang
Zenglin Xu
12
8
0
19 Dec 2022
Punctuation Restoration for Singaporean Spoken Languages: English,
  Malay, and Mandarin
Punctuation Restoration for Singaporean Spoken Languages: English, Malay, and Mandarin
Abhinav Rao
Ho Thi-Nga
Chng Eng Siong
19
3
0
10 Dec 2022
Understanding BLOOM: An empirical study on diverse NLP tasks
Understanding BLOOM: An empirical study on diverse NLP tasks
Parag Dakle
Sai Krishna Rallabandi
Preethi Raghavan
AI4CE
31
3
0
27 Nov 2022
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
BigScience Workshop
:
Teven Le Scao
Angela Fan
Christopher Akiki
...
Zhongli Xie
Zifan Ye
M. Bras
Younes Belkada
Thomas Wolf
VLM
101
2,306
0
09 Nov 2022
Incorporating Context into Subword Vocabularies
Incorporating Context into Subword Vocabularies
Shaked Yehezkel
Yuval Pinter
39
8
0
13 Oct 2022
One does not fit all! On the Complementarity of Vision Encoders for
  Vision and Language Tasks
One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks
Gregor Geigle
Chen Cecilia Liu
Jonas Pfeiffer
Iryna Gurevych
VLM
28
1
0
12 Oct 2022
MonoByte: A Pool of Monolingual Byte-level Language Models
MonoByte: A Pool of Monolingual Byte-level Language Models
Hugo Queiroz Abonizio
Leandro Rodrigues de Souza
R. Lotufo
Rodrigo Nogueira
23
1
0
22 Sep 2022
Language Modelling with Pixels
Language Modelling with Pixels
Phillip Rust
Jonas F. Lotz
Emanuele Bugliarello
Elizabeth Salesky
Miryam de Lhoneux
Desmond Elliott
VLM
32
46
0
14 Jul 2022
hmBERT: Historical Multilingual Language Models for Named Entity
  Recognition
hmBERT: Historical Multilingual Language Models for Named Entity Recognition
Stefan Schweter
Luisa März
Katharina Schmid
Erion cCano
38
18
0
31 May 2022
Lifting the Curse of Multilinguality by Pre-training Modular
  Transformers
Lifting the Curse of Multilinguality by Pre-training Modular Transformers
Jonas Pfeiffer
Naman Goyal
Xi Victoria Lin
Xian Li
James Cross
Sebastian Riedel
Mikel Artetxe
LRM
40
139
0
12 May 2022
How Robust is Neural Machine Translation to Language Imbalance in
  Multilingual Tokenizer Training?
How Robust is Neural Machine Translation to Language Imbalance in Multilingual Tokenizer Training?
Shiyue Zhang
Vishrav Chaudhary
Naman Goyal
James Cross
Guillaume Wenzek
Mohit Bansal
Francisco Guzman
31
16
0
29 Apr 2022
You Are What You Write: Preserving Privacy in the Era of Large Language
  Models
You Are What You Write: Preserving Privacy in the Era of Large Language Models
Richard Plant
V. Giuffrida
Dimitra Gkatzia
PILM
20
19
0
20 Apr 2022
mGPT: Few-Shot Learners Go Multilingual
mGPT: Few-Shot Learners Go Multilingual
Oleh Shliazhko
Alena Fenogenova
Maria Tikhonova
Vladislav Mikhailov
Anastasia Kozlova
Tatiana Shavrina
43
149
0
15 Apr 2022
IT5: Text-to-text Pretraining for Italian Language Understanding and
  Generation
IT5: Text-to-text Pretraining for Italian Language Understanding and Generation
Gabriele Sarti
Malvina Nissim
AILaw
10
42
0
07 Mar 2022
Oolong: Investigating What Makes Transfer Learning Hard with Controlled
  Studies
Oolong: Investigating What Makes Transfer Learning Hard with Controlled Studies
Zhengxuan Wu
Alex Tamkin
Isabel Papadimitriou
21
10
0
24 Feb 2022
Between words and characters: A Brief History of Open-Vocabulary
  Modeling and Tokenization in NLP
Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP
Sabrina J. Mielke
Zaid Alyafeai
Elizabeth Salesky
Colin Raffel
Manan Dey
...
Arun Raja
Chenglei Si
Wilson Y. Lee
Benoît Sagot
Samson Tan
30
140
0
20 Dec 2021
WECHSEL: Effective initialization of subword embeddings for
  cross-lingual transfer of monolingual language models
WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models
Benjamin Minixhofer
Fabian Paischer
Navid Rekabsaz
22
73
0
13 Dec 2021
To Augment or Not to Augment? A Comparative Study on Text Augmentation
  Techniques for Low-Resource NLP
To Augment or Not to Augment? A Comparative Study on Text Augmentation Techniques for Low-Resource NLP
Gözde Gül Sahin
30
33
0
18 Nov 2021
12
Next