2306.09572
How do different tokenizers perform on downstream tasks in scriptio continua languages?: A case study in Japanese
16 June 2023
T. Fujii
Koki Shibata
Atsuki Yamaguchi
Terufumi Morishita
Yasuhiro Sogawa
Papers citing
"How do different tokenizers perform on downstream tasks in scriptio continua languages?: A case study in Japanese"
8 / 8 papers shown
Title
Overcoming Vocabulary Constraints with Pixel-level Fallback
Jonas F. Lotz
Hendra Setiawan
Stephan Peitz
Yova Kementchedjhieva
43
0
0
02 Apr 2025
Efficient Continual Pre-training of LLMs for Low-resource Languages
Arijit Nag
Soumen Chakrabarti
Animesh Mukherjee
Niloy Ganguly
82
0
0
13 Dec 2024
Responsible Multilingual Large Language Models: A Survey of Development, Applications, and Societal Impact
Junhua Liu
Bin Fu
LRM
31
1
0
23 Oct 2024
Efficacy of ByT5 in Multilingual Translation of Biblical Texts for Underrepresented Languages
Corinne Aars
Lauren Adams
Xiaokan Tian
Zhaoyu Wang
Colton Wismer
Jason Wu
Pablo Rivas
Korn Sooksatra
Matthew Fendt
17
0
0
22 May 2024
Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers
Libo Qin
Qiguang Chen
Yuhang Zhou
Zhi Chen
Hai-Tao Zheng
Lizi Liao
Min Li
Wanxiang Che
Philip S. Yu
LRM
55
36
0
07 Apr 2024
How Important Is Tokenization in French Medical Masked Language Models?
Yanis Labrak
Adrien Bazoge
B. Daille
Mickael Rouvier
Richard Dufour
41
1
0
22 Feb 2024
An Empirical Study on Cross-lingual Vocabulary Adaptation for Efficient Language Model Inference
Atsuki Yamaguchi
Aline Villavicencio
Nikolaos Aletras
27
7
0
16 Feb 2024
How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models
Phillip Rust
Jonas Pfeiffer
Ivan Vulić
Sebastian Ruder
Iryna Gurevych
80
235
0
31 Dec 2020