How do different tokenizers perform on downstream tasks in scriptio continua languages?: A case study in Japanese

16 June 2023

Papers citing "How do different tokenizers perform on downstream tasks in scriptio continua languages?: A case study in Japanese"


Title
Overcoming Vocabulary Constraints with Pixel-level Fallback | Jonas F. Lotz, Hendra Setiawan, Stephan Peitz, Yova Kementchedjhieva | 43 0 0 | 02 Apr 2025
Efficient Continual Pre-training of LLMs for Low-resource Languages | Arijit Nag, Soumen Chakrabarti, Animesh Mukherjee, Niloy Ganguly | 82 0 0 | 13 Dec 2024
Responsible Multilingual Large Language Models: A Survey of Development, Applications, and Societal Impact | Junhua Liu, Bin Fu | LRM 31 1 0 | 23 Oct 2024
Efficacy of ByT5 in Multilingual Translation of Biblical Texts for Underrepresented Languages | Corinne Aars, Lauren Adams, Xiaokan Tian, Zhaoyu Wang, Colton Wismer, Jason Wu, Pablo Rivas, Korn Sooksatra, Matthew Fendt | 17 0 0 | 22 May 2024
Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers | Libo Qin, Qiguang Chen, Yuhang Zhou, Zhi Chen, Hai-Tao Zheng, Lizi Liao, Min Li, Wanxiang Che, Philip S. Yu | LRM 55 36 0 | 07 Apr 2024
How Important Is Tokenization in French Medical Masked Language Models? | Yanis Labrak, Adrien Bazoge, B. Daille, Mickael Rouvier, Richard Dufour | 41 1 0 | 22 Feb 2024
An Empirical Study on Cross-lingual Vocabulary Adaptation for Efficient Language Model Inference | Atsuki Yamaguchi, Aline Villavicencio, Nikolaos Aletras | 27 7 0 | 16 Feb 2024
How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models | Phillip Rust, Jonas Pfeiffer, Ivan Vulić, Sebastian Ruder, Iryna Gurevych | 80 235 0 | 31 Dec 2020