2305.15425
Cited By
Language Model Tokenizers Introduce Unfairness Between Languages
17 May 2023
Aleksandar Petrov
Emanuele La Malfa
Philip Torr
Adel Bibi
Papers citing
"Language Model Tokenizers Introduce Unfairness Between Languages"
50 / 78 papers shown
Title
Crosslingual Reasoning through Test-Time Scaling
Zheng-Xin Yong
Muhammad Farid Adilazuarda
Jonibek Mansurov
Ruochen Zhang
Niklas Muennighoff
Carsten Eickhoff
Genta Indra Winata
Julia Kreutzer
Stephen H. Bach
Alham Fikri Aji
LRM
ELM
178
0
0
08 May 2025
Llama-3-Nanda-10B-Chat: An Open Generative Large Language Model for Hindi
Monojit Choudhury
Shivam Chauhan
Rocktim Jyoti Das
Dhruv Sahnan
Xudong Han
...
Rituraj Joshi
Gurpreet Gosal
Avraham Sheinin
Natalia Vassilieva
Preslav Nakov
33
0
0
08 Apr 2025
Regional Tiny Stories: Using Small Models to Compare Language Learning and Tokenizer Performance
Nirvan Patil
Malhar Abhay Inamdar
Agnivo Gosai
Guruprasad Pathak
Anish Joshi
Aryan Sagavekar
Anish Joshirao
Raj Abhijit Dandekar
Rajat Dandekar
Sreedath Panat
46
0
0
07 Apr 2025
Context-Aware Toxicity Detection in Multiplayer Games: Integrating Domain-Adaptive Pretraining and Match Metadata
Adrien Schurger-Foy
Rafal Kocielnik
Caglar Gulcehre
R. Alvarez
45
0
0
02 Apr 2025
Large Language Models in Numberland: A Quick Test of Their Numerical Reasoning Abilities
Roussel Rahman
ReLM
ELM
LRM
46
0
0
31 Mar 2025
HyperFree: A Channel-adaptive and Tuning-free Foundation Model for Hyperspectral Remote Sensing Imagery
Jingtao Li
Y. Liu
Xinyu Wang
Yunning Peng
Chen Sun
...
Tian Ke
Xiao Jiang
Tangwei Lu
Anran Zhao
Yanfei Zhong
VLM
60
0
0
27 Mar 2025
Lost in Cultural Translation: Do LLMs Struggle with Math Across Cultural Contexts?
Aabid Karim
Abdul Karim
Bhoomika Lohana
Matt Keon
Jaswinder Singh
A. Sattar
52
0
0
23 Mar 2025
Adversarial Tokenization
Renato Lui Geh
Zilei Shao
Mathias Niepert
SILM
AAML
87
0
0
04 Mar 2025
Llama-3.1-Sherkala-8B-Chat: An Open Large Language Model for Kazakh
Fajri Koto
Rituraj Joshi
Nurdaulet Mukhituly
Yanjie Wang
Zhuohan Xie
...
Avraham Sheinin
Natalia Vassilieva
Neha Sengupta
Larry Murray
Preslav Nakov
ALM
KELM
43
0
0
03 Mar 2025
Tokenization is Sensitive to Language Variation
Anna Wegmann
Dong Nguyen
David Jurgens
84
1
0
24 Feb 2025
Do Multilingual LLMs Think In English?
Lisa Schut
Y. Gal
Sebastian Farquhar
44
3
0
24 Feb 2025
Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment
Somnath Banerjee
Sayan Layek
Pratyush Chatterjee
Animesh Mukherjee
Rima Hazra
LLMSV
76
0
0
16 Feb 2025
When LLMs Struggle: Reference-less Translation Evaluation for Low-resource Languages
Archchana Sindhujan
Diptesh Kanojia
Constantin Orasan
Shenbin Qian
38
2
0
08 Jan 2025
Efficient Continual Pre-training of LLMs for Low-resource Languages
Arijit Nag
Soumen Chakrabarti
Animesh Mukherjee
Niloy Ganguly
82
0
0
13 Dec 2024
Efficient Online Inference of Vision Transformers by Training-Free Tokenization
Leonidas Gee
Wing Yan Li
V. Sharmanska
Novi Quadrianto
ViT
93
0
0
23 Nov 2024
Software Performance Engineering for Foundation Model-Powered Software (FMware)
Haoxiang Zhang
Shi Chang
Arthur Leung
Kishanthan Thangarajah
Boyuan Chen
Hanan Lutfiyya
Ahmed E. Hassan
116
1
0
14 Nov 2024
MrT5: Dynamic Token Merging for Efficient Byte-level Language Models
Julie Kallini
Shikhar Murty
Christopher D. Manning
Christopher Potts
Róbert Csordás
40
2
0
28 Oct 2024
Ethics Whitepaper: Whitepaper on Ethical Research into Large Language Models
Eddie L. Ungless
Nikolas Vitsakis
Zeerak Talat
James Garforth
Bjorn Ross
Arno Onken
Atoosa Kasirzadeh
Alexandra Birch
33
1
0
17 Oct 2024
One Language, Many Gaps: Evaluating Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks
Fangru Lin
Shaoguang Mao
Emanuele La Malfa
Valentin Hofmann
Adrian de Wynter
Jing Yao
Si-Qing Chen
Michael Wooldridge
Furu Wei
51
2
0
14 Oct 2024
Cross-Modal Few-Shot Learning: a Generative Transfer Learning Framework
Zhengwei Yang
Yuke Li
Qiang Sun
Basura Fernando
Heng-Chiao Huang
Zheng Wang
32
1
0
14 Oct 2024
Adapters for Altering LLM Vocabularies: What Languages Benefit the Most?
HyoJung Han
Akiko Eriguchi
Haoran Xu
Hieu T. Hoang
Marine Carpuat
Huda Khayrallah
VLM
37
2
0
12 Oct 2024
From Tokens to Words: On the Inner Lexicon of LLMs
Guy Kaplan
Matanel Oren
Yuval Reif
Roy Schwartz
48
12
0
08 Oct 2024
Towards Linguistically-Aware and Language-Independent Tokenization for Large Language Models (LLMs)
Abrar Rahman
Garry Bowlin
Binit Mohanty
Sean McGunigal
26
0
0
04 Oct 2024
Egalitarian Language Representation in Language Models: It All Begins with Tokenizers
Menan Velayuthan
Kengatharaiyer Sarveswaran
40
5
0
17 Sep 2024
ExploreSelf: Fostering User-driven Exploration and Reflection on Personal Challenges with Adaptive Guidance by Large Language Models
Inhwa Song
SoHyun Park
Sachin R. Pendse
J. Schleider
Munmun De Choudhury
Young-Ho Kim
29
0
0
15 Sep 2024
Where is the signal in tokenization space?
Renato Lui Geh
Honghua Zhang
Kareem Ahmed
Benjie Wang
Mathias Niepert
30
4
0
16 Aug 2024
MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization
Orevaoghene Ahia
Sachin Kumar
Hila Gonen
Valentin Hofmann
Tomasz Limisiewicz
Yulia Tsvetkov
Noah A. Smith
51
4
0
11 Jul 2024
Adapting LLMs to Hebrew: Unveiling DictaLM 2.0 with Enhanced Vocabulary and Instruction Capabilities
Shaltiel Shmidman
Avi Shmidman
Amir DN Cohen
Moshe Koppel
40
2
0
09 Jul 2024
A Principled Framework for Evaluating on Typologically Diverse Languages
Esther Ploeger
Wessel Poelman
Andreas Holck Høeg-Petersen
Anders Schlichtkrull
Miryam de Lhoneux
Johannes Bjerva
36
1
0
06 Jul 2024
SignCLIP: Connecting Text and Sign Language by Contrastive Learning
Zifan Jiang
Gerard Sant
Amit Moryossef
Mathias Müller
Rico Sennrich
Sarah Ebling
VLM
CLIP
42
2
0
01 Jul 2024
Understanding and Mitigating Tokenization Bias in Language Models
Buu Phan
Marton Havasi
Matthew Muckley
Karen Ullrich
44
3
0
24 Jun 2024
Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters
Euiin Yi
Taehyeon Kim
Hongseok Jeung
Du-Seong Chang
Se-Young Yun
48
4
0
24 Jun 2024
Decoding the Diversity: A Review of the Indic AI Research Landscape
Sankalp KJ
Vinija Jain
S. Bhaduri
Tamoghna Roy
Aman Chadha
55
5
0
13 Jun 2024
Investigating the translation capabilities of Large Language Models trained on parallel data only
Javier García Gilabert
Carlos Escolano
Aleix Sant Savall
Francesca de Luca Fornaciari
Audrey Mash
Xixian Liao
Maite Melero
LRM
42
2
0
13 Jun 2024
Guiding In-Context Learning of LLMs through Quality Estimation for Machine Translation
Javad Pourmostafa Roshan Sharami
D. Shterionov
Pieter Spronck
28
0
0
12 Jun 2024
LINGOLY: A Benchmark of Olympiad-Level Linguistic Reasoning Puzzles in Low-Resource and Extinct Languages
Andrew M. Bean
Simi Hellsten
Harry Mayne
Jabez Magomere
Ethan A. Chi
Ryan A. Chi
Scott A. Hale
Hannah Rose Kirk
ELM
LRM
42
7
0
10 Jun 2024
How Multilingual Are Large Language Models Fine-Tuned for Translation?
Aquia Richburg
Marine Carpuat
LRM
41
4
0
30 May 2024
Risks and Opportunities of Open-Source Generative AI
Francisco Eiras
Aleksandar Petrov
Bertie Vidgen
Christian Schroeder
Fabio Pizzati
...
Matthew Jackson
Philip H. S. Torr
Trevor Darrell
Y. Lee
Jakob N. Foerster
48
18
0
14 May 2024
Zero-Shot Tokenizer Transfer
Benjamin Minixhofer
E. Ponti
Ivan Vulić
VLM
44
9
0
13 May 2024
Natural Language Processing RELIES on Linguistics
Juri Opitz
Shira Wein
Nathan Schneider
AI4CE
55
7
0
09 May 2024
Near to Mid-term Risks and Opportunities of Open-Source Generative AI
Francisco Eiras
Aleksandar Petrov
Bertie Vidgen
Christian Schroeder de Witt
Fabio Pizzati
...
Paul Röttger
Philip Torr
Trevor Darrell
Y. Lee
Jakob N. Foerster
46
6
0
25 Apr 2024
Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services
Jiachen Liu
Zhiyu Wu
Jae-Won Chung
Fan Lai
Myungjin Lee
Mosharaf Chowdhury
53
26
0
25 Apr 2024
SpaceByte: Towards Deleting Tokenization from Large Language Modeling
Kevin Slagle
37
3
0
22 Apr 2024
Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers
Libo Qin
Qiguang Chen
Yuhang Zhou
Zhi Chen
Hai-Tao Zheng
Lizi Liao
Min Li
Wanxiang Che
Philip S. Yu
LRM
55
36
0
07 Apr 2024
HyperCLOVA X Technical Report
Kang Min Yoo
Jaegeun Han
Sookyo In
Heewon Jeon
Jisu Jeong
...
Hyunkyung Noh
Se-Eun Choi
Sang-Woo Lee
Jung Hwa Lim
Nako Sung
VLM
37
8
0
02 Apr 2024
Poro 34B and the Blessing of Multilinguality
Risto Luukkonen
Jonathan Burdge
Elaine Zosa
Aarne Talman
Ville Komulainen
Vaino Hatanpaa
Peter Sarlin
S. Pyysalo
AI4CE
50
12
0
02 Apr 2024
BEnQA: A Question Answering and Reasoning Benchmark for Bengali and English
H. M. Q. H. Sheikh Shafayat
Rishav Hada
Isaac Cowhey
Rifki Afina
Jerry Tworek
Lorie De Leon
35
3
0
16 Mar 2024
Cost-Performance Optimization for Processing Low-Resource Language Tasks Using Commercial LLMs
Arijit Nag
Animesh Mukherjee
Niloy Ganguly
Soumen Chakrabarti
43
2
0
08 Mar 2024
Did Translation Models Get More Robust Without Anyone Even Noticing?
Ben Peters
André F. T. Martins
36
3
0
06 Mar 2024
Evaluating the Elementary Multilingual Capabilities of Large Language Models with MultiQ
Carolin Holtermann
Paul Röttger
Timm Dill
Anne Lauscher
ELM
LRM
40
22
0
06 Mar 2024