ResearchTrend.AI
Language Model Tokenizers Introduce Unfairness Between Languages

17 May 2023
Aleksandar Petrov
Emanuele La Malfa
Philip Torr
Adel Bibi

Papers citing "Language Model Tokenizers Introduce Unfairness Between Languages"

28 / 78 papers shown
A Bit of a Problem: Measurement Disparities in Dataset Sizes Across Languages
Catherine Arnett
Tyler A. Chang
Benjamin Bergen
29
3
0
01 Mar 2024
On the Challenges and Opportunities in Generative AI
Laura Manduchi
Kushagra Pandey
Robert Bamler
Ryan Cotterell
Sina Daubener
...
F. Wenzel
Frank Wood
Stephan Mandt
Vincent Fortuin
56
17
0
28 Feb 2024
Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models
Seungduk Kim
Seungtaek Choi
Myeongho Jeong
38
6
0
22 Feb 2024
Investigating Multilingual Instruction-Tuning: Do Polyglot Models Demand for Multilingual Instructions?
Alexander Arno Weber
Klaudia Thellmann
Jan Ebert
Nicolas Flores-Herr
Jens Lehmann
Michael Fromm
Mehdi Ali
38
4
0
21 Feb 2024
An Empirical Study on Cross-lingual Vocabulary Adaptation for Efficient Language Model Inference
Atsuki Yamaguchi
Aline Villavicencio
Nikolaos Aletras
27
7
0
16 Feb 2024
Multi-word Tokenization for Sequence Compression
Leonidas Gee
Leonardo Rigutini
Marco Ernandes
Andrea Zugarini
18
8
0
15 Feb 2024
Secret Collusion among Generative AI Agents: Multi-Agent Deception via Steganography
S. Motwani
Mikhail Baranchuk
Martin Strohmeier
Vijay Bolina
Philip Torr
Lewis Hammond
Christian Schroeder de Witt
48
4
0
12 Feb 2024
The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI
Shayne Longpre
Robert Mahari
Anthony Chen
Naana Obeng-Marnu
Damien Sileo
...
K. Bollacker
Tongshuang Wu
Luis Villa
Sandy Pentland
Sara Hooker
20
56
0
25 Oct 2023
Establishing Vocabulary Tests as a Benchmark for Evaluating Large Language Models
Gonzalo Martínez
Javier Conde
Elena Merino-Gómez
Beatriz Bermúdez-Margaretto
José Alberto Hernández
Pedro Reviriego
Marc Brysbaert
ELM
25
1
0
23 Oct 2023
A Comprehensive Evaluation of Large Language Models on Legal Judgment Prediction
Ruihao Shui
Yixin Cao
Xiang Wang
Tat-Seng Chua
ELM
AILaw
11
19
0
18 Oct 2023
Core Building Blocks: Next Gen Geo Spatial GPT Application
Ashley Fernandez
Swaraj Dube
24
5
0
17 Oct 2023
Tokenizer Choice For LLM Training: Negligible or Crucial?
Mehdi Ali
Michael Fromm
Klaudia Thellmann
Richard Rutmann
Max Lübbering
...
Malte Ostendorff
Samuel Weinbach
R. Sifa
Stefan Kesselheim
Nicolas Flores-Herr
23
47
0
12 Oct 2023
A Benchmark for Learning to Translate a New Language from One Grammar Book
Garrett Tanzer
Mirac Suzgun
Chenguang Xi
Dan Jurafsky
Luke Melas-Kyriazi
24
51
0
28 Sep 2023
Language Models as a Service: Overview of a New Paradigm and its Challenges
Emanuele La Malfa
Aleksandar Petrov
Simon Frieder
Christoph Weinhuber
Ryan Burnell
Raza Nazar
Anthony Cohn
Nigel Shadbolt
Michael Wooldridge
ALM
ELM
35
3
0
28 Sep 2023
GlotScript: A Resource and Tool for Low Resource Writing System Identification
Amir Hossein Kargaran
François Yvon
Hinrich Schütze
13
10
0
23 Sep 2023
ChaCha: Leveraging Large Language Models to Prompt Children to Share Their Emotions about Personal Events
Woosuk Seo
Chanmo Yang
Young-Ho Kim
10
39
0
21 Sep 2023
Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models
Neha Sengupta
Sunil Kumar Sahu
Bokang Jia
Satheesh Katipomu
Haonan Li
...
A. Jackson
Hector Xuguang Ren
Preslav Nakov
Timothy Baldwin
Eric P. Xing
LRM
29
40
0
30 Aug 2023
Sparks of Large Audio Models: A Survey and Outlook
S. Latif
Moazzam Shoukat
Fahad Shamshad
Muhammad Usama
Yi Ren
...
Wenwu Wang
Xulong Zhang
Roberto Togneri
Min Zhang
Björn W. Schuller
LM&MA
AuLLM
33
38
0
24 Aug 2023
Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models
Orevaoghene Ahia
Sachin Kumar
Hila Gonen
Jungo Kasai
David R. Mortensen
Noah A. Smith
Yulia Tsvetkov
51
82
0
23 May 2023
On The Computational Complexity of Self-Attention
Feyza Duman Keles
Pruthuvi Maheshakya Wijewardena
C. Hegde
73
109
0
11 Sep 2022
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
339
12,003
0
04 Mar 2022
Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection
Suchin Gururangan
Dallas Card
Sarah K. Drier
E. K. Gade
Leroy Z. Wang
Zeyu Wang
Luke Zettlemoyer
Noah A. Smith
175
73
0
25 Jan 2022
BBQ: A Hand-Built Bias Benchmark for Question Answering
Alicia Parrish
Angelica Chen
Nikita Nangia
Vishakh Padmakumar
Jason Phang
Jana Thompson
Phu Mon Htut
Sam Bowman
223
374
0
15 Oct 2021
On Language Models for Creoles
Heather Lent
Emanuele Bugliarello
Miryam de Lhoneux
Chen Qiu
Anders Søgaard
39
20
0
13 Sep 2021
Intersectional Bias in Causal Language Models
Liam Magee
Lida Ghahremanlou
K. Soldatić
S. Robertson
191
31
0
16 Jul 2021
How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models
Phillip Rust
Jonas Pfeiffer
Ivan Vulić
Sebastian Ruder
Iryna Gurevych
80
235
0
31 Dec 2020
Improving Multilingual Models with Language-Clustered Vocabularies
Hyung Won Chung
Dan Garrette
Kiat Chuan Tan
Jason Riesa
VLM
77
65
0
24 Oct 2020
PhoBERT: Pre-trained language models for Vietnamese
Dat Quoc Nguyen
A. Nguyen
174
341
0
02 Mar 2020