1804.10959
Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates
29 April 2018
Taku Kudo
Papers citing
"Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates"
50 / 628 papers shown
Tokenization and Morphology in Multilingual Language Models: A Comparative Analysis of mT5 and ByT5
Thao Anh Dang
Limor Raviv
Lukas Galke
52
1
0
15 Oct 2024
Generation with Dynamic Vocabulary
Yanting Liu
Tao Ji
Changzhi Sun
Yuanbin Wu
Xiaoling Wang
79
1
0
11 Oct 2024
Data Processing for the OpenGPT-X Model Family
Nicolò Brandizzi
Hammam Abdelwahab
Anirban Bhowmick
Lennard Helmer
Benny Jörg Stein
...
Georg Rehm
Dennis Wegener
Nicolas Flores-Herr
Joachim Kohler
Johannes Leveling
VLM
138
2
0
11 Oct 2024
Self-Attention Mechanism in Multimodal Context for Banking Transaction Flow
Cyrile Delestre
Yoann Sola
34
0
0
10 Oct 2024
Inference over Unseen Entities, Relations and Literals on Knowledge Graphs
Caglar Demir
N'Dah Jean Kouagou
Arnab Sharma
Axel-Cyrille Ngonga Ngomo
53
0
0
09 Oct 2024
From Tokens to Words: On the Inner Lexicon of LLMs
Guy Kaplan
Matanel Oren
Yuval Reif
Roy Schwartz
107
14
0
08 Oct 2024
Beyond Correlation: Interpretable Evaluation of Machine Translation Metrics
Stefano Perrella
Lorenzo Proietti
Pere-Lluís Huguet Cabot
Edoardo Barba
Roberto Navigli
95
4
0
07 Oct 2024
Gradient Routing: Masking Gradients to Localize Computation in Neural Networks
Alex Cloud
Jacob Goldman-Wetzler
Evžen Wybitul
Joseph Miller
Alexander Matt Turner
65
3
0
06 Oct 2024
Towards Linguistically-Aware and Language-Independent Tokenization for Large Language Models (LLMs)
Abrar Rahman
Garry Bowlin
Binit Mohanty
Sean McGunigal
41
0
0
04 Oct 2024
Analyzing Byte-Pair Encoding on Monophonic and Polyphonic Symbolic Music: A Focus on Musical Phrase Segmentation
Dinh-Viet-Toan Le
Louis Bigo
Mikaela Keller
61
1
0
02 Oct 2024
MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages
Marco Gaido
Sara Papi
L. Bentivogli
Alessio Brutti
Mauro Cettolo
R. Gretter
M. Matassoni
Mohamed Nabih
Matteo Negri
85
6
0
01 Oct 2024
Alignment-Free Training for Transducer-based Multi-Talker ASR
Takafumi Moriya
Shota Horiguchi
Marc Delcroix
Ryo Masumura
Takanori Ashihara
Hiroshi Sato
Kohei Matsuura
Masato Mimura
92
4
0
30 Sep 2024
Exploring Language Model Generalization in Low-Resource Extractive QA
Saptarshi Sengupta
Wenpeng Yin
Preslav Nakov
Shreya Ghosh
Suhang Wang
96
1
0
27 Sep 2024
LangSAMP: Language-Script Aware Multilingual Pretraining
Yihong Liu
Haotian Ye
Chunlan Ma
Mingyang Wang
Hinrich Schütze
VLM
246
0
0
26 Sep 2024
How Transliterations Improve Crosslingual Alignment
Yihong Liu
Mingyang Wang
Amir Hossein Kargaran
Ayyoob Imani
Orgest Xhelili
Haotian Ye
Chunlan Ma
François Yvon
Hinrich Schütze
91
4
0
25 Sep 2024
Tokenization for Molecular Foundation Models
Alexius Wadell
Anoushka Bhutani
Venkatasubramanian Viswanathan
472
1
0
19 Sep 2024
Egalitarian Language Representation in Language Models: It All Begins with Tokenizers
Menan Velayuthan
Kengatharaiyer Sarveswaran
104
7
0
17 Sep 2024
SubRegWeigh: Effective and Efficient Annotation Weighing with Subword Regularization
Kohei Tsuji
Tatsuya Hiraoka
Yuchang Cheng
Tomoya Iwakura
76
1
0
10 Sep 2024
Longer is (Not Necessarily) Stronger: Punctuated Long-Sequence Training for Enhanced Speech Recognition and Translation
Nithin Rao Koluguri
Travis M. Bartley
Hainan Xu
Oleksii Hrinchuk
Jagadeesh Balam
Boris Ginsburg
Georg Kucsko
85
3
0
09 Sep 2024
BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training
Pavel Chizhov
Catherine Arnett
Elizaveta Korotkova
Ivan P. Yamshchikov
85
5
0
06 Sep 2024
Post-OCR Text Correction for Bulgarian Historical Documents
Angel Beshirov
Milena Dobreva
Dimitar Dimitrov
Momchil Hardalov
Ivan Koychev
Preslav Nakov
65
1
0
31 Aug 2024
Self-supervised Speech Representations Still Struggle with African American Vernacular English
Kalvin Chang
Yi-Hui Chou
Jiatong Shi
Hsuan-Ming Chen
Nicole Holliday
Odette Scharenborg
David R. Mortensen
76
3
0
26 Aug 2024
Distributional Properties of Subword Regularization
Marco Cognetta
Vilém Zouhar
Naoaki Okazaki
67
0
0
21 Aug 2024
Where is the signal in tokenization space?
Renato Lui Geh
Honghua Zhang
Kareem Ahmed
Benjie Wang
Guy Van den Broeck
74
7
0
16 Aug 2024
Generating Gender Alternatives in Machine Translation
Sarthak Garg
Mozhdeh Gheini
Clara Emmanuel
Tatiana Likhomanenko
Qin Gao
Matthias Paulik
67
4
0
29 Jul 2024
Sentiment Analysis of Lithuanian Online Reviews Using Large Language Models
Brigita Vileikytė
M. Lukoševičius
Lukas Stankevicius
89
1
0
29 Jul 2024
Towards scalable efficient on-device ASR with transfer learning
Laxmi Pandey
Ke Li
Jinxi Guo
Debjyoti Paul
Arthur Guo
Jay Mahadeokar
Xuedong Zhang
66
2
0
23 Jul 2024
Genomic Language Models: Opportunities and Challenges
Gonzalo Benegas
Chengzhong Ye
C. Albors
Jianan Canal Li
Yun S. Song
AI4CE
LM&MA
ELM
129
26
0
16 Jul 2024
MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization
Orevaoghene Ahia
Sachin Kumar
Hila Gonen
Valentin Hofmann
Tomasz Limisiewicz
Yulia Tsvetkov
Noah A. Smith
99
5
0
11 Jul 2024
Romanization Encoding For Multilingual ASR
Wen Ding
Fei Jia
Hainan Xu
Yu Xi
Junjie Lai
Boris Ginsburg
65
0
0
05 Jul 2024
Improving Self Consistency in LLMs through Probabilistic Tokenization
Ashutosh Sathe
Divyanshu Aggarwal
Sunayana Sitaram
110
5
0
04 Jul 2024
Single Character Perturbations Break LLM Alignment
Leon Lin
Hannah Brown
Kenji Kawaguchi
Michael Shieh
AAML
429
2
0
03 Jul 2024
A Case Study on Context-Aware Neural Machine Translation with Multi-Task Learning
Ramakrishna Appicharla
Baban Gain
Santanu Pal
Asif Ekbal
Pushpak Bhattacharyya
56
2
0
03 Jul 2024
xSemAD: Explainable Semantic Anomaly Detection in Event Logs Using Sequence-to-Sequence Models
Kiran Busch
T. Kampik
Henrik Leopold
27
3
0
28 Jun 2024
Large Vocabulary Size Improves Large Language Models
Sho Takase
Ryokan Ri
Shun Kiyono
Takuya Kato
133
4
0
24 Jun 2024
Unsupervised Morphological Tree Tokenizer
Qingyang Zhu
Xiang Hu
Pengyu Ji
Wei Wu
Kewei Tu
88
0
0
21 Jun 2024
Infusing clinical knowledge into tokenisers for language models
Abul Hasan
Jinge Wu
Quang Ngoc Nguyen
Salomé Andres
Imane Guellil
Huayu Zhang
Arlene Casey
Beatrice Alex
Bruce Guthrie
Honghan Wu
79
2
0
20 Jun 2024
Lexically Grounded Subword Segmentation
Jindřich Libovický
Jindřich Helcl
116
3
0
19 Jun 2024
Children's Speech Recognition through Discrete Token Enhancement
Vrunda N. Sukhadia
Shammur A. Chowdhury
84
1
0
19 Jun 2024
Tokenization Falling Short: The Curse of Tokenization
Yekun Chai
Yewei Fang
Qiwei Peng
Xuhong Li
74
0
0
17 Jun 2024
To be Continuous, or to be Discrete, Those are Bits of Questions
Yiran Wang
Masao Utiyama
80
4
0
12 Jun 2024
Exploring the Benefits of Tokenization of Discrete Acoustic Units
Avihu Dekel
Raul Fernandez
80
2
0
08 Jun 2024
Xmodel-LM Technical Report
Yichuan Wang
Yang Liu
Yu Yan
Qun Wang
Xucheng Huang
Ling Jiang
OSLM
ALM
57
1
0
05 Jun 2024
Multi-word Term Embeddings Improve Lexical Product Retrieval
Viktor Shcherbakov
Fedor Krasnov
47
0
0
03 Jun 2024
YODAS: Youtube-Oriented Dataset for Audio and Speech
Xinjian Li
Shinnosuke Takamichi
Takaaki Saeki
William Chen
Sayaka Shiota
Shinji Watanabe
139
27
0
02 Jun 2024
Tokenization Matters! Degrading Large Language Models through Challenging Their Tokenization
Dixuan Wang
Yanda Li
Junyuan Jiang
Zepeng Ding
Ziqin Luo
Guochao Jiang
Jiaqing Liang
Deqing Yang
121
16
0
27 May 2024
Large Language Model (LLM) for Telecommunications: A Comprehensive Survey on Principles, Key Techniques, and Opportunities
Hao Zhou
Chengming Hu
Ye Yuan
Yufei Cui
Yili Jin
...
Di Wu
Xue Liu
Charlie Zhang
Xianbin Wang
Jiangchuan Liu
113
79
0
17 May 2024
SBAAM! Eliminating Transcript Dependency in Automatic Subtitling
Marco Gaido
Sara Papi
Matteo Negri
Mauro Cettolo
L. Bentivogli
86
1
0
17 May 2024
TransMI: A Framework to Create Strong Baselines from Multilingual Pretrained Language Models for Transliterated Data
Yihong Liu
Chunlan Ma
Haotian Ye
Hinrich Schütze
63
4
0
16 May 2024
Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model
Wanting Xu
Yang Liu
Langping He
Xucheng Huang
Ling Jiang
VLM
MLLM
64
2
0
15 May 2024