1804.10959
Cited By
Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates
29 April 2018
Taku Kudo
Papers citing
"Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates"
50 / 617 papers shown
Title
An Analysis of BPE Vocabulary Trimming in Neural Machine Translation
Marco Cognetta
Tatsuya Hiraoka
Naoaki Okazaki
Rico Sennrich
Yuval Pinter
29
2
0
30 Mar 2024
A Systematic Analysis of Subwords and Cross-Lingual Transfer in Multilingual Translation
Francois Meyer
Jan Buys
39
1
0
29 Mar 2024
AlloyBERT: Alloy Property Prediction with Large Language Models
Akshat Chaudhari
Chakradhar Guntuboina
Hongshuo Huang
A. Farimani
37
4
0
28 Mar 2024
Homogeneous Tokenizer Matters: Homogeneous Visual Tokenizer for Remote Sensing Image Understanding
Run Shao
Zhaoyang Zhang
Chao Tao
Yunsheng Zhang
Chengli Peng
Haifeng Li
VLM
43
5
0
27 Mar 2024
Can Language Beat Numerical Regression? Language-Based Multimodal Trajectory Prediction
Inhwan Bae
Junoh Lee
Hae-Gon Jeon
36
15
0
27 Mar 2024
Provably Secure Disambiguating Neural Linguistic Steganography
Yuang Qi
Kejiang Chen
Kai Zeng
Weiming Zhang
Neng H. Yu
21
2
0
26 Mar 2024
Cross-lingual Contextualized Phrase Retrieval
Huayang Li
Deng Cai
Zhi Qu
Qu Cui
Hidetaka Kamigaito
Lemao Liu
Taro Watanabe
34
0
0
25 Mar 2024
Synthetic Data Generation and Joint Learning for Robust Code-Mixed Translation
Kamal Kumar
Yinhan Liu
Parth Patwa
Tanmoy
Mihir Adam Roberts
27
1
0
25 Mar 2024
More than Just Statistical Recurrence: Human and Machine Unsupervised Learning of Māori Word Segmentation across Morphological Processes
A. Varatharaj
Simon Todd
14
0
0
21 Mar 2024
Exploring Tokenization Strategies and Vocabulary Sizes for Enhanced Arabic Language Models
M. Alrefaie
Nour Eldin Morsy
Nada Samir
25
6
0
17 Mar 2024
Using Contextual Information for Sentence-level Morpheme Segmentation
Prabin Bhandari
Abhishek Paudel
16
1
0
15 Mar 2024
Token Alignment via Character Matching for Subword Completion
Ben Athiwaratkun
Shiqi Wang
Mingyue Shang
Yuchen Tian
Zijian Wang
Sujan Kumar Gonugondla
Sanjay Krishna Gouda
Rob Kwiatowski
Ramesh Nallapati
Bing Xiang
50
4
0
13 Mar 2024
Triples-to-isiXhosa (T2X): Addressing the Challenges of Low-Resource Agglutinative Data-to-Text Generation
Francois Meyer
Jan Buys
29
2
0
12 Mar 2024
MAMMOTH: Massively Multilingual Modular Open Translation @ Helsinki
Timothee Mickus
Stig-Arne Gronroos
Joseph Attieh
M. Boggia
Ona de Gibert
Shaoxiong Ji
Niki Andreas Lopi
Alessandro Raganato
Raúl Vázquez
Jörg Tiedemann
20
4
0
12 Mar 2024
Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance
Omer Goldman
Avi Caciularu
Matan Eyal
Kris Cao
Idan Szpektor
Reut Tsarfaty
51
22
0
10 Mar 2024
Authorship Attribution in Bangla Literature (AABL) via Transfer Learning using ULMFiT
Aisha Khatun
Anisur Rahman
Md. Saiful Islam
Hemayet Ahmed Chowdhury
A. Tasnim
31
2
0
08 Mar 2024
Did Translation Models Get More Robust Without Anyone Even Noticing?
Ben Peters
André F. T. Martins
39
3
0
06 Mar 2024
A Generative Approach for Wikipedia-Scale Visual Entity Recognition
Mathilde Caron
Ahmet Iscen
Alireza Fathi
Cordelia Schmid
40
5
0
04 Mar 2024
Transformers for Low-Resource Languages: Is Féidir Linn!
Séamus Lankford
H. Alfi
Tamás Sarlós
40
17
0
04 Mar 2024
Language and Speech Technology for Central Kurdish Varieties
Sina Ahmadi
Daban Q. Jaff
Md Mahfuz Ibn Alam
Antonios Anastasopoulos
39
2
0
04 Mar 2024
adaptNMT: an open-source, language-agnostic development environment for Neural Machine Translation
Séamus Lankford
Haithem Afli
Andy Way
34
3
0
04 Mar 2024
Human Evaluation of English–Irish Transformer-Based NMT
Séamus Lankford
Haithem Afli
Andy Way
42
10
0
04 Mar 2024
VBART: The Turkish LLM
Meliksah Turker
Mehmet Erdi Ari
Aydin Han
VLM
36
4
0
02 Mar 2024
Greed is All You Need: An Evaluation of Tokenizer Inference Methods
Omri Uzan
Craig W. Schmidt
Chris Tanner
Yuval Pinter
43
14
0
02 Mar 2024
Rethinking Tokenization: Crafting Better Tokenizers for Large Language Models
Jinbiao Yang
LLMAG
105
11
0
01 Mar 2024
Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models
Frederik Kunstner
Robin Yadav
Alan Milligan
Mark Schmidt
Alberto Bietti
39
25
0
29 Feb 2024
Beyond Language Models: Byte Models are Digital World Simulators
Shangda Wu
Xu Tan
Zili Wang
Rui Wang
Xiaobing Li
Maosong Sun
35
12
0
29 Feb 2024
CEBin: A Cost-Effective Framework for Large-Scale Binary Code Similarity Detection
Hao Wang
Zeyu Gao
Chao Zhang
Mingyang Sun
Yuchen Zhou
Han Qiu
Xiangwei Xiao
39
9
0
29 Feb 2024
Tokenization Is More Than Compression
Craig W. Schmidt
Varshini Reddy
Haoran Zhang
Alec Alameddine
Omri Uzan
Yuval Pinter
Chris Tanner
61
28
0
28 Feb 2024
Natural Language Processing Methods for Symbolic Music Generation and Information Retrieval: a Survey
Dinh-Viet-Toan Le
Louis Bigo
Mikaela Keller
Dorien Herremans
MedIm
32
9
0
27 Feb 2024
CLAP: Learning Transferable Binary Code Representations with Natural Language Supervision
Hao Wang
Zeyu Gao
Chao Zhang
Zihan Sha
Mingyang Sun
Yuchen Zhou
Wenyu Zhu
Wenju Sun
Han Qiu
Xiangwei Xiao
38
17
0
26 Feb 2024
How Important Is Tokenization in French Medical Masked Language Models?
Yanis Labrak
Adrien Bazoge
B. Daille
Mickael Rouvier
Richard Dufour
41
1
0
22 Feb 2024
Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs
Aaditya K. Singh
DJ Strouse
43
46
0
22 Feb 2024
The Impact of Word Splitting on the Semantic Content of Contextualized Word Representations
Aina Garí Soler
Matthieu Labeau
Chloé Clavel
VLM
42
2
0
22 Feb 2024
Two Counterexamples to Tokenization and the Noiseless Channel
Marco Cognetta
Vilém Zouhar
Sangwhan Moon
Naoaki Okazaki
27
0
0
22 Feb 2024
Subobject-level Image Tokenization
Delong Chen
Samuel Cahyawijaya
Jianfeng Liu
Baoyuan Wang
Pascale Fung
VLM
OCL
54
7
0
22 Feb 2024
Investigating Multilingual Instruction-Tuning: Do Polyglot Models Demand for Multilingual Instructions?
Alexander Arno Weber
Klaudia Thellmann
Jan Ebert
Nicolas Flores-Herr
Jens Lehmann
Michael Fromm
Mehdi Ali
38
4
0
21 Feb 2024
An Empirical Study on Cross-lingual Vocabulary Adaptation for Efficient Language Model Inference
Atsuki Yamaguchi
Aline Villavicencio
Nikolaos Aletras
27
7
0
16 Feb 2024
PRISE: LLM-Style Sequence Compression for Learning Temporal Action Abstractions in Control
Ruijie Zheng
Ching-An Cheng
Hal Daumé
Furong Huang
Andrey Kolobov
33
9
0
16 Feb 2024
Getting the most out of your tokenizer for pre-training and domain adaptation
Gautier Dagan
Gabriele Synnaeve
Baptiste Rozière
34
20
0
01 Feb 2024
CroissantLLM: A Truly Bilingual French-English Language Model
Manuel Faysse
Patrick Fernandes
Nuno M. Guerreiro
António Loison
Duarte M. Alves
...
François Yvon
André F.T. Martins
Gautier Viaud
Céline Hudelot
Pierre Colombo
55
32
0
01 Feb 2024
Byte Pair Encoding Is All You Need For Automatic Bengali Speech Recognition
Ahnaf Mozib Samin
20
0
0
28 Jan 2024
Importance-Aware Data Augmentation for Document-Level Neural Machine Translation
Ming-Ru Wu
Yufei Wang
George F. Foster
Lizhen Qu
Gholamreza Haffari
43
6
0
27 Jan 2024
TURNA: A Turkish Encoder-Decoder Language Model for Enhanced Understanding and Generation
Gökçe Uludoğan
Zeynep Yirmibeşoğlu Balal
Furkan Akkurt
Melikşah Türker
Onur Gungor
S. Uskudarli
39
12
0
25 Jan 2024
Revisiting the Optimality of Word Lengths
Tiago Pimentel
Clara Meister
Ethan Gotlieb Wilcox
Kyle Mahowald
Ryan Cotterell
35
7
0
06 Dec 2023
On Significance of Subword tokenization for Low Resource and Efficient Named Entity Recognition: A case study in Marathi
Harsh Chaudhari
A. Patil
Dhanashree Lavekar
Pranav Khairnar
Raviraj Joshi
Sachin Pande
44
0
0
03 Dec 2023
ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model
Fukun Yin
Xin Chen
C. Zhang
Biao Jiang
Zibo Zhao
Jiayuan Fan
Gang Yu
Taihao Li
Tao Chen
32
20
0
29 Nov 2023
Improving Word Sense Disambiguation in Neural Machine Translation with Salient Document Context
Elijah Matthew Rippeth
Marine Carpuat
Kevin Duh
Matt Post
18
0
0
27 Nov 2023
PhayaThaiBERT: Enhancing a Pretrained Thai Language Model with Unassimilated Loanwords
Panyut Sriwirote
Jalinee Thapiang
Vasan Timtong
Attapol T. Rutherford
16
5
0
21 Nov 2023
Multi-teacher Distillation for Multilingual Spelling Correction
Jingfen Zhang
Xuan Guo
S. Bodapati
Christopher Potts
KELM
27
3
0
20 Nov 2023