1804.10959
Cited By
Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates
29 April 2018
Taku Kudo
Papers citing
"Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates"
50 / 628 papers shown
Title
An Empirical Study on Cross-lingual Vocabulary Adaptation for Efficient Language Model Inference
Atsuki Yamaguchi
Aline Villavicencio
Nikolaos Aletras
70
11
0
16 Feb 2024
PRISE: LLM-Style Sequence Compression for Learning Temporal Action Abstractions in Control
Ruijie Zheng
Ching-An Cheng
Hal Daumé
Furong Huang
Andrey Kolobov
85
12
0
16 Feb 2024
Getting the most out of your tokenizer for pre-training and domain adaptation
Gautier Dagan
Gabriel Synnaeve
Baptiste Rozière
136
30
0
01 Feb 2024
CroissantLLM: A Truly Bilingual French-English Language Model
Manuel Faysse
Patrick Fernandes
Nuno M. Guerreiro
António Loison
Duarte M. Alves
...
François Yvon
André F.T. Martins
Gautier Viaud
Céline Hudelot
Pierre Colombo
165
37
0
01 Feb 2024
Byte Pair Encoding Is All You Need For Automatic Bengali Speech Recognition
Ahnaf Mozib Samin
57
0
0
28 Jan 2024
Importance-Aware Data Augmentation for Document-Level Neural Machine Translation
Ming-Ru Wu
Yufei Wang
George F. Foster
Zhuang Li
Gholamreza Haffari
98
7
0
27 Jan 2024
TURNA: A Turkish Encoder-Decoder Language Model for Enhanced Understanding and Generation
Gökçe Uludoğan
Zeynep Yirmibeşoğlu Balal
Furkan Akkurt
Melikşah Türker
Onur Gungor
S. Uskudarli
72
12
0
25 Jan 2024
Revisiting the Optimality of Word Lengths
Tiago Pimentel
Clara Meister
Ethan Gotlieb Wilcox
Kyle Mahowald
Ryan Cotterell
59
8
0
06 Dec 2023
On Significance of Subword tokenization for Low Resource and Efficient Named Entity Recognition: A case study in Marathi
Harsh Chaudhari
A. Patil
Dhanashree Lavekar
Pranav Khairnar
Raviraj Joshi
Sachin Pande
68
0
0
03 Dec 2023
ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model
Fukun Yin
Xin Chen
C. Zhang
Biao Jiang
Zibo Zhao
Jiayuan Fan
Gang Yu
Taihao Li
Tao Chen
124
23
0
29 Nov 2023
Improving Word Sense Disambiguation in Neural Machine Translation with Salient Document Context
Elijah Matthew Rippeth
Marine Carpuat
Kevin Duh
Matt Post
40
0
0
27 Nov 2023
PhayaThaiBERT: Enhancing a Pretrained Thai Language Model with Unassimilated Loanwords
Panyut Sriwirote
Jalinee Thapiang
Vasan Timtong
Attapol T. Rutherford
50
5
0
21 Nov 2023
Multi-teacher Distillation for Multilingual Spelling Correction
Jingfen Zhang
Xuan Guo
S. Bodapati
Christopher Potts
KELM
52
3
0
20 Nov 2023
The taste of IPA: Towards open-vocabulary keyword spotting and forced alignment in any language
Jian Zhu
Changbing Yang
Farhan Samir
Jahurul Islam
96
7
0
14 Nov 2023
On the Analysis of Cross-Lingual Prompt Tuning for Decoder-based Multilingual Model
Nohil Park
Joonsuk Park
Kang Min Yoo
Sungroh Yoon
81
3
0
14 Nov 2023
TransformCode: A Contrastive Learning Framework for Code Embedding via Subtree Transformation
Zixiang Xian
Rubing Huang
Dave Towey
Chunrong Fang
Zhenyu Chen
75
6
0
10 Nov 2023
Legal-HNet: Mixing Legal Long-Context Tokens with Hartley Transform
Daniele Giofré
Sneha Ghantasala
AILaw
70
0
0
09 Nov 2023
Mental Health Diagnosis in the Digital Age: Harnessing Sentiment Analysis on Social Media Platforms upon Ultra-Sparse Feature Content
Haijian Shao
Ming Zhu
Shengjie Zhai
AI4MH
32
3
0
09 Nov 2023
Loss Masking Is Not Needed in Decoder-only Transformer for Discrete-token-based ASR
Qian Chen
Wen Wang
Qinglin Zhang
Siqi Zheng
Shiliang Zhang
Chong Deng
Yukun Ma
Hai Yu
Jiaqing Liu
Chong Zhang
83
9
0
08 Nov 2023
Improving Korean NLP Tasks with Linguistically Informed Subword Tokenization and Sub-character Decomposition
Tae-Hee Jeon
Bongseok Yang
ChangHwan Kim
Yoonseob Lim
46
1
0
07 Nov 2023
Too Much Information: Keeping Training Simple for BabyLMs
Lukas Edman
Lisa Bylinina
74
4
0
03 Nov 2023
The Unreasonable Effectiveness of Random Target Embeddings for Continuous-Output Neural Machine Translation
Evgeniia Tokarchuk
Vlad Niculae
67
2
0
31 Oct 2023
MUST: A Multilingual Student-Teacher Learning approach for low-resource speech recognition
Muhammad Umar Farooq
Rehan Ahmad
Thomas Hain
55
0
0
29 Oct 2023
BabyStories: Can Reinforcement Learning Teach Baby Language Models to Write Better Stories?
Xingmeng Zhao
Tongnian Wang
Sheri Osborn
Anthony Rios
53
6
0
25 Oct 2023
Machine Translation for Nko: Tools, Corpora and Baseline Results
M. Doumbouya
Baba Mamadi Diané
Solo Farabado Cissé
Djibrila Diané
Abdoulaye Sow
...
Fodé Moriba Bayo
Ibrahima Sory 2. Condé
Kalo Mory Diané
Chris Piech
Christopher D. Manning
73
3
0
24 Oct 2023
Analyzing Cognitive Plausibility of Subword Tokenization
Lisa Beinborn
Yuval Pinter
63
20
0
20 Oct 2023
Character-level Chinese Backpack Language Models
Hao Sun
John Hewitt
61
0
0
19 Oct 2023
Document-Level Language Models for Machine Translation
Frithjof Petrick
Christian Herold
Pavel Petrushkov
Shahram Khadivi
Hermann Ney
56
10
0
18 Oct 2023
Learn Your Tokens: Word-Pooled Tokenization for Language Modeling
Avijit Thawani
Saurabh Ghanekar
Xiaoyuan Zhu
Jay Pujara
104
5
0
17 Oct 2023
Optimized Tokenization for Transcribed Error Correction
Tomer Wullach
Shlomo E. Chazan
76
0
0
16 Oct 2023
UvA-MT's Participation in the WMT23 General Translation Shared Task
Di Wu
Shaomu Tan
David Stap
Ali Araabi
Christof Monz
96
3
0
15 Oct 2023
Towards Example-Based NMT with Multi-Levenshtein Transformers
Maxime Bouthors
Josep Crego
François Yvon
101
4
0
13 Oct 2023
Tokenizer Choice For LLM Training: Negligible or Crucial?
Mehdi Ali
Michael Fromm
Klaudia Thellmann
Richard Rutmann
Max Lübbering
...
Malte Ostendorff
Samuel Weinbach
R. Sifa
Stefan Kesselheim
Nicolas Flores-Herr
116
61
0
12 Oct 2023
Task-Adaptive Tokenization: Enhancing Long-Form Text Generation Efficacy in Mental Health and Beyond
Siyang Liu
Naihao Deng
Sahand Sabour
Yilin Jia
Minlie Huang
Rada Mihalcea
88
23
0
09 Oct 2023
Generative Spoken Language Model based on continuous word-sized audio tokens
Robin Algayres
Yossi Adi
Tu Nguyen
Jade Copet
Gabriel Synnaeve
Benoît Sagot
Emmanuel Dupoux
AuLLM
119
16
0
08 Oct 2023
Module-wise Adaptive Distillation for Multimodality Foundation Models
Chen Liang
Jiahui Yu
Ming-Hsuan Yang
Matthew A. Brown
Huayu Chen
Tuo Zhao
Boqing Gong
Tianyi Zhou
104
10
0
06 Oct 2023
LibriSpeech-PC: Benchmark for Evaluation of Punctuation and Capitalization Capabilities of end-to-end ASR Models
Aleksandr Meister
Matvei Novikov
Nikolay Karpov
Evelina Bakhturina
Vitaly Lavrukhin
Boris Ginsburg
62
16
0
04 Oct 2023
CAT-LM: Training Language Models on Aligned Code And Tests
Nikitha Rao
Kush Jain
Uri Alon
Claire Le Goues
Vincent J. Hellendoorn
ALM
83
47
0
02 Oct 2023
Enhancing Representation Generalization in Authorship Identification
Haining Wang
64
0
0
30 Sep 2023
AV-CPL: Continuous Pseudo-Labeling for Audio-Visual Speech Recognition
Andrew Rouditchenko
R. Collobert
Tatiana Likhomanenko
VLM
88
3
0
29 Sep 2023
JCoLA: Japanese Corpus of Linguistic Acceptability
Taiga Someya
Yushi Sugimoto
Yohei Oseki
69
6
0
22 Sep 2023
Exploring the Impact of Training Data Distribution and Subword Tokenization on Gender Bias in Machine Translation
Bar Iluz
Tomasz Limisiewicz
Gabriel Stanovsky
David Mareček
110
4
0
21 Sep 2023
Long-Form End-to-End Speech Translation via Latent Alignment Segmentation
Peter Polák
Ondrej Bojar
65
3
0
20 Sep 2023
Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff
Peter Polák
Brian Yan
Shinji Watanabe
A. Waibel
Ondrej Bojar
45
9
0
20 Sep 2023
Language Modeling Is Compression
Grégoire Delétang
Anian Ruoss
Paul-Ambroise Duquenne
Elliot Catt
Tim Genewein
...
Wenliang Kevin Li
Matthew Aitchison
Laurent Orseau
Marcus Hutter
J. Veness
AI4CE
121
146
0
19 Sep 2023
CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages
Thuat Nguyen
Chien Van Nguyen
Viet Dac Lai
Hieu Man
Nghia Trung Ngo
Franck Dernoncourt
Ryan Rossi
Thien Huu Nguyen
109
112
0
17 Sep 2023
VoxtLM: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks
Soumi Maiti
Yifan Peng
Shukjae Choi
Jee-weon Jung
Xuankai Chang
Shinji Watanabe
VLM
AuLLM
125
69
0
14 Sep 2023
Subwords as Skills: Tokenization for Sparse-Reward Reinforcement Learning
David Yunis
Justin Jung
Falcon Z. Dai
Matthew R. Walter
OffRL
88
0
0
08 Sep 2023
Multilingual Text Representation
Fahim Faisal
51
0
0
02 Sep 2023
Baseline Defenses for Adversarial Attacks Against Aligned Language Models
Neel Jain
Avi Schwarzschild
Yuxin Wen
Gowthami Somepalli
John Kirchenbauer
Ping-yeh Chiang
Micah Goldblum
Aniruddha Saha
Jonas Geiping
Tom Goldstein
AAML
191
410
0
01 Sep 2023