1804.10959
Cited By
Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates
29 April 2018
Taku Kudo
Papers citing
"Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates"
50 / 617 papers shown
Title
The taste of IPA: Towards open-vocabulary keyword spotting and forced alignment in any language
Jian Zhu
Changbing Yang
Farhan Samir
Jahurul Islam
32
4
0
14 Nov 2023
On the Analysis of Cross-Lingual Prompt Tuning for Decoder-based Multilingual Model
Nohil Park
Joonsuk Park
Kang Min Yoo
Sungroh Yoon
36
3
0
14 Nov 2023
TransformCode: A Contrastive Learning Framework for Code Embedding via Subtree Transformation
Zixiang Xian
Rubing Huang
Dave Towey
Chunrong Fang
Zhenyu Chen
25
5
0
10 Nov 2023
Legal-HNet: Mixing Legal Long-Context Tokens with Hartley Transform
Daniele Giofré
Sneha Ghantasala
AILaw
29
0
0
09 Nov 2023
Mental Health Diagnosis in the Digital Age: Harnessing Sentiment Analysis on Social Media Platforms upon Ultra-Sparse Feature Content
Haijian Shao
Ming Zhu
Shengjie Zhai
AI4MH
14
2
0
09 Nov 2023
Loss Masking Is Not Needed in Decoder-only Transformer for Discrete-token-based ASR
Qian Chen
Wen Wang
Qinglin Zhang
Siqi Zheng
Shiliang Zhang
Chong Deng
Yukun Ma
Hai Yu
Jiaqing Liu
Chong Zhang
21
8
0
08 Nov 2023
Improving Korean NLP Tasks with Linguistically Informed Subword Tokenization and Sub-character Decomposition
Tae-Hee Jeon
Bongseok Yang
ChangHwan Kim
Yoonseob Lim
19
0
0
07 Nov 2023
Too Much Information: Keeping Training Simple for BabyLMs
Lukas Edman
Lisa Bylinina
32
4
0
03 Nov 2023
The Unreasonable Effectiveness of Random Target Embeddings for Continuous-Output Neural Machine Translation
Evgeniia Tokarchuk
Vlad Niculae
27
2
0
31 Oct 2023
MUST: A Multilingual Student-Teacher Learning approach for low-resource speech recognition
Muhammad Umar Farooq
Rehan Ahmad
Thomas Hain
25
0
0
29 Oct 2023
BabyStories: Can Reinforcement Learning Teach Baby Language Models to Write Better Stories?
Xingmeng Zhao
Tongnian Wang
Sheri Osborn
Anthony Rios
15
4
0
25 Oct 2023
Machine Translation for Nko: Tools, Corpora and Baseline Results
M. Doumbouya
Baba Mamadi Diané
Solo Farabado Cissé
Djibrila Diané
Abdoulaye Sow
...
Fodé Moriba Bayo
Ibrahima Sory 2. Condé
Kalo Mory Diané
Chris Piech
Christopher D. Manning
41
3
0
24 Oct 2023
Analyzing Cognitive Plausibility of Subword Tokenization
Lisa Beinborn
Yuval Pinter
29
17
0
20 Oct 2023
Character-level Chinese Backpack Language Models
Hao Sun
John Hewitt
27
0
0
19 Oct 2023
Document-Level Language Models for Machine Translation
Frithjof Petrick
Christian Herold
Pavel Petrushkov
Shahram Khadivi
Hermann Ney
26
9
0
18 Oct 2023
Learn Your Tokens: Word-Pooled Tokenization for Language Modeling
Avijit Thawani
Saurabh Ghanekar
Xiaoyuan Zhu
Jay Pujara
38
4
0
17 Oct 2023
Optimized Tokenization for Transcribed Error Correction
Tomer Wullach
Shlomo E. Chazan
32
0
0
16 Oct 2023
UvA-MT's Participation in the WMT23 General Translation Shared Task
Di Wu
Shaomu Tan
David Stap
Ali Araabi
Christof Monz
32
3
0
15 Oct 2023
Towards Example-Based NMT with Multi-Levenshtein Transformers
Maxime Bouthors
Josep Crego
François Yvon
24
4
0
13 Oct 2023
Tokenizer Choice For LLM Training: Negligible or Crucial?
Mehdi Ali
Michael Fromm
Klaudia Thellmann
Richard Rutmann
Max Lübbering
...
Malte Ostendorff
Samuel Weinbach
R. Sifa
Stefan Kesselheim
Nicolas Flores-Herr
23
47
0
12 Oct 2023
Task-Adaptive Tokenization: Enhancing Long-Form Text Generation Efficacy in Mental Health and Beyond
Siyang Liu
Naihao Deng
Sahand Sabour
Yilin Jia
Minlie Huang
Rada Mihalcea
35
18
0
09 Oct 2023
Generative Spoken Language Model based on continuous word-sized audio tokens
Robin Algayres
Yossi Adi
Tu Nguyen
Jade Copet
Gabriel Synnaeve
Benoît Sagot
Emmanuel Dupoux
AuLLM
43
12
0
08 Oct 2023
Module-wise Adaptive Distillation for Multimodality Foundation Models
Chen Liang
Jiahui Yu
Ming-Hsuan Yang
Matthew A. Brown
Huayu Chen
Tuo Zhao
Boqing Gong
Tianyi Zhou
19
10
0
06 Oct 2023
LibriSpeech-PC: Benchmark for Evaluation of Punctuation and Capitalization Capabilities of end-to-end ASR Models
Aleksandr Meister
Matvei Novikov
Nikolay Karpov
Evelina Bakhturina
Vitaly Lavrukhin
Boris Ginsburg
22
12
0
04 Oct 2023
CAT-LM: Training Language Models on Aligned Code And Tests
Nikitha Rao
Kush Jain
Uri Alon
Claire Le Goues
Vincent J. Hellendoorn
ALM
42
42
0
02 Oct 2023
Enhancing Representation Generalization in Authorship Identification
Haining Wang
10
0
0
30 Sep 2023
AV-CPL: Continuous Pseudo-Labeling for Audio-Visual Speech Recognition
Andrew Rouditchenko
R. Collobert
Tatiana Likhomanenko
VLM
27
3
0
29 Sep 2023
JCoLA: Japanese Corpus of Linguistic Acceptability
Taiga Someya
Yushi Sugimoto
Yohei Oseki
32
5
0
22 Sep 2023
Exploring the Impact of Training Data Distribution and Subword Tokenization on Gender Bias in Machine Translation
Bar Iluz
Tomasz Limisiewicz
Gabriel Stanovsky
David Mareček
32
3
0
21 Sep 2023
Long-Form End-to-End Speech Translation via Latent Alignment Segmentation
Peter Polák
Ondrej Bojar
46
3
0
20 Sep 2023
Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff
Peter Polák
Brian Yan
Shinji Watanabe
A. Waibel
Ondrej Bojar
28
9
0
20 Sep 2023
Language Modeling Is Compression
Grégoire Delétang
Anian Ruoss
Paul-Ambroise Duquenne
Elliot Catt
Tim Genewein
...
Wenliang Kevin Li
Matthew Aitchison
Laurent Orseau
Marcus Hutter
J. Veness
AI4CE
48
131
0
19 Sep 2023
CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages
Thuat Nguyen
Chien Van Nguyen
Viet Dac Lai
Hieu Man
Nghia Trung Ngo
Franck Dernoncourt
Ryan A. Rossi
Thien Huu Nguyen
45
97
0
17 Sep 2023
VoxtLM: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks
Soumi Maiti
Yifan Peng
Shukjae Choi
Jee-weon Jung
Xuankai Chang
Shinji Watanabe
VLM
AuLLM
29
57
0
14 Sep 2023
Subwords as Skills: Tokenization for Sparse-Reward Reinforcement Learning
David Yunis
Justin Jung
Falcon Z. Dai
Matthew R. Walter
OffRL
47
0
0
08 Sep 2023
Multilingual Text Representation
Fahim Faisal
27
0
0
02 Sep 2023
Baseline Defenses for Adversarial Attacks Against Aligned Language Models
Neel Jain
Avi Schwarzschild
Yuxin Wen
Gowthami Somepalli
John Kirchenbauer
Ping-yeh Chiang
Micah Goldblum
Aniruddha Saha
Jonas Geiping
Tom Goldstein
AAML
60
340
0
01 Sep 2023
Construction Grammar and Language Models
Harish Tayyar Madabushi
Laurence Romain
P. Milin
Dagmar Divjak
29
5
0
25 Aug 2023
Utilizing Semantic Textual Similarity for Clinical Survey Data Feature Selection
Benjamin C. Warner
Ziqi Xu
S. Haroutounian
Thomas Kannampallil
Chenyang Lu
22
2
0
19 Aug 2023
Reinforced Self-Training (ReST) for Language Modeling
Çağlar Gülçehre
T. Paine
S. Srinivasan
Ksenia Konyushkova
L. Weerts
...
Chenjie Gu
Wolfgang Macherey
Arnaud Doucet
Orhan Firat
Nando de Freitas
OffRL
66
278
0
17 Aug 2023
AKVSR: Audio Knowledge Empowered Visual Speech Recognition by Compressing Audio Knowledge of a Pretrained Model
Jeong Hun Yeo
Minsu Kim
J. Choi
Dae Hoe Kim
Y. Ro
26
18
0
15 Aug 2023
SOTASTREAM: A Streaming Approach to Machine Translation Training
Matt Post
Thamme Gowda
Roman Grundkiewicz
Huda Khayrallah
Rohit Jain
Marcin Junczys-Dowmunt
27
5
0
14 Aug 2023
N-gram Boosting: Improving Contextual Biasing with Normalized N-gram Targets
Wang Yau Li
Shreekantha Nadig
K. Chang
Zafarullah Mahmood
Riqiang Wang
Simon Vandieken
Jonas Robertson
Frederic Mailhot
22
0
0
04 Aug 2023
Many-to-Many Spoken Language Translation via Unified Speech and Text Representation Learning with Unit-to-Unit Translation
Minsu Kim
J. Choi
Dahun Kim
Y. Ro
42
10
0
03 Aug 2023
CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code
Nadezhda Chirkova
Sergey Troshin
21
8
0
01 Aug 2023
SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation
Haiyue Song
Raj Dabre
Chenhui Chu
Sadao Kurohashi
Eiichiro Sumita
21
3
0
31 Jul 2023
Jina Embeddings: A Novel Set of High-Performance Sentence Embedding Models
Michael Gunther
Louis Milliken
Jonathan Geuter
Georgios Mastrapas
Bo Wang
Han Xiao
RALM
50
30
0
20 Jul 2023
MorphPiece: A Linguistic Tokenizer for Large Language Models
Jeffrey Hsu
32
3
0
14 Jul 2023
A Comprehensive Overview of Large Language Models
Humza Naveed
Asad Ullah Khan
Shi Qiu
Muhammad Saqib
Saeed Anwar
Muhammad Usman
Naveed Akhtar
Nick Barnes
Ajmal Mian
OffRL
70
529
0
12 Jul 2023
Testing the Predictions of Surprisal Theory in 11 Languages
Ethan Gotlieb Wilcox
Tiago Pimentel
Clara Meister
Ryan Cotterell
R. Levy
LRM
52
63
0
07 Jul 2023