Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates

29 April 2018
Taku Kudo
arXiv:1804.10959

Papers citing "Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates"

50 / 617 papers shown
LangSAMP: Language-Script Aware Multilingual Pretraining
Yihong Liu
Haotian Ye
Chunlan Ma
Mingyang Wang
Hinrich Schütze
VLM
31
0
0
26 Sep 2024
How Transliterations Improve Crosslingual Alignment
Yihong Liu
Mingyang Wang
Amir Hossein Kargaran
Ayyoob Imani
Orgest Xhelili
Haotian Ye
Chunlan Ma
François Yvon
Hinrich Schütze
42
2
0
25 Sep 2024
Smirk: An Atomically Complete Tokenizer for Molecular Foundation Models
Alexius Wadell
Anoushka Bhutani
Venkatasubramanian Viswanathan
174
0
0
19 Sep 2024
Egalitarian Language Representation in Language Models: It All Begins with Tokenizers
Menan Velayuthan
Kengatharaiyer Sarveswaran
40
5
0
17 Sep 2024
SubRegWeigh: Effective and Efficient Annotation Weighing with Subword Regularization
Kohei Tsuji
Tatsuya Hiraoka
Yuchang Cheng
Tomoya Iwakura
45
1
0
10 Sep 2024
Longer is (Not Necessarily) Stronger: Punctuated Long-Sequence Training for Enhanced Speech Recognition and Translation
Nithin Rao Koluguri
Travis M. Bartley
Hainan Xu
Oleksii Hrinchuk
Jagadeesh Balam
Boris Ginsburg
Georg Kucsko
41
3
0
09 Sep 2024
BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training
Pavel Chizhov
Catherine Arnett
Elizaveta Korotkova
Ivan P. Yamshchikov
48
2
0
06 Sep 2024
Post-OCR Text Correction for Bulgarian Historical Documents
Angel Beshirov
Milena Dobreva
Dimitar Dimitrov
Momchil Hardalov
Ivan Koychev
Preslav Nakov
42
1
0
31 Aug 2024
Self-supervised Speech Representations Still Struggle with African American Vernacular English
Kalvin Chang
Yi-Hui Chou
Jiatong Shi
Hsuan-Ming Chen
Nicole Holliday
Odette Scharenborg
David R. Mortensen
33
2
0
26 Aug 2024
Distributional Properties of Subword Regularization
Marco Cognetta
Vilém Zouhar
Naoaki Okazaki
37
0
0
21 Aug 2024
Where is the signal in tokenization space?
Renato Lui Geh
Honghua Zhang
Kareem Ahmed
Benjie Wang
Mathias Niepert
33
4
0
16 Aug 2024
Generating Gender Alternatives in Machine Translation
Sarthak Garg
Mozhdeh Gheini
Clara Emmanuel
Tatiana Likhomanenko
Qin Gao
Matthias Paulik
41
2
0
29 Jul 2024
Sentiment Analysis of Lithuanian Online Reviews Using Large Language Models
Brigita Vileikytė
M. Lukoševičius
Lukas Stankevicius
20
1
0
29 Jul 2024
Towards scalable efficient on-device ASR with transfer learning
Laxmi Pandey
Ke Li
Jinxi Guo
Debjyoti Paul
Arthur Guo
Jay Mahadeokar
Xuedong Zhang
36
2
0
23 Jul 2024
Genomic Language Models: Opportunities and Challenges
Gonzalo Benegas
Chengzhong Ye
C. Albors
Jianan Canal Li
Yun S. Song
AI4CE
LM&MA
ELM
50
18
0
16 Jul 2024
MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization
Orevaoghene Ahia
Sachin Kumar
Hila Gonen
Valentin Hofmann
Tomasz Limisiewicz
Yulia Tsvetkov
Noah A. Smith
51
4
0
11 Jul 2024
Romanization Encoding For Multilingual ASR
Wen Ding
Fei Jia
Hainan Xu
Yu Xi
Junjie Lai
Boris Ginsburg
31
0
0
05 Jul 2024
Improving Self Consistency in LLMs through Probabilistic Tokenization
Ashutosh Sathe
Divyanshu Aggarwal
Sunayana Sitaram
45
4
0
04 Jul 2024
Single Character Perturbations Break LLM Alignment
Leon Lin
Hannah Brown
Kenji Kawaguchi
Michael Shieh
AAML
188
2
0
03 Jul 2024
A Case Study on Context-Aware Neural Machine Translation with Multi-Task Learning
Ramakrishna Appicharla
Baban Gain
Santanu Pal
Asif Ekbal
Pushpak Bhattacharyya
23
1
0
03 Jul 2024
xSemAD: Explainable Semantic Anomaly Detection in Event Logs Using Sequence-to-Sequence Models
Kiran Busch
T. Kampik
Henrik Leopold
15
2
0
28 Jun 2024
Large Vocabulary Size Improves Large Language Models
Sho Takase
Ryokan Ri
Shun Kiyono
Takuya Kato
45
3
0
24 Jun 2024
Unsupervised Morphological Tree Tokenizer
Qingyang Zhu
Xiang Hu
Pengyu Ji
Wei Wu
Kewei Tu
39
0
0
21 Jun 2024
Infusing clinical knowledge into tokenisers for language models
Abul Hasan
Jinge Wu
Quang Ngoc Nguyen
Salomé Andres
Imane Guellil
Huayu Zhang
Arlene Casey
Beatrice Alex
Bruce Guthrie
Honghan Wu
46
1
0
20 Jun 2024
Lexically Grounded Subword Segmentation
Jindřich Libovický
Jindřich Helcl
43
1
0
19 Jun 2024
Children's Speech Recognition through Discrete Token Enhancement
Vrunda N. Sukhadia
Shammur A. Chowdhury
48
1
0
19 Jun 2024
Tokenization Falling Short: The Curse of Tokenization
Yekun Chai
Yewei Fang
Qiwei Peng
Xuhong Li
52
0
0
17 Jun 2024
To be Continuous, or to be Discrete, Those are Bits of Questions
Yiran Wang
Masao Utiyama
53
2
0
12 Jun 2024
Exploring the Benefits of Tokenization of Discrete Acoustic Units
Avihu Dekel
Raul Fernandez
49
2
0
08 Jun 2024
Xmodel-LM Technical Report
Yichuan Wang
Yang Liu
Yu Yan
Qun Wang
Xucheng Huang
Ling Jiang
OSLM
ALM
35
1
0
05 Jun 2024
Multi-word Term Embeddings Improve Lexical Product Retrieval
Viktor Shcherbakov
Fedor Krasnov
28
0
0
03 Jun 2024
YODAS: Youtube-Oriented Dataset for Audio and Speech
Xinjian Li
Shinnosuke Takamichi
Takaaki Saeki
William Chen
Sayaka Shiota
Shinji Watanabe
42
17
0
02 Jun 2024
Tokenization Matters! Degrading Large Language Models through Challenging Their Tokenization
Dixuan Wang
Yanda Li
Junyuan Jiang
Zepeng Ding
Ziqin Luo
Guochao Jiang
Jiaqing Liang
Deqing Yang
27
11
0
27 May 2024
Large Language Model (LLM) for Telecommunications: A Comprehensive Survey on Principles, Key Techniques, and Opportunities
Hao Zhou
Chengming Hu
Ye Yuan
Yufei Cui
Yili Jin
...
Di Wu
Xue Liu
Charlie Zhang
Xianbin Wang
Jiangchuan Liu
35
59
0
17 May 2024
SBAAM! Eliminating Transcript Dependency in Automatic Subtitling
Marco Gaido
Sara Papi
Matteo Negri
Mauro Cettolo
L. Bentivogli
43
1
0
17 May 2024
TransMI: A Framework to Create Strong Baselines from Multilingual Pretrained Language Models for Transliterated Data
Yihong Liu
Chunlan Ma
Haotian Ye
Hinrich Schütze
36
4
0
16 May 2024
Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model
Wanting Xu
Yang Liu
Langping He
Xucheng Huang
Ling Jiang
VLM
MLLM
43
2
0
15 May 2024
Zero-Shot Tokenizer Transfer
Benjamin Minixhofer
Edoardo Ponti
Ivan Vulić
VLM
44
9
0
13 May 2024
Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models
Sander Land
Max Bartolo
49
21
0
08 May 2024
Revisiting N-Gram Models: Their Impact in Modern Neural Networks for Handwritten Text Recognition
Solène Tarride
Christopher Kermorvant
37
1
0
30 Apr 2024
A cost minimization approach to fix the vocabulary size in a tokenizer for an End-to-End ASR system
Sunil Kumar Kopparapu
Ashish Panda
31
0
0
29 Apr 2024
Can Perplexity Predict Fine-Tuning Performance? An Investigation of Tokenization Effects on Sequential Language Models for Nepali
Nishant Luitel
Nirajan Bekoju
Anand Kumar Sah
Subarna Shakya
52
1
0
28 Apr 2024
Act as a Honeytoken Generator! An Investigation into Honeytoken Generation with Large Language Models
Daniel Reti
Norman Becker
Tillmann Angeli
Anasuya Chattopadhyay
Daniel Schneider
Sebastian Vollmer
Hans D. Schotten
40
5
0
24 Apr 2024
Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge
Khuyagbaatar Batsuren
Ekaterina Vylomova
Verna Dankers
Tsetsuukhei Delgerbaatar
Omri Uzan
Yuval Pinter
Gábor Bella
35
9
0
20 Apr 2024
Simultaneous Interpretation Corpus Construction by Large Language Models in Distant Language Pair
Yusuke Sakai
Mana Makinae
Hidetaka Kamigaito
Taro Watanabe
35
4
0
18 Apr 2024
On the Effect of (Near) Duplicate Subwords in Language Modelling
Anton Schäfer
Thomas Hofmann
Imanol Schlag
Tiago Pimentel
42
1
0
09 Apr 2024
Training LLMs over Neurally Compressed Text
Brian Lester
Jaehoon Lee
A. Alemi
Jeffrey Pennington
Adam Roberts
Jascha Narain Sohl-Dickstein
Noah Constant
40
6
0
04 Apr 2024
Dynamic Neural Control Flow Execution: An Agent-Based Deep Equilibrium Approach for Binary Vulnerability Detection
Litao Li
Steven H. H. Ding
Andrew Walenstein
P. Charland
Benjamin C. M. Fung
34
0
0
03 Apr 2024
Revisiting subword tokenization: A case study on affixal negation in large language models
Thinh Hung Truong
Yulia Otmakhova
Karin Verspoor
Trevor Cohn
Timothy Baldwin
47
2
0
03 Apr 2024
Forklift: An Extensible Neural Lifter
Jordi Armengol-Estapé
Rodrigo C. O. Rocha
Jackson Woodruff
Pasquale Minervini
Michael F. P. O'Boyle
35
0
0
01 Apr 2024