SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

19 August 2018
Taku Kudo
John Richardson
ArXiv (abs) | PDF | HTML | GitHub (10925★)

Papers citing "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing"

50 / 1,950 papers shown
Efficient Online Inference of Vision Transformers by Training-Free Tokenization
Leonidas Gee
Wing Yan Li
V. Sharmanska
Novi Quadrianto
ViT
197
0
0
01 Jul 2025
NepaliGPT: A Generative Language Model for the Nepali Language
Shushanta Pudasaini
Aman Shakya
Siddhartha Shrestha
Sahil Bhatta
Sunil Thapa
Sushmita Palikhe
5
0
0
19 Jun 2025
End-to-End Speech Translation for Low-Resource Languages Using Weakly Labeled Data
Aishwarya Pothula
Bhavana Akkiraju
Srihari Bandarupalli
Charan D
Santosh Kesiraju
Anil Kumar Vuppala
14
0
0
19 Jun 2025
Thunder-Tok: Minimizing Tokens per Word in Tokenizing Korean Texts for Generative Language Models
Gyeongje Cho
Yeonkyoun So
Chanwoo Park
Sangmin Lee
Sungmok Jung
Jaejin Lee
VLM
27
0
0
18 Jun 2025
Adapting Whisper for Streaming Speech Recognition via Two-Pass Decoding
Haoran Zhou
Xingchen Song
Brendan Fahy
Qiaochu Song
Binbin Zhang
...
Denglin Jiang
Apurv Verma
Vinay Ramesh
Srivas Prasad
Michele M. Franceschini
18
0
0
13 Jun 2025
Large Language Models for Detection of Life-Threatening Texts
Thanh Thi Nguyen
Campbell Wilson
Janis Dalins
117
0
0
12 Jun 2025
TwinBreak: Jailbreaking LLM Security Alignments based on Twin Prompts
T. Krauß
Hamid Dashtbani
Alexandra Dmitrienko
17
0
0
09 Jun 2025
Towards Universal Offline Black-Box Optimization via Learning Language Model Embeddings
Rong-Xi Tan
Ming Chen
Ke Xue
Yao Wang
Yaoyuan Wang
Sheng Fu
Chao Qian
OffRL
15
0
0
08 Jun 2025
Automatic Correction of Writing Anomalies in Hausa Texts
Ahmad Mustapha Wali
Sergiu Nisioi
52
0
0
04 Jun 2025
Towards a Japanese Full-duplex Spoken Dialogue System
Atsumoto Ohashi
Shinya Iizuka
Jingjing Jiang
Ryuichiro Higashinaka
AuLLM
37
0
0
03 Jun 2025
Dictionaries to the Rescue: Cross-Lingual Vocabulary Transfer for Low-Resource Languages Using Bilingual Dictionaries
Haruki Sakajo
Yusuke Ide
Justin Vasselli
Yusuke Sakai
Yingtao Tian
Hidetaka Kamigaito
Taro Watanabe
44
0
0
02 Jun 2025
BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization
Sander Land
Catherine Arnett
28
0
0
30 May 2025
HiLDe: Intentional Code Generation via Human-in-the-Loop Decoding
Emmanuel Anaya Gonzalez
Raven Rothkopf
Sorin Lerner
Nadia Polikarpova
26
0
0
28 May 2025
FAMA: The First Large-Scale Open-Science Speech Foundation Model for English and Italian
Sara Papi
Marco Gaido
L. Bentivogli
Alessio Brutti
Mauro Cettolo
R. Gretter
M. Matassoni
Mohamed Nabih
Matteo Negri
33
0
0
28 May 2025
SEMMA: A Semantic Aware Knowledge Graph Foundation Model
Arvindh Arun
Sumit Kumar
M. Nayyeri
Bo Xiong
Ponnurangam Kumaraguru
Antonio Vergari
Steffen Staab
48
0
0
26 May 2025
Building a Functional Machine Translation Corpus for Kpelle
Kweku Andoh Yamoah
Jackson Weako
Emmanuel J. Dorley
59
0
0
24 May 2025
Towards Anonymous Neural Network Inference
Liao Peiyuan
37
0
0
23 May 2025
An Effective Training Framework for Light-Weight Automatic Speech Recognition Models
Abdul Hannan
Alessio Brutti
Shah Nawaz
Mubashir Noman
71
0
0
22 May 2025
Spontaneous Speech Variables for Evaluating LLMs Cognitive Plausibility
Sheng-Fu Wang
Laurent Prevot
Jou-an Chi
Ri-Sheng Huang
Shu-Kai Hsieh
LRM
65
0
0
22 May 2025
Comparative analysis of subword tokenization approaches for Indian languages
Sudhansu Bala Das
Samujjal Choudhury
T. K. Mishra
B. Patra
16
0
0
22 May 2025
Leveraging Unit Language Guidance to Advance Speech Modeling in Textless Speech-to-Speech Translation
Yuhao Zhang
Xiangnan Ma
Kaiqi Kou
Peizhuo Liu
Weiqiao Shan
Benyou Wang
Tong Xiao
Yuxin Huang
Zhengtao Yu
Jingbo Zhu
VLM
23
0
0
21 May 2025
Word Level Timestamp Generation for Automatic Speech Recognition and Translation
Ke Hu
Krishna Puvvada
Elena Rastorgueva
Zhiwen Chen
He Huang
Shuoyang Ding
Kunal Dhawan
Hainan Xu
Jagadeesh Balam
Boris Ginsburg
35
0
0
21 May 2025
FuxiMT: Sparsifying Large Language Models for Chinese-Centric Multilingual Machine Translation
Shaolin Zhu
Tianyu Dong
Bo Li
Deyi Xiong
MoE
91
0
0
20 May 2025
Scaling Low-Resource MT via Synthetic Data Generation with LLMs
Ona de Gibert
Joseph Attieh
Teemu Vahtola
Mikko Aulamo
Zihao Li
Raúl Vázquez
Tiancheng Hu
Jörg Tiedemann
SyDa
99
1
0
20 May 2025
Neural Morphological Tagging for Nguni Languages
Cael Marquard
Simbarashe Mawere
Francois Meyer
36
0
0
19 May 2025
FreeMesh: Boosting Mesh Generation with Coordinates Merging
Jian Liu
Haohan Weng
Biwen Lei
Xianghui Yang
Zibo Zhao
Zhuo Chen
Song Guo
Tao Han
Chunchao Guo
108
0
0
19 May 2025
WIND: Accelerated RNN-T Decoding with Windowed Inference for Non-blank Detection
Hainan Xu
Vladimir Bataev
Lilit Grigoryan
Boris Ginsburg
55
0
0
19 May 2025
CellCLIP -- Learning Perturbation Effects in Cell Painting via Text-Guided Contrastive Learning
Mingyu Lu
Ethan Weinberger
Chanwoo Kim
Su-In Lee
23
0
0
16 May 2025
TiSpell: A Semi-Masked Methodology for Tibetan Spelling Correction covering Multi-Level Error with Data Augmentation
Yutong Liu
Feng Xiao
Ziyue Zhang
Yongbin Yu
Cheng Huang
...
Thupten Tsering
Cheng Huang
Gadeng Luosang
Renzeng Duojie
Nyima Tashi
62
2
0
12 May 2025
GIF: Generative Inspiration for Face Recognition at Scale
Saeed Ebrahimi
Sahar Rahimi
Ali Dabouei
Srinjoy Das
Jeremy M. Dawson
Nasser M. Nasrabadi
CVBM
546
0
0
05 May 2025
Fast and Low-Cost Genomic Foundation Models via Outlier Removal
Haozheng Luo
Chenghao Qiu
Maojiang Su
Zhihan Zhou
Zoe Mehta
Guo Ye
Jerry Yao-Chieh Hu
Han Liu
AAML
107
1
0
01 May 2025
Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing
Piotr Piekos
Róbert Csordás
Jürgen Schmidhuber
MoE, VLM
268
2
0
01 May 2025
Improving Informally Romanized Language Identification
Adrian Benton
Alexander Gutkin
Christo Kirov
Brian Roark
70
0
0
30 Apr 2025
Modes of Sequence Models and Learning Coefficients
Zhongtian Chen
Daniel Murfet
128
1
0
25 Apr 2025
Tokenization Matters: Improving Zero-Shot NER for Indic Languages
Priyaranjan Pattnayak
Hitesh Laxmichand Patel
Amit Agarwal
84
4
0
23 Apr 2025
Compass-V2 Technical Report
Sophia Maria
MoE, LRM
113
0
0
22 Apr 2025
Kuwain 1.5B: An Arabic SLM via Language Injection
Khalil Hennara
Sara Chrouf
Mohamed Motaism Hamed
Zeina Aldallal
Omar Hadid
Safwan AlModhayan
92
2
0
21 Apr 2025
HYPEROFA: Expanding LLM Vocabulary to New Languages via Hypernetwork-Based Embedding Initialization
Enes Özeren
Yihong Liu
Hinrich Schütze
69
0
0
21 Apr 2025
Sparks of Science: Hypothesis Generation Using Structured Paper Data
Charles O'Neill
Tirthankar Ghosal
Roberta Răileanu
Mike Walmsley
Thang Bui
Kevin Schawinski
I. Ciucă
LRM
109
4
0
17 Apr 2025
EarthGPT-X: Enabling MLLMs to Flexibly and Comprehensively Understand Multi-Source Remote Sensing Imagery
Wei Zhang
Miaoxin Cai
Yaqian Ning
Tianze Zhang
Yin Zhuang
He Chen
Jun Li
Xuerui Mao
101
0
0
17 Apr 2025
Graph Network for Sign Language Tasks
Shiwei Gan
Yafeng Yin
Zhiwei Jiang
Hongkai Wen
Lei Xie
Sanglu Lu
SLR
155
0
0
16 Apr 2025
MorphTok: Morphologically Grounded Tokenization for Indian Languages
Maharaj Brahma
Ayush Maheshwari
A. Singh
D. Adiga
Smruti Bhate
Ganesh Ramakrishnan
Rohit Saluja
Maunendra Sankar Desarkar
118
0
0
14 Apr 2025
RNN-Transducer-based Losses for Speech Recognition on Noisy Targets
Vladimir Bataev
164
0
0
09 Apr 2025
TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling
Liang-Hsuan Tseng
Yi-Chang Chen
Kuan-Yi Lee
Da-shan Shiu
Hung-yi Lee
AuLLM
157
0
0
09 Apr 2025
GOLLuM: Gaussian Process Optimized LLMs -- Reframing LLM Finetuning through Bayesian Optimization
Bojana Ranković
P. Schwaller
BDL
483
1
0
08 Apr 2025
Learnable Multi-Scale Wavelet Transformer: A Novel Alternative to Self-Attention
Andrew Kiruluta
Priscilla Burity
Samantha Williams
72
4
0
08 Apr 2025
High-Resource Translation: Turning Abundance into Accessibility
Abhiram Reddy Yanampally
46
0
0
08 Apr 2025
JarvisIR: Elevating Autonomous Driving Perception with Intelligent Image Restoration
Yunlong Lin
Zixu Lin
Haoyu Chen
Panwang Pan
C. Li
Sixiang Chen
Yeying Jin
Wenbo Li
Xinghao Ding
107
2
0
05 Apr 2025
Enhancing Embedding Representation Stability in Recommendation Systems with Semantic ID
Carolina Zheng
Minhui Huang
Dmitrii Pedchenko
Kaushik Rangadurai
Shuaiqiang Wang
...
Yiping Han
Lin Yang
Hangjun Xu
Rong Jin
Shuang Yang
64
0
0
02 Apr 2025
Efficient Federated Learning Tiny Language Models for Mobile Network Feature Prediction
Daniel Becking
Ingo Friese
Karsten Müller
Thomas Buchholz
Mandy Galkow-Schneider
Wojciech Samek
D. Marpe
66
0
0
02 Apr 2025