ResearchTrend.AI

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

19 August 2018
Taku Kudo
John Richardson
ArXiv (abs) | PDF | HTML | GitHub (10925★)

Papers citing "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing"

50 / 1,950 papers shown
Gold Panning in Vocabulary: An Adaptive Method for Vocabulary Expansion of Domain-Specific LLMs
Chengyuan Liu
Shihang Wang
Lizhi Qing
Kun Kuang
Yangyang Kang
Changlong Sun
Fei Wu
56
3
0
02 Oct 2024
FedPT: Federated Proxy-Tuning of Large Language Models on Resource-Constrained Edge Devices
Zhidong Gao
Yu Zhang
Zhenxiao Zhang
Yanmin Gong
Yuanxiong Guo
62
1
0
01 Oct 2024
Enhancing High-order Interaction Awareness in LLM-based Recommender Model
Xinfeng Wang
Jin Cui
Fumiyo Fukumoto
Yoshimi Suzuki
61
4
0
30 Sep 2024
Universal Medical Image Representation Learning with Compositional Decoders
Kaini Wang
Ling Yang
Siping Zhou
Guangquan Zhou
Wentao Zhang
Bin Cui
Shuo Li
SSLMedIm
80
0
0
30 Sep 2024
AfriHuBERT: A self-supervised speech representation model for African languages
Jesujoba Oluwadara Alabi
Xuechen Liu
Dietrich Klakow
Junichi Yamagishi
VLM
84
3
0
30 Sep 2024
Exploring Language Model Generalization in Low-Resource Extractive QA
Saptarshi Sengupta
Wenpeng Yin
Preslav Nakov
Shreya Ghosh
Suhang Wang
96
1
0
27 Sep 2024
Convolutional Signal Propagation: A Simple Scalable Algorithm for Hypergraphs
Pavel Procházka
Marek Dědič
Lukáš Bajer
GNN
50
0
0
26 Sep 2024
LangSAMP: Language-Script Aware Multilingual Pretraining
Yihong Liu
Haotian Ye
Chunlan Ma
Mingyang Wang
Hinrich Schütze
VLM
246
0
0
26 Sep 2024
How Transliterations Improve Crosslingual Alignment
Yihong Liu
Mingyang Wang
Amir Hossein Kargaran
Ayyoob Imani
Orgest Xhelili
Haotian Ye
Chunlan Ma
François Yvon
Hinrich Schütze
89
4
0
25 Sep 2024
EuroLLM: Multilingual Language Models for Europe
Pedro Henrique Martins
Patrick Fernandes
Joao Alves
Nuno M. Guerreiro
Ricardo Rei
...
Pierre Colombo
Barry Haddow
José G. C. de Souza
Alexandra Birch
André F. T. Martins
88
40
0
24 Sep 2024
Multilingual Transfer and Domain Adaptation for Low-Resource Languages of Spain
Yuanchang Luo
Zhanglin Wu
Daimeng Wei
Hengchao Shang
Zongyao Li
...
Shaojun Li
Jinlong Yang
Yuhao Xie
Jiawei Zheng
Bin Wei
Hao Yang
40
1
0
24 Sep 2024
Machine Translation Advancements of Low-Resource Indian Languages by Transfer Learning
Bin Wei
Jiawei Zhen
Zongyao Li
Zhanglin Wu
Daimeng Wei
...
Yuanchang Luo
Hengchao Shang
Jinlong Yang
Yuhao Xie
Hao Yang
VLM
45
2
0
24 Sep 2024
dnaGrinder: a lightweight and high-capacity genomic foundation model
Qihang Zhao
Chi Zhang
Weixiong Zhang
53
0
0
24 Sep 2024
HW-TSC's Submission to the CCMT 2024 Machine Translation Tasks
Zhanglin Wu
Yuanchang Luo
Daimeng Wei
Jiawei Zheng
Bin Wei
...
Jiaxin Guo
Shaojun Li
Mengli Zhu
Ning Xie
Hao Yang
93
1
0
23 Sep 2024
Choose the Final Translation from NMT and LLM hypotheses Using MBR Decoding: HW-TSC's Submission to the WMT24 General MT Shared Task
Zhanglin Wu
Daimeng Wei
Zongyao Li
Hengchao Shang
Jiaxin Guo
Shaojun Li
Zhiqiang Rao
Yuanchang Luo
Ning Xie
Hao Yang
57
5
0
23 Sep 2024
Cross-Domain Content Generation with Domain-Specific Small Language Models
Ankit Maloo
Abhinav Garg
CLL
47
0
0
19 Sep 2024
An Efficient Self-Learning Framework For Interactive Spoken Dialog Systems
Hitesh Tulsiani
David M. Chan
Shalini Ghosh
Garima Lalwani
Prabhat Pandey
Ankish Bansal
Sri Garimella
Ariya Rastrow
Björn Hoffmeister
53
0
0
16 Sep 2024
PixelBytes: Catching Unified Representation for Multimodal Generation
Fabien Furfaro
44
0
0
16 Sep 2024
DomURLs_BERT: Pre-trained BERT-based Model for Malicious Domains and URLs Detection and Classification
Abdelkader El Mahdaouy
Salima Lamsiyah
Meryem Janati Idrissi
H. Alami
Zakaria Yartaoui
Ismail Berrada
53
3
0
13 Sep 2024
Optimizing Rare Word Accuracy in Direct Speech Translation with a Retrieval-and-Demonstration Approach
Siqi Li
Danni Liu
Jan Niehues
58
1
0
13 Sep 2024
Retro-li: Small-Scale Retrieval Augmented Generation Supporting Noisy Similarity Searches and Domain Shift Generalization
Gentiana Rashiti
G. Karunaratne
Mrinmaya Sachan
Abu Sebastian
Abbas Rahimi
RALM
230
0
0
12 Sep 2024
TeXBLEU: Automatic Metric for Evaluate LaTeX Format
Kyudan Jung
N. Kim
Hyongon Ryu
Sieun Hyeon
Seung-jun Lee
Hyeok-jae Lee
77
1
0
10 Sep 2024
BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training
Pavel Chizhov
Catherine Arnett
Elizaveta Korotkova
Ivan P. Yamshchikov
85
5
0
06 Sep 2024
Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak
Mukhammadsaid Mamasaidov
Abror Shopulatov
VLM
54
4
0
06 Sep 2024
The AdEMAMix Optimizer: Better, Faster, Older
Matteo Pagliardini
Pierre Ablin
David Grangier
ODL
91
13
0
05 Sep 2024
Multi-modal Situated Reasoning in 3D Scenes
Xiongkun Linghu
Jiangyong Huang
Xuesong Niu
Xiaojian Ma
Baoxiong Jia
Siyuan Huang
111
19
0
04 Sep 2024
Resource-Efficient Adaptation of Speech Foundation Models for Multi-Speaker ASR
Weiqing Wang
Kunal Dhawan
Taejin Park
Krishna Puvvada
Ivan Medennikov
Somshubra Majumdar
He Huang
Jagadeesh Balam
Boris Ginsburg
72
2
0
02 Sep 2024
Multi-Modal Multi-Granularity Tokenizer for Chu Bamboo Slip Scripts
Yingfa Chen
Chenlong Hu
Cong Feng
Chenyang Song
Shi Yu
Xu Han
Zhiyuan Liu
Maosong Sun
60
0
0
02 Sep 2024
Towards Tailored Recovery of Lexical Diversity in Literary Machine Translation
Esther Ploeger
Huiyuan Lai
Rik van Noord
Antonio Toral
60
2
0
30 Aug 2024
Large-Scale Multi-omic Biosequence Transformers for Modeling Protein-Nucleic Acid Interactions
Sully F. Chen
Robert J. Steele
Glen M. Hocky
Beakal Lemeneh
S. Lad
Eric Oermann
AI4CE
89
0
0
29 Aug 2024
Language Adaptation on a Tight Academic Compute Budget: Tokenizer Swapping Works and Pure bfloat16 Is Enough
Konstantin Dobler
Gerard de Melo
78
1
0
28 Aug 2024
Depth-Weighted Detection of Behaviours of Risk in People with Dementia using Cameras
Pratik K. Mishra
Irene Ballester
Andrea Iaboni
Bing Ye
Kristine Newman
Alex Mihailidis
Shehroz S. Khan
81
2
0
28 Aug 2024
Positional Description for Numerical Normalization
Deepanshu Gupta
Javier Latorre
3DGS
56
0
0
22 Aug 2024
Distributional Properties of Subword Regularization
Marco Cognetta
Vilém Zouhar
Naoaki Okazaki
67
0
0
21 Aug 2024
Plug, Play, and Fuse: Zero-Shot Joint Decoding via Word-Level Re-ranking Across Diverse Vocabularies
Sai Koneru
Matthias Huck
M. Exel
Jan Niehues
65
0
0
21 Aug 2024
Goldfish: Monolingual Language Models for 350 Languages
Tyler A. Chang
Catherine Arnett
Zhuowen Tu
Benjamin Bergen
LRM
132
10
0
19 Aug 2024
Language-Informed Beam Search Decoding for Multilingual Machine Translation
Yilin Yang
Stefan Lee
Prasad Tadepalli
50
1
0
11 Aug 2024
AcrosticSleuth: Probabilistic Identification and Ranking of Acrostics in Multilingual Corpora
Aleksandr Fedchin
Isabel Cooperman
Pramit Chaudhuri
Joseph P. Dexter
78
0
0
08 Aug 2024
EMTeC: A Corpus of Eye Movements on Machine-Generated Texts
Lena S. Bolliger
Patrick Haller
Isabelle Caroline Rose Cretton
D. R. Reich
Tannon Kew
Lena Ann Jäger
60
5
0
08 Aug 2024
Cooperative Multi-Agent Deep Reinforcement Learning in Content Ranking Optimization
Zhou Qin
Kai Yuan
Pratik Lahiri
Wenyang Liu
BDL
122
0
0
08 Aug 2024
Semantics or spelling? Probing contextual word embeddings with orthographic noise
Jacob A. Matthews
John R. Starr
Marten van Schijndel
70
2
0
08 Aug 2024
EgyBERT: A Large Language Model Pretrained on Egyptian Dialect Corpora
Faisal Qarah
65
5
0
07 Aug 2024
SETN: Stock Embedding Enhanced with Textual and Network Information
Takehiro Takayanagi
Hiroki Sakaji
Kiyoshi Izumi
AIFin
143
2
0
06 Aug 2024
Compromising Embodied Agents with Contextual Backdoor Attacks
Aishan Liu
Yuguang Zhou
Xianglong Liu
Tianyuan Zhang
Siyuan Liang
...
Tianlin Li
Junqi Zhang
Wenbo Zhou
Qing Guo
Dacheng Tao
LLMAG
AAML
117
13
0
06 Aug 2024
MeshAnything V2: Artist-Created Mesh Generation With Adjacent Mesh Tokenization
Yiwen Chen
Yikai Wang
Yihao Luo
Ziyi Wang
Zilong Chen
Jun Zhu
Chi Zhang
Guosheng Lin
77
31
0
05 Aug 2024
Batching BPE Tokenization Merges
Alexander P. Morgan
70
0
0
05 Aug 2024
SNFinLLM: Systematic and Nuanced Financial Domain Adaptation of Chinese Large Language Models
Shujuan Zhao
Lingfeng Qiao
Kangyang Luo
Qian-Wen Zhang
Junru Lu
Di Yin
AIFin
80
3
0
05 Aug 2024
Advancing Post-OCR Correction: A Comparative Study of Synthetic Data
Shuhao Guan
Derek Greene
100
8
0
05 Aug 2024
Improving Multilingual Neural Machine Translation by Utilizing Semantic and Linguistic Features
Mengyu Bu
Shuhao Gu
Yang Feng
118
5
0
02 Aug 2024
Leveraging Entailment Judgements in Cross-Lingual Summarisation
Huajian Zhang
Laura Perez-Beltrachini
HILM
76
0
0
01 Aug 2024