Fast WordPiece Tokenization

31 December 2020

Alexandru Salcianu

Papers citing "Fast WordPiece Tokenization"

Title | Authors | Tags | Metrics | Date
Ustnlp16 at SemEval-2025 Task 9: Improving Model Performance through Imbalance Handling and Focal Loss | Zhuoang Cai, Zehan Li, Yi Liu, Liyuan Guo, Yangqiu Song | | 36 0 0 | 24 Apr 2025
Annotative Indexing | Charles L. A. Clarke | | 7 0 0 | 09 Nov 2024
MotionGlot: A Multi-Embodied Motion Generation Model | Sudarshan Harithas, Srinath Sridhar | | 82 1 0 | 22 Oct 2024
SubRegWeigh: Effective and Efficient Annotation Weighing with Subword Regularization | Kohei Tsuji, Tatsuya Hiraoka, Yuchang Cheng, Tomoya Iwakura | | 47 1 0 | 10 Sep 2024
Language models emulate certain cognitive profiles: An investigation of how predictability measures interact with individual differences | Patrick Haller, Lena S. Bolliger, Lena Ann Jäger | | 42 1 0 | 07 Jun 2024
Revisiting character-level adversarial attacks | Elias Abad Rocamora, Yongtao Wu, Fanghui Liu, Grigorios G. Chrysos, V. Cevher | AAML | 39 3 0 | 07 May 2024
I/O in Machine Learning Applications on HPC Systems: A 360-degree Survey | Noah Lewis, J. L. Bez, Suren Byna | | 62 0 0 | 16 Apr 2024
A Survey of Source Code Representations for Machine Learning-Based Cybersecurity Tasks | B.K. Casey, Joanna C. S. Santos, George Perry | | 63 5 0 | 15 Mar 2024
Subobject-level Image Tokenization | Delong Chen, Samuel Cahyawijaya, Jianfeng Liu, Baoyuan Wang, Pascale Fung | VLM, OCL | 60 7 0 | 22 Feb 2024
Leveraging Domain Adaptation and Data Augmentation to Improve Quránic IR in English and Arabic | Vera Pavlova | | 31 2 0 | 05 Dec 2023
On the Analysis of Cross-Lingual Prompt Tuning for Decoder-based Multilingual Model | Nohil Park, Joonsuk Park, Kang Min Yoo, Sungroh Yoon | | 36 3 0 | 14 Nov 2023
DictaBERT: A State-of-the-Art BERT Suite for Modern Hebrew | Shaltiel Shmidman, Avi Shmidman, Moshe Koppel | | 30 7 0 | 31 Aug 2023
Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models | Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Jungo Kasai, David R. Mortensen, Noah A. Smith, Yulia Tsvetkov | | 53 82 0 | 23 May 2023
Downstream Task-Oriented Neural Tokenizer Optimization with Vocabulary Restriction as Post Processing | Tatsuya Hiraoka, Tomoya Iwakura | | 20 0 0 | 21 Apr 2023
Language Model Classifier Aligns Better with Physician Word Sensitivity than XGBoost on Readmission Prediction | Grace Yang, Mingzi Cao, L. Jiang, Xujin C. Liu, Alexander T. M. Cheung, Hannah Weiss, Davied Kurland, Kyunghyun Cho, Eric K. Oermann | LM&MA | 24 3 0 | 13 Nov 2022
MaxMatch-Dropout: Subword Regularization for WordPiece | Tatsuya Hiraoka | | 54 8 0 | 09 Sep 2022
pNLP-Mixer: an Efficient all-MLP Architecture for Language | Francesco Fusco, Damian Pascual, Peter W. J. Staar, Diego Antognini | | 37 29 0 | 09 Feb 2022
Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP | Sabrina J. Mielke, Zaid Alyafeai, Elizabeth Salesky, Colin Raffel, Manan Dey, ..., Arun Raja, Chenglei Si, Wilson Y. Lee, Benoît Sagot, Samson Tan | | 34 143 0 | 20 Dec 2021
Sparse Distillation: Speeding Up Text Classification by Using Bigger Student Models | Qinyuan Ye, Madian Khabsa, M. Lewis, Sinong Wang, Xiang Ren, Aaron Jaech | | 39 5 0 | 16 Oct 2021