CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters

20 October 2020

Papers citing "CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters"

50 / 72 papers shown

TempCharBERT: Keystroke Dynamics for Continuous Access Control Based on Pre-trained Language Models Matheus Simão Fabiano Prado Omar Abdul Wahab Anderson Avila 21 0 0 11 Nov 2024
Robust Neural Information Retrieval: An Adversarial and Out-of-distribution Perspective Yu-An Liu Ruqing Zhang Jiafeng Guo Maarten de Rijke Yixing Fan Xueqi Cheng 35 6 0 09 Jul 2024
SpaceByte: Towards Deleting Tokenization from Large Language Modeling Kevin Slagle 37 3 0 22 Apr 2024
We're Calling an Intervention: Exploring Fundamental Hurdles in Adapting Language Models to Nonstandard Text Aarohi Srivastava David Chiang 59 0 0 10 Apr 2024
The Impact of Word Splitting on the Semantic Content of Contextualized Word Representations Aina Garí Soler Matthieu Labeau Chloé Clavel VLM 42 2 0 22 Feb 2024
Knowledge of Pretrained Language Models on Surface Information of Tokens Tatsuya Hiraoka Naoaki Okazaki 29 1 0 15 Feb 2024
Anisotropy Is Inherent to Self-Attention in Transformers Nathan Godey Eric Villemonte de la Clergerie Benoît Sagot 13 16 0 22 Jan 2024
ALMs: Authorial Language Models for Authorship Attribution Weihang Huang Akira Murakami Jack Grieve DeLMO 11 3 0 22 Jan 2024
Too Much Information: Keeping Training Simple for BabyLMs Lukas Edman Lisa Bylinina 24 4 0 03 Nov 2023
Split-NER: Named Entity Recognition via Two Question-Answering-based Classifications Jatin Arora Youngja Park 24 7 0 30 Oct 2023
Learn Your Tokens: Word-Pooled Tokenization for Language Modeling Avijit Thawani Saurabh Ghanekar Xiaoyuan Zhu Jay Pujara 32 4 0 17 Oct 2023
Optimized Tokenization for Transcribed Error Correction Tomer Wullach Shlomo E. Chazan 24 0 0 16 Oct 2023
To token or not to token: A Comparative Study of Text Representations for Cross-Lingual Transfer Md. Mushfiqur Rahman Fardin Ahsan Sakib Fahim Faisal Antonios Anastasopoulos 15 3 0 12 Oct 2023
Make Text Unlearnable: Exploiting Effective Patterns to Protect Personal Data Xinzhe Li Ming Liu Shang Gao MU 25 8 0 02 Jul 2023
Enriching the NArabizi Treebank: A Multifaceted Approach to Supporting an Under-Resourced Language Arij Riabi Menel Mahamdi Djamé Seddah 32 5 0 26 Jun 2023
Typo-Robust Representation Learning for Dense Retrieval Panuthep Tasawong Wuttikorn Ponwitayarat Peerat Limkonchotiwat Can Udomcharoenchaikit E. Chuangsuwanich Sarana Nutanong OOD 40 4 0 17 Jun 2023
Is Anisotropy Inherent to Transformers? Nathan Godey Eric Villemonte de la Clergerie Benoît Sagot 17 3 0 13 Jun 2023
AVScan2Vec: Feature Learning on Antivirus Scan Data for Production-Scale Malware Corpora R. Joyce Tirth Patel Charles K. Nicholas Edward Raff 15 4 0 09 Jun 2023
From Characters to Words: Hierarchical Pre-trained Language Model for Open-vocabulary Language Understanding Li Sun F. Luisier Kayhan Batmanghelich D. Florêncio Changrong Zhang VLM 15 6 0 23 May 2023
Representation Learning for Person or Entity-centric Knowledge Graphs: An Application in Healthcare Christos Theodoropoulos N. Mulligan T. Stappenbeck Joao H. Bettencourt-Silva 14 5 0 09 May 2023
Does Manipulating Tokenization Aid Cross-Lingual Transfer? A Study on POS Tagging for Non-Standardized Languages Verena Blaschke Hinrich Schütze Barbara Plank 36 14 0 20 Apr 2023
An Information Extraction Study: Take In Mind the Tokenization! Christos Theodoropoulos Marie-Francine Moens 29 6 0 27 Mar 2023
IPA-CLIP: Integrating Phonetic Priors into Vision and Language Pretraining Chihaya Matsuhira Marc A. Kastner Takahiro Komamizu Takatsugu Hirayama Keisuke Doman Yasutomo Kawanishi Ichiro Ide 32 6 0 06 Mar 2023
Are Character-level Translations Worth the Wait? Comparing ByT5 and mT5 for Machine Translation Lukas Edman Gabriele Sarti Antonio Toral Gertjan van Noord Arianna Bisazza 16 11 0 28 Feb 2023
What Makes for Good Tokenizers in Vision Transformer? Shengju Qian Yi Zhu Wenbo Li Mu Li Jiaya Jia ViT 37 14 0 21 Dec 2022
Inducing Character-level Structure in Subword-based Language Models with Type-level Interchange Intervention Training Jing-ling Huang Zhengxuan Wu Kyle Mahowald Christopher Potts 24 13 0 19 Dec 2022
MANTa: Efficient Gradient-Based Tokenization for Robust End-to-End Language Modeling Nathan Godey Roman Castagné Eric Villemonte de la Clergerie Benoît Sagot 13 3 0 14 Dec 2022
Subword-Delimited Downsampling for Better Character-Level Translation Lukas Edman Antonio Toral Gertjan van Noord 17 6 0 02 Dec 2022
Word-Level Representation From Bytes For Language Modeling Chul Lee Qipeng Guo Xipeng Qiu 15 1 0 23 Nov 2022
Continuous Prompt Tuning Based Textual Entailment Model for E-commerce Entity Typing Yibo Wang Congying Xia Guan Wang Philip Yu 18 6 0 04 Nov 2022
BERT Meets CTC: New Formulation of End-to-End Speech Recognition with Pre-trained Masked Language Model Yosuke Higuchi Brian Yan Siddhant Arora Tetsuji Ogawa Tetsunori Kobayashi Shinji Watanabe 54 25 0 29 Oct 2022
HashFormers: Towards Vocabulary-independent Pre-trained Transformers Huiyin Xue Nikolaos Aletras 19 4 0 14 Oct 2022
Enriching Biomedical Knowledge for Low-resource Language Through Large-Scale Translation Long Phan Tai Dang H. Tran Trieu H. Trinh Vy Phan Lam D. Chau Minh-Thang Luong 10 8 0 11 Oct 2022
On the State of the Art in Authorship Attribution and Authorship Verification Jacob Tyo Bhuwan Dhingra Zachary Chase Lipton 37 22 0 14 Sep 2022
Review of Natural Language Processing in Pharmacology D. Trajanov Vangel Trajkovski Makedonka Dimitrieva Jovana Dobreva Milos Jovanovik Matej Klemen Aleš Žagar Marko Robnik-Šikonja LM&MA 23 7 0 22 Aug 2022
MockingBERT: A Method for Retroactively Adding Resilience to NLP Models Jan Jezabek A. Singh SILM KELM 15 0 0 21 Aug 2022
Cross-lingual Approaches for the Detection of Adverse Drug Reactions in German from a Patient's Perspective Lisa Raithel Philippe E. Thomas Roland Roller Oliver Sapina Sebastian Möller Pierre Zweigenbaum 16 2 0 03 Aug 2022
Language Modelling with Pixels Phillip Rust Jonas F. Lotz Emanuele Bugliarello Elizabeth Salesky Miryam de Lhoneux Desmond Elliott VLM 38 46 0 14 Jul 2022
Local Byte Fusion for Neural Machine Translation Makesh Narsimhan Sreedhar Xiangpeng Wan Yu-Jie Cheng Junjie Hu 27 4 0 23 May 2022
Revisiting Pre-trained Language Models and their Evaluation for Arabic Natural Language Understanding Abbas Ghaddar Yimeng Wu Sunyam Bagga Ahmad Rashid Khalil Bibi ... Zhefeng Wang Baoxing Huai Xin Jiang Qun Liu Philippe Langlais 22 6 0 21 May 2022
Decorate the Examples: A Simple Method of Prompt Design for Biomedical Relation Extraction Hui-Syuan Yeh Thomas Lavergne Pierre Zweigenbaum 21 10 0 21 Apr 2022
Data Augmentation for Biomedical Factoid Question Answering Dimitris Pappas Prodromos Malakasiotis Ion Androutsopoulos MedIm 14 12 0 10 Apr 2022
CharacterBERT and Self-Teaching for Improving the Robustness of Dense Retrievers on Queries with Typos Shengyao Zhuang Guido Zuccon OOD 19 30 0 01 Apr 2022
Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech Guangyan Zhang Kaitao Song Xu Tan Daxin Tan Yuzi Yan ... G. Wang Wei Zhou Tao Qin Tan Lee Sheng Zhao SSL 20 21 0 31 Mar 2022
vTTS: visual-text to speech Yoshifumi Nakano Takaaki Saeki Shinnosuke Takamichi Katsuhito Sudoh Hiroshi Saruwatari 13 4 0 28 Mar 2022
Signal in Noise: Exploring Meaning Encoded in Random Character Sequences with Character-Aware Language Models Mark Chu Bhargav Srinivasa Desikan E. Nadler Ruggerio L. Sardo Elise Darragh-Ford Douglas Guilbeault 20 0 0 15 Mar 2022
Imputing Out-of-Vocabulary Embeddings with LOVE Makes Language Models Robust with Little Cost Lihu Chen Gaël Varoquaux Fabian M. Suchanek 13 14 0 15 Mar 2022
The EMory BrEast imaging Dataset (EMBED): A Racially Diverse, Granular Dataset of 3.5M Screening and Diagnostic Mammograms J. Jeong B. Vey A. Bhimireddy Thomas Kim Thiago Santos ... Christopher R. McAdams Mary S. Newell Imon Banerjee J. Gichoya Hari M. Trivedi 6 4 0 08 Feb 2022
An Ensemble of Pre-trained Transformer Models For Imbalanced Multiclass Malware Classification Ferhat Demirkiran Aykut Çayır U. Ünal Hasan Dag 38 42 0 25 Dec 2021
Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP Sabrina J. Mielke Zaid Alyafeai Elizabeth Salesky Colin Raffel Manan Dey ... Arun Raja Chenglei Si Wilson Y. Lee Benoît Sagot Samson Tan 30 141 0 20 Dec 2021