CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

11 March 2021

Papers citing "CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation"

50 / 143 papers shown

SecureReg: Combining NLP and MLP for Enhanced Detection of Malicious Domain Name Registrations Furkan Çolhak Mert İlhan Ecevit Hasan Dağ Reiner Creutzburg 16 0 0 06 Jan 2024
Learning Mutually Informed Representations for Characters and Subwords Yilin Wang Xinyi Hu Matthew R. Gormley 33 0 0 14 Nov 2023
Explicit Morphological Knowledge Improves Pre-training of Language Models for Hebrew Eylon Gueta Omer Goldman Reut Tsarfaty 11 1 0 01 Nov 2023
Text Rendering Strategies for Pixel Language Models Jonas F. Lotz Elizabeth Salesky Phillip Rust Desmond Elliott VLM 27 11 0 01 Nov 2023
Learning to Abstract with Nonparametric Variational Information Bottleneck Melika Behjati Fabio Fehr James Henderson SSL 24 1 0 26 Oct 2023
Analyzing Cognitive Plausibility of Subword Tokenization Lisa Beinborn Yuval Pinter 29 17 0 20 Oct 2023
Learn Your Tokens: Word-Pooled Tokenization for Language Modeling Avijit Thawani Saurabh Ghanekar Xiaoyuan Zhu Jay Pujara 32 4 0 17 Oct 2023
Optimized Tokenization for Transcribed Error Correction Tomer Wullach Shlomo E. Chazan 24 0 0 16 Oct 2023
Tokenizer Choice For LLM Training: Negligible or Crucial? Mehdi Ali Michael Fromm Klaudia Thellmann Richard Rutmann Max Lübbering ... Malte Ostendorff Samuel Weinbach R. Sifa Stefan Kesselheim Nicolas Flores-Herr 21 47 0 12 Oct 2023
To token or not to token: A Comparative Study of Text Representations for Cross-Lingual Transfer Md. Mushfiqur Rahman Fardin Ahsan Sakib Fahim Faisal Antonios Anastasopoulos 20 3 0 12 Oct 2023
Pit One Against Many: Leveraging Attention-head Embeddings for Parameter-efficient Multi-head Attention Huiyin Xue Nikolaos Aletras 30 0 0 11 Oct 2023
Syllable-level lyrics generation from melody exploiting character-level language model Zhe Zhang Karol Lasocki Yi Yu Atsuhiro Takasu 15 5 0 02 Oct 2023
Assessment of Pre-Trained Models Across Languages and Grammars Alberto Muñoz-Ortiz David Vilares Carlos Gómez-Rodríguez 21 2 0 20 Sep 2023
A multimodal deep learning architecture for smoking detection with a small data approach Róbert Lakatos P. Pollner András Hajdu Tamas Joo 24 7 0 19 Sep 2023
Multilingual Text Representation Fahim Faisal 19 0 0 02 Sep 2023
Lightweight Adaptation of Neural Language Models via Subspace Embedding Amit Kumar Jaiswal Haiming Liu 34 2 0 16 Aug 2023
Enhancing Phenotype Recognition in Clinical Notes Using Large Language Models: PhenoBCBERT and PhenoGPT Jing Yang Cong Liu Wendy Deng Dangwei Wu C. Weng Yunyun Zhou Kai Wang 27 20 0 11 Aug 2023
CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code Nadezhda Chirkova Sergey Troshin 21 8 0 01 Aug 2023
Biomedical Language Models are Robust to Sub-optimal Tokenization Bernal Jiménez Gutiérrez Huan Sun Yu-Chuan Su 22 6 0 30 Jun 2023
Is Anisotropy Inherent to Transformers? Nathan Godey Eric Villemonte de la Clergerie Benoît Sagot 17 3 0 13 Jun 2023
When Vision Fails: Text Attacks Against ViT and OCR Nicholas Boucher Jenny Blessing Ilia Shumailov Ross J. Anderson Nicolas Papernot AAML 34 4 0 12 Jun 2023
Hierarchical Attention Encoder Decoder Asier Mujika BDL 22 3 0 01 Jun 2023
Where's the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation Benjamin Minixhofer Jonas Pfeiffer Ivan Vulić 21 16 0 30 May 2023
Byte-Level Grammatical Error Correction Using Synthetic and Curated Corpora Svanhvít Lilja Ingólfsdóttir Pétur Orri Ragnarsson H. Jónsson Haukur Barri Símonarson Vilhjálmur Þorsteinsson Vésteinn Snæbjarnarson SyDa 30 9 0 29 May 2023
From Characters to Words: Hierarchical Pre-trained Language Model for Open-vocabulary Language Understanding Li Sun F. Luisier Kayhan Batmanghelich D. Florêncio Changrong Zhang VLM 15 6 0 23 May 2023
Multilingual Pixel Representations for Translation and Effective Cross-lingual Transfer Elizabeth Salesky Neha Verma Philipp Koehn Matt Post 24 14 0 23 May 2023
CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models Benjamin Minixhofer Jonas Pfeiffer Ivan Vulić 24 6 0 23 May 2023
mPLM-Sim: Better Cross-Lingual Similarity and Transfer in Multilingual Pretrained Language Models Peiqin Lin Chengzhi Hu Zheyu Zhang André F. T. Martins Hinrich Schütze 27 1 0 23 May 2023
Language Model Tokenizers Introduce Unfairness Between Languages Aleksandar Petrov Emanuele La Malfa Philip H. S. Torr Adel Bibi 16 97 0 17 May 2023
MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers L. Yu Daniel Simig Colin Flaherty Armen Aghajanyan Luke Zettlemoyer M. Lewis 21 84 0 12 May 2023
Subword Segmental Machine Translation: Unifying Segmentation and Target Sentence Generation Francois Meyer Jan Buys 35 8 0 11 May 2023
What is the best recipe for character-level encoder-only modelling? Kris Cao 32 2 0 09 May 2023
An Information Extraction Study: Take In Mind the Tokenization! Christos Theodoropoulos Marie-Francine Moens 29 6 0 27 Mar 2023
Are Character-level Translations Worth the Wait? Comparing ByT5 and mT5 for Machine Translation Lukas Edman Gabriele Sarti Antonio Toral Gertjan van Noord Arianna Bisazza 16 11 0 28 Feb 2023
Elementwise Language Representation Du-Yeong Kim Jeeeun Kim 28 0 0 27 Feb 2023
XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models Davis Liang Hila Gonen Yuning Mao Rui Hou Naman Goyal Marjan Ghazvininejad Luke Zettlemoyer Madian Khabsa 12 72 0 25 Jan 2023
Curriculum Script Distillation for Multilingual Visual Question Answering Khyathi Raghavi Chandu A. Geramifard 25 0 0 17 Jan 2023
What Makes for Good Tokenizers in Vision Transformer? Shengju Qian Yi Zhu Wenbo Li Mu Li Jiaya Jia ViT 37 14 0 21 Dec 2022
Character-Aware Models Improve Visual Text Rendering Rosanne Liu Daniel H Garrette Chitwan Saharia William Chan Adam Roberts Sharan Narang Irina Blok R. Mical Mohammad Norouzi Noah Constant VLM 23 71 0 20 Dec 2022
ByGPT5: End-to-End Style-conditioned Poetry Generation with Token-free Language Models Jonas Belouadi Steffen Eger 54 24 0 20 Dec 2022
Inducing Character-level Structure in Subword-based Language Models with Type-level Interchange Intervention Training Jing-ling Huang Zhengxuan Wu Kyle Mahowald Christopher Potts 24 13 0 19 Dec 2022
MANTa: Efficient Gradient-Based Tokenization for Robust End-to-End Language Modeling Nathan Godey Roman Castagné Eric Villemonte de la Clergerie Benoît Sagot 13 3 0 14 Dec 2022
A Survey of Text Representation Methods and Their Genealogy Philipp Siebers Christian Janiesch Patrick Zschech AI4TS 14 9 0 26 Nov 2022
Efficient Transformers with Dynamic Token Pooling Piotr Nawrot J. Chorowski Adrian Łańcucki E. Ponti 20 42 0 17 Nov 2022
Collateral facilitation in humans and language models J. Michaelov Benjamin Bergen 17 11 0 09 Nov 2022
Local Structure Matters Most in Most Languages Louis Clouâtre Prasanna Parthasarathi Amal Zouaq Sarath Chandar 31 1 0 09 Nov 2022
Detecting Languages Unintelligible to Multilingual Models through Local Structure Probes Louis Clouâtre Prasanna Parthasarathi Amal Zouaq Sarath Chandar 33 3 0 09 Nov 2022
Continuous Prompt Tuning Based Textual Entailment Model for E-commerce Entity Typing Yibo Wang Congying Xia Guan Wang Philip Yu 18 6 0 04 Nov 2022
Token-level Sequence Labeling for Spoken Language Understanding using Compositional End-to-End Models Siddhant Arora Siddharth Dalmia Brian Yan Florian Metze A. Black Shinji Watanabe 15 12 0 27 Oct 2022
HashFormers: Towards Vocabulary-independent Pre-trained Transformers Huiyin Xue Nikolaos Aletras 19 4 0 14 Oct 2022