2103.06874
CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation
11 March 2021
J. Clark
Dan Garrette
Iulia Turc
John Wieting
Papers citing
"CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation"
50 / 143 papers shown
Title
SecureReg: Combining NLP and MLP for Enhanced Detection of Malicious Domain Name Registrations
Furkan Çolhak
Mert İlhan Ecevit
Hasan Dağ
Reiner Creutzburg
16
0
0
06 Jan 2024
Learning Mutually Informed Representations for Characters and Subwords
Yilin Wang
Xinyi Hu
Matthew R. Gormley
33
0
0
14 Nov 2023
Explicit Morphological Knowledge Improves Pre-training of Language Models for Hebrew
Eylon Gueta
Omer Goldman
Reut Tsarfaty
11
1
0
01 Nov 2023
Text Rendering Strategies for Pixel Language Models
Jonas F. Lotz
Elizabeth Salesky
Phillip Rust
Desmond Elliott
VLM
27
11
0
01 Nov 2023
Learning to Abstract with Nonparametric Variational Information Bottleneck
Melika Behjati
Fabio Fehr
James Henderson
SSL
24
1
0
26 Oct 2023
Analyzing Cognitive Plausibility of Subword Tokenization
Lisa Beinborn
Yuval Pinter
29
17
0
20 Oct 2023
Learn Your Tokens: Word-Pooled Tokenization for Language Modeling
Avijit Thawani
Saurabh Ghanekar
Xiaoyuan Zhu
Jay Pujara
32
4
0
17 Oct 2023
Optimized Tokenization for Transcribed Error Correction
Tomer Wullach
Shlomo E. Chazan
24
0
0
16 Oct 2023
Tokenizer Choice For LLM Training: Negligible or Crucial?
Mehdi Ali
Michael Fromm
Klaudia Thellmann
Richard Rutmann
Max Lübbering
...
Malte Ostendorff
Samuel Weinbach
R. Sifa
Stefan Kesselheim
Nicolas Flores-Herr
21
47
0
12 Oct 2023
To token or not to token: A Comparative Study of Text Representations for Cross-Lingual Transfer
Md. Mushfiqur Rahman
Fardin Ahsan Sakib
Fahim Faisal
Antonios Anastasopoulos
20
3
0
12 Oct 2023
Pit One Against Many: Leveraging Attention-head Embeddings for Parameter-efficient Multi-head Attention
Huiyin Xue
Nikolaos Aletras
30
0
0
11 Oct 2023
Syllable-level lyrics generation from melody exploiting character-level language model
Zhe Zhang
Karol Lasocki
Yi Yu
Atsuhiro Takasu
15
5
0
02 Oct 2023
Assessment of Pre-Trained Models Across Languages and Grammars
Alberto Muñoz-Ortiz
David Vilares
Carlos Gómez-Rodríguez
21
2
0
20 Sep 2023
A multimodal deep learning architecture for smoking detection with a small data approach
Róbert Lakatos
P. Pollner
András Hajdu
Tamas Joo
24
7
0
19 Sep 2023
Multilingual Text Representation
Fahim Faisal
19
0
0
02 Sep 2023
Lightweight Adaptation of Neural Language Models via Subspace Embedding
Amit Kumar Jaiswal
Haiming Liu
34
2
0
16 Aug 2023
Enhancing Phenotype Recognition in Clinical Notes Using Large Language Models: PhenoBCBERT and PhenoGPT
Jing Yang
Cong Liu
Wendy Deng
Dangwei Wu
C. Weng
Yunyun Zhou
Kai Wang
27
20
0
11 Aug 2023
CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code
Nadezhda Chirkova
Sergey Troshin
21
8
0
01 Aug 2023
Biomedical Language Models are Robust to Sub-optimal Tokenization
Bernal Jiménez Gutiérrez
Huan Sun
Yu-Chuan Su
22
6
0
30 Jun 2023
Is Anisotropy Inherent to Transformers?
Nathan Godey
Eric Villemonte de la Clergerie
Benoît Sagot
17
3
0
13 Jun 2023
When Vision Fails: Text Attacks Against ViT and OCR
Nicholas Boucher
Jenny Blessing
Ilia Shumailov
Ross J. Anderson
Nicolas Papernot
AAML
34
4
0
12 Jun 2023
Hierarchical Attention Encoder Decoder
Asier Mujika
BDL
22
3
0
01 Jun 2023
Where's the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation
Benjamin Minixhofer
Jonas Pfeiffer
Ivan Vulić
21
16
0
30 May 2023
Byte-Level Grammatical Error Correction Using Synthetic and Curated Corpora
Svanhvít Lilja Ingólfsdóttir
Pétur Orri Ragnarsson
H. Jónsson
Haukur Barri Símonarson
Vilhjálmur Þorsteinsson
Vésteinn Snæbjarnarson
SyDa
30
9
0
29 May 2023
From Characters to Words: Hierarchical Pre-trained Language Model for Open-vocabulary Language Understanding
Li Sun
F. Luisier
Kayhan Batmanghelich
D. Florêncio
Changrong Zhang
VLM
15
6
0
23 May 2023
Multilingual Pixel Representations for Translation and Effective Cross-lingual Transfer
Elizabeth Salesky
Neha Verma
Philipp Koehn
Matt Post
24
14
0
23 May 2023
CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models
Benjamin Minixhofer
Jonas Pfeiffer
Ivan Vulić
24
6
0
23 May 2023
mPLM-Sim: Better Cross-Lingual Similarity and Transfer in Multilingual Pretrained Language Models
Peiqin Lin
Chengzhi Hu
Zheyu Zhang
André F. T. Martins
Hinrich Schütze
27
1
0
23 May 2023
Language Model Tokenizers Introduce Unfairness Between Languages
Aleksandar Petrov
Emanuele La Malfa
Philip H. S. Torr
Adel Bibi
16
97
0
17 May 2023
MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers
L. Yu
Daniel Simig
Colin Flaherty
Armen Aghajanyan
Luke Zettlemoyer
M. Lewis
21
84
0
12 May 2023
Subword Segmental Machine Translation: Unifying Segmentation and Target Sentence Generation
Francois Meyer
Jan Buys
35
8
0
11 May 2023
What is the best recipe for character-level encoder-only modelling?
Kris Cao
32
2
0
09 May 2023
An Information Extraction Study: Take In Mind the Tokenization!
Christos Theodoropoulos
Marie-Francine Moens
29
6
0
27 Mar 2023
Are Character-level Translations Worth the Wait? Comparing ByT5 and mT5 for Machine Translation
Lukas Edman
Gabriele Sarti
Antonio Toral
Gertjan van Noord
Arianna Bisazza
16
11
0
28 Feb 2023
Elementwise Language Representation
Du-Yeong Kim
Jeeeun Kim
28
0
0
27 Feb 2023
XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models
Davis Liang
Hila Gonen
Yuning Mao
Rui Hou
Naman Goyal
Marjan Ghazvininejad
Luke Zettlemoyer
Madian Khabsa
12
72
0
25 Jan 2023
Curriculum Script Distillation for Multilingual Visual Question Answering
Khyathi Raghavi Chandu
A. Geramifard
25
0
0
17 Jan 2023
What Makes for Good Tokenizers in Vision Transformer?
Shengju Qian
Yi Zhu
Wenbo Li
Mu Li
Jiaya Jia
ViT
37
14
0
21 Dec 2022
Character-Aware Models Improve Visual Text Rendering
Rosanne Liu
Daniel H Garrette
Chitwan Saharia
William Chan
Adam Roberts
Sharan Narang
Irina Blok
R. Mical
Mohammad Norouzi
Noah Constant
VLM
23
71
0
20 Dec 2022
ByGPT5: End-to-End Style-conditioned Poetry Generation with Token-free Language Models
Jonas Belouadi
Steffen Eger
54
24
0
20 Dec 2022
Inducing Character-level Structure in Subword-based Language Models with Type-level Interchange Intervention Training
Jing-ling Huang
Zhengxuan Wu
Kyle Mahowald
Christopher Potts
24
13
0
19 Dec 2022
MANTa: Efficient Gradient-Based Tokenization for Robust End-to-End Language Modeling
Nathan Godey
Roman Castagné
Eric Villemonte de la Clergerie
Benoît Sagot
13
3
0
14 Dec 2022
A Survey of Text Representation Methods and Their Genealogy
Philipp Siebers
Christian Janiesch
Patrick Zschech
AI4TS
14
9
0
26 Nov 2022
Efficient Transformers with Dynamic Token Pooling
Piotr Nawrot
J. Chorowski
Adrian Łańcucki
E. Ponti
20
42
0
17 Nov 2022
Collateral facilitation in humans and language models
J. Michaelov
Benjamin Bergen
17
11
0
09 Nov 2022
Local Structure Matters Most in Most Languages
Louis Clouâtre
Prasanna Parthasarathi
Amal Zouaq
Sarath Chandar
31
1
0
09 Nov 2022
Detecting Languages Unintelligible to Multilingual Models through Local Structure Probes
Louis Clouâtre
Prasanna Parthasarathi
Amal Zouaq
Sarath Chandar
33
3
0
09 Nov 2022
Continuous Prompt Tuning Based Textual Entailment Model for E-commerce Entity Typing
Yibo Wang
Congying Xia
Guan Wang
Philip Yu
18
6
0
04 Nov 2022
Token-level Sequence Labeling for Spoken Language Understanding using Compositional End-to-End Models
Siddhant Arora
Siddharth Dalmia
Brian Yan
Florian Metze
A. Black
Shinji Watanabe
15
12
0
27 Oct 2022
HashFormers: Towards Vocabulary-independent Pre-trained Transformers
Huiyin Xue
Nikolaos Aletras
19
4
0
14 Oct 2022