Superbizarre Is Not Superb: Derivational Morphology Improves BERT's Interpretation of Complex Words

2 January 2021

Papers citing "Superbizarre Is Not Superb: Derivational Morphology Improves BERT's Interpretation of Complex Words"

SuperBPE: Space Travel for Language Models Alisa Liu J. Hayase Valentin Hofmann Sewoong Oh Noah A. Smith Yejin Choi 53 3 0 17 Mar 2025
Model Decides How to Tokenize: Adaptive DNA Sequence Tokenization with MxDNA Lifeng Qiao Peng Ye Yuchen Ren Weiqiang Bai Chaoqi Liang Xinzhu Ma Nanqing Dong W. Ouyang 99 2 0 18 Dec 2024
Evaluating Morphological Compositional Generalization in Large Language Models Mete Ismayilzada Yuan Chiang Jonne Sälevä Hale Sirin Abdullatif Köksal Bhuwan Dhingra Antoine Bosselut Lonneke van der Plas Duygu Ataman 41 2 0 16 Oct 2024
Tokenization and Morphology in Multilingual Language Models: A Comparative Analysis of mT5 and ByT5 Thao Anh Dang Limor Raviv Lukas Galke 27 1 0 15 Oct 2024
Morphological evaluation of subwords vocabulary used by BETO language model Óscar García-Sierra Ana Fernández-Pampillón Cesteros Miguel Ortega-Martín 41 0 0 03 Oct 2024
BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training Pavel Chizhov Catherine Arnett Elizaveta Korotkova Ivan P. Yamshchikov 50 2 0 06 Sep 2024
Latin Treebanks in Review: An Evaluation of Morphological Tagging Across Time Marisa Hudspeth Brendan O’Connor Laure Thompson 41 1 0 13 Aug 2024
Unsupervised Morphological Tree Tokenizer Qingyang Zhu Xiang Hu Pengyu Ji Wei Wu Kewei Tu 39 0 0 21 Jun 2024
HeSum: a Novel Dataset for Abstractive Text Summarization in Hebrew Tzuf Paz-Argaman Itai Mondshine Asaf Achi Mordechai Reut Tsarfaty 40 2 0 06 Jun 2024
Tokenization Matters! Degrading Large Language Models through Challenging Their Tokenization Dixuan Wang Yanda Li Junyuan Jiang Zepeng Ding Ziqin Luo Guochao Jiang Jiaqing Liang Deqing Yang 34 11 0 27 May 2024
Time Machine GPT Felix Drinkall Eghbal Rahimikia J. Pierrehumbert Stefan Zohren AI4TS AI4CE KELM SyDa 44 3 0 29 Apr 2024
Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge Khuyagbaatar Batsuren Ekaterina Vylomova Verna Dankers Tsetsuukhei Delgerbaatar Omri Uzan Yuval Pinter Gábor Bella 42 10 0 20 Apr 2024
A Morphology-Based Investigation of Positional Encodings Poulami Ghosh Shikhar Vashishth Raj Dabre Pushpak Bhattacharyya 34 1 0 06 Apr 2024
Verbing Weirds Language (Models): Evaluation of English Zero-Derivation in Five LLMs David R. Mortensen Valentina Izrailevitch Yunze Xiao Hinrich Schütze Leonie Weissweiler 20 5 0 26 Mar 2024
Different Tokenization Schemes Lead to Comparable Performance in Spanish Number Agreement Catherine Arnett Pamela D. Rivière Tyler A. Chang Sean Trott 29 2 0 20 Mar 2024
Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance Omer Goldman Avi Caciularu Matan Eyal Kris Cao Idan Szpektor Reut Tsarfaty 51 23 0 10 Mar 2024
Greed is All You Need: An Evaluation of Tokenizer Inference Methods Omri Uzan Craig W. Schmidt Chris Tanner Yuval Pinter 51 14 0 02 Mar 2024
Tokenization Is More Than Compression Craig W. Schmidt Varshini Reddy Haoran Zhang Alec Alameddine Omri Uzan Yuval Pinter Chris Tanner 61 28 0 28 Feb 2024
The Impact of Word Splitting on the Semantic Content of Contextualized Word Representations Aina Garí Soler Matthieu Labeau Chloé Clavel VLM 47 2 0 22 Feb 2024
DrBenchmark: A Large Language Understanding Evaluation Benchmark for French Biomedical Domain Yanis Labrak Adrien Bazoge Oumaima El Khettari Mickael Rouvier Pacome Constant dit Beaufils ... B. Daille Solen Quiniou Emmanuel Morin P. Gourraud Richard Dufour LM&MA 34 6 0 20 Feb 2024
Paloma: A Benchmark for Evaluating Language Model Fit Ian H. Magnusson Akshita Bhagia Valentin Hofmann Luca Soldaini A. Jha ... Iz Beltagy Hanna Hajishirzi Noah A. Smith Kyle Richardson Jesse Dodge 140 21 0 16 Dec 2023
Impact of Tokenization on LLaMa Russian Adaptation Mikhail Tikhomirov D. Chernyshev 35 4 0 05 Dec 2023
Explicit Morphological Knowledge Improves Pre-training of Language Models for Hebrew Eylon Gueta Omer Goldman Reut Tsarfaty 24 1 0 01 Nov 2023
BERTwich: Extending BERT's Capabilities to Model Dialectal and Noisy Text Aarohi Srivastava David Chiang 30 6 0 31 Oct 2023
Counting the Bugs in ChatGPT's Wugs: A Multilingual Investigation into the Morphological Capabilities of a Large Language Model Leonie Weissweiler Valentin Hofmann Anjali Kantharuban Anna Cai Ritam Dutt ... Abhishek Vijayakumar Haofei Yu Hinrich Schütze Kemal Oflazer David R. Mortensen 38 10 0 23 Oct 2023
Analyzing Cognitive Plausibility of Subword Tokenization Lisa Beinborn Yuval Pinter 29 17 0 20 Oct 2023
Sentence Embedding Models for Ancient Greek Using Multilingual Knowledge Distillation Kevin Krahn D. Tate Andrew C. Lamicela 25 4 0 24 Aug 2023
Make Text Unlearnable: Exploiting Effective Patterns to Protect Personal Data Xinzhe Li Ming Liu Shang Gao MU 53 8 0 02 Jul 2023
Biomedical Language Models are Robust to Sub-optimal Tokenization Bernal Jiménez Gutiérrez Huan Sun Yu-Chuan Su 22 6 0 30 Jun 2023
CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models Benjamin Minixhofer Jonas Pfeiffer Ivan Vulić 37 6 0 23 May 2023
Language Model Tokenizers Introduce Unfairness Between Languages Aleksandar Petrov Emanuele La Malfa Philip Torr Adel Bibi 52 98 0 17 May 2023
Effects of sub-word segmentation on performance of transformer language models Jue Hou Anisia Katinskaia Anh Vu R. Yangarber 21 4 0 09 May 2023
What do Large Language Models Learn beyond Language? Avinash Madasu Shashank Srivastava LRM AI4CE 44 5 0 21 Oct 2022
Incorporating Context into Subword Vocabularies Shaked Yehezkel Yuval Pinter 47 8 0 13 Oct 2022
State-of-the-art generalisation research in NLP: A taxonomy and review Dieuwke Hupkes Mario Giulianelli Verna Dankers Mikel Artetxe Yanai Elazar ... Leila Khalatbari Maria Ryskina Rita Frieske Ryan Cotterell Zhijing Jin 129 95 0 06 Oct 2022
Linguistically inspired roadmap for building biologically reliable protein language models Mai Ha Vu Rahmad Akbar Philippe A. Robert B. Swiatczak Victor Greiff G. K. Sandve Dag Trygve Tryslew Haug 52 35 0 03 Jul 2022
How Adults Understand What Young Children Say Stephan C. Meylan Ruthe Foushee Nicole H. L. Wong Elika Bergelson R. Levy 11 4 0 15 Jun 2022
Improving Tokenisation by Alternative Treatment of Spaces Edward Gow-Smith Harish Tayyar Madabushi Carolina Scarton Aline Villavicencio 37 20 0 08 Apr 2022
Morphological Processing of Low-Resource Languages: Where We Are and What's Next Adam Wiemerslage Miikka Silfverberg Changbing Yang Arya D. McCarthy Garrett Nicolai Eliana Colunga Katharina Kann 36 12 0 16 Mar 2022
Signal in Noise: Exploring Meaning Encoded in Random Character Sequences with Character-Aware Language Models Mark Chu Bhargav Srinivasa Desikan E. Nadler Ruggerio L. Sardo Elise Darragh-Ford Douglas Guilbeault 25 0 0 15 Mar 2022
Morphology Without Borders: Clause-Level Morphology Omer Goldman Reut Tsarfaty AILaw 49 3 0 25 Feb 2022
Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP Sabrina J. Mielke Zaid Alyafeai Elizabeth Salesky Colin Raffel Manan Dey ... Arun Raja Chenglei Si Wilson Y. Lee Benoît Sagot Samson Tan 34 143 0 20 Dec 2021
Efficient Intent Detection with Dual Sentence Encoders I. Casanueva Tadas Temčinas D. Gerz Matthew Henderson Ivan Vulić VLM 180 454 0 10 Mar 2020
Probabilistic FastText for Multi-Sense Word Embeddings Ben Athiwaratkun A. Wilson Anima Anandkumar 34 137 0 07 Jun 2018
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation Yonghui Wu M. Schuster Zhifeng Chen Quoc V. Le Mohammad Norouzi ... Alex Rudnick Oriol Vinyals G. Corrado Macduff Hughes J. Dean AIMat 718 6,750 0 26 Sep 2016
Efficient Estimation of Word Representations in Vector Space Tomas Mikolov Kai Chen G. Corrado J. Dean 3DV 322 31,297 0 16 Jan 2013