A standardized Project Gutenberg corpus for statistical analysis of natural language and quantitative linguistics

19 December 2018

Papers citing "A standardized Project Gutenberg corpus for statistical analysis of natural language and quantitative linguistics"


Title | Authors | Tags | Counts | Date
Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora | Alex Warstadt, Aaron Mueller, Leshem Choshen, E. Wilcox, Chengxu Zhuang, ..., Rafael Mosquera, Bhargavi Paranjape, Adina Williams, Tal Linzen, Ryan Cotterell | - | 51 110 0 | 10 Apr 2025
BERTtime Stories: Investigating the Role of Synthetic Story Data in Language Pre-training | Nikitas Theodoropoulos, Giorgos Filandrianos, Vassilis Lyberatos, Maria Lymperaiou, Giorgos Stamou | SyDa | 62 1 0 | 24 Feb 2025
BabyLM Turns 3: Call for papers for the 2025 BabyLM workshop | Lucas Charpentier, Leshem Choshen, Ryan Cotterell, Mustafa Omer Gul, Michael Y. Hu, ..., Candace Ross, Raj Sanjay Shah, Alex Warstadt, Ethan Gotlieb Wilcox, Adina Williams | - | 57 2 0 | 15 Feb 2025
Is a Peeled Apple Still Red? Evaluating LLMs' Ability for Conceptual Combination with Property Type | Seokwon Song, Taehyun Lee, Jaewoo Ahn, Jae Hyuk Sung, Gunhee Kim | CoGe | 90 0 0 | 10 Feb 2025
A Distributional Perspective on Word Learning in Neural Language Models | Filippo Ficarra, Ryan Cotterell, Alex Warstadt | - | 58 1 0 | 09 Feb 2025
BOUQuET: dataset, Benchmark and Open initiative for Universal Quality Evaluation in Translation | Omnilingual MT Team, Pierre Yves Andrews, Mikel Artetxe, Mariano Coria Meglioli, Marta R. Costa-jussá, ..., Eduardo Sánchez, Ioannis Tsiamas, Arina Turkatenko, Albert Ventayol-Boada, Shireen Yates | - | 113 0 0 | 06 Feb 2025
From Tokens to Words: On the Inner Lexicon of LLMs | Guy Kaplan, Matanel Oren, Yuval Reif, Roy Schwartz | - | 61 14 0 | 08 Oct 2024
BAMBINO-LM: (Bilingual-)Human-Inspired Continual Pretraining of BabyLM | Zhewen Shen, Aditya Joshi, Ruey-Cheng Chen | CLL | 52 2 0 | 17 Jun 2024
Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory | Xueyan Niu, Bo Bai, Lei Deng, Wei Han | - | 44 6 0 | 14 May 2024
A Methodology for Generative Spelling Correction via Natural Spelling Errors Emulation across Multiple Domains and Languages | Nikita Martynov, Mark Baushenko, Anastasia Kozlova, Katerina Kolomeytseva, Aleksandr Abramov, Alena Fenogenova | - | 40 2 0 | 18 Aug 2023
Quantifying the Dissimilarity of Texts | Benjamin Shade, E. Altmann | - | 35 1 0 | 03 May 2023
Extension of Dictionary-Based Compression Algorithms for the Quantitative Visualization of Patterns from Log Files | Igor Cherepanov, Jonathan Geraldi Joewono, Arjan Kuijper, Jörn Kohlhammer | - | 15 0 0 | 10 Apr 2023
Call for Papers -- The BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus | Alex Warstadt, Leshem Choshen, Aaron Mueller, Adina Williams, Ethan Gotlieb Wilcox, Chengxu Zhuang | - | 27 54 0 | 27 Jan 2023
PART: Pre-trained Authorship Representation Transformer | Javier Huertas-Tato, Álvaro Huertas-García, Alejandro Martín | - | 35 8 0 | 30 Sep 2022
On the State of the Art in Authorship Attribution and Authorship Verification | Jacob Tyo, Bhuwan Dhingra, Zachary Chase Lipton | - | 42 23 0 | 14 Sep 2022
Controllable Data Generation by Deep Learning: A Review | Shiyu Wang, Yuanqi Du, Xiaojie Guo, Bo Pan, Zhaohui Qin, Liang Zhao | - | 33 28 0 | 19 Jul 2022
Risks of AI Foundation Models in Education | Su Lin Blodgett, Michael A. Madaio | UQCV | 29 14 0 | 19 Oct 2021
A Statistical Model of Word Rank Evolution | Alex John Quijano, Rick Dale, Suzanne S. Sindi | - | 30 0 0 | 21 Jul 2021
Critical Thinking for Language Models | Gregor Betz, Christian Voigt, Kyle Richardson | SyDa, ReLM, LRM, AI4CE | 26 35 0 | 15 Sep 2020
Storywrangler: A massive exploratorium for sociolinguistic, cultural, socioeconomic, and political timelines using Twitter | T. Alshaabi, J. L. Adams, M. V. Arnold, J. Minot, D. R. Dewhurst, A. J. Reagan, C. Danforth, P. Dodds | - | 21 41 0 | 25 Jul 2020