1812.08092
Cited By
A standardized Project Gutenberg corpus for statistical analysis of natural language and quantitative linguistics
19 December 2018
Martin Gerlach
Francesc Font-Clos
Papers citing "A standardized Project Gutenberg corpus for statistical analysis of natural language and quantitative linguistics"
20 / 20 papers shown
Title
Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora
Alex Warstadt
Aaron Mueller
Leshem Choshen
E. Wilcox
Chengxu Zhuang
...
Rafael Mosquera
Bhargavi Paranjape
Adina Williams
Tal Linzen
Ryan Cotterell
51
110
0
10 Apr 2025
BERTtime Stories: Investigating the Role of Synthetic Story Data in Language Pre-training
Nikitas Theodoropoulos
Giorgos Filandrianos
Vassilis Lyberatos
Maria Lymperaiou
Giorgos Stamou
SyDa
62
1
0
24 Feb 2025
BabyLM Turns 3: Call for papers for the 2025 BabyLM workshop
Lucas Charpentier
Leshem Choshen
Ryan Cotterell
Mustafa Omer Gul
Michael Y. Hu
...
Candace Ross
Raj Sanjay Shah
Alex Warstadt
Ethan Gotlieb Wilcox
Adina Williams
57
2
0
15 Feb 2025
Is a Peeled Apple Still Red? Evaluating LLMs' Ability for Conceptual Combination with Property Type
Seokwon Song
Taehyun Lee
Jaewoo Ahn
Jae Hyuk Sung
Gunhee Kim
CoGe
90
0
0
10 Feb 2025
A Distributional Perspective on Word Learning in Neural Language Models
Filippo Ficarra
Ryan Cotterell
Alex Warstadt
58
1
0
09 Feb 2025
BOUQuET: dataset, Benchmark and Open initiative for Universal Quality Evaluation in Translation
Omnilingual MT Team
Pierre Yves Andrews
Mikel Artetxe
Mariano Coria Meglioli
Marta R. Costa-jussá
...
Eduardo Sánchez
Ioannis Tsiamas
Arina Turkatenko
Albert Ventayol-Boada
Shireen Yates
113
0
0
06 Feb 2025
From Tokens to Words: On the Inner Lexicon of LLMs
Guy Kaplan
Matanel Oren
Yuval Reif
Roy Schwartz
61
14
0
08 Oct 2024
BAMBINO-LM: (Bilingual-)Human-Inspired Continual Pretraining of BabyLM
Zhewen Shen
Aditya Joshi
Ruey-Cheng Chen
CLL
52
2
0
17 Jun 2024
Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory
Xueyan Niu
Bo Bai
Lei Deng
Wei Han
44
6
0
14 May 2024
A Methodology for Generative Spelling Correction via Natural Spelling Errors Emulation across Multiple Domains and Languages
Nikita Martynov
Mark Baushenko
Anastasia Kozlova
Katerina Kolomeytseva
Aleksandr Abramov
Alena Fenogenova
40
2
0
18 Aug 2023
Quantifying the Dissimilarity of Texts
Benjamin Shade
E. Altmann
35
1
0
03 May 2023
Extension of Dictionary-Based Compression Algorithms for the Quantitative Visualization of Patterns from Log Files
Igor Cherepanov
Jonathan Geraldi Joewono
Arjan Kuijper
Jörn Kohlhammer
15
0
0
10 Apr 2023
Call for Papers -- The BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus
Alex Warstadt
Leshem Choshen
Aaron Mueller
Adina Williams
Ethan Gotlieb Wilcox
Chengxu Zhuang
27
54
0
27 Jan 2023
PART: Pre-trained Authorship Representation Transformer
Javier Huertas-Tato
Álvaro Huertas-García
Alejandro Martín
35
8
0
30 Sep 2022
On the State of the Art in Authorship Attribution and Authorship Verification
Jacob Tyo
Bhuwan Dhingra
Zachary Chase Lipton
42
23
0
14 Sep 2022
Controllable Data Generation by Deep Learning: A Review
Shiyu Wang
Yuanqi Du
Xiaojie Guo
Bo Pan
Zhaohui Qin
Liang Zhao
33
28
0
19 Jul 2022
Risks of AI Foundation Models in Education
Su Lin Blodgett
Michael A. Madaio
UQCV
29
14
0
19 Oct 2021
A Statistical Model of Word Rank Evolution
Alex John Quijano
Rick Dale
Suzanne S. Sindi
30
0
0
21 Jul 2021
Critical Thinking for Language Models
Gregor Betz
Christian Voigt
Kyle Richardson
SyDa
ReLM
LRM
AI4CE
26
35
0
15 Sep 2020
Storywrangler: A massive exploratorium for sociolinguistic, cultural, socioeconomic, and political timelines using Twitter
T. Alshaabi
J. L. Adams
M. V. Arnold
J. Minot
D. R. Dewhurst
A. J. Reagan
C. Danforth
P. Dodds
21
41
0
25 Jul 2020