ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2101.00027
  4. Cited By
The Pile: An 800GB Dataset of Diverse Text for Language Modeling

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

31 December 2020
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
Charles Foster
Jason Phang
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
    AIMat
ArXivPDFHTML

Papers citing "The Pile: An 800GB Dataset of Diverse Text for Language Modeling"

50 / 360 papers shown
Title
Efficient Unstructured Pruning of Mamba State-Space Models for Resource-Constrained Environments
Efficient Unstructured Pruning of Mamba State-Space Models for Resource-Constrained Environments
Ibne Farabi Shihab
Sanjeda Akter
Anuj Sharma
Mamba
48
0
0
13 May 2025
Learning Dynamics in Continual Pre-Training for Large Language Models
Learning Dynamics in Continual Pre-Training for Large Language Models
Xingjin Wang
Howe Tissue
Lu Wang
Linjing Li
D. Zeng
CLL
29
0
0
12 May 2025
Patterns and Mechanisms of Contrastive Activation Engineering
Patterns and Mechanisms of Contrastive Activation Engineering
Yixiong Hao
Ayush Panda
Stepan Shabalin
Sheikh Abdur Raheem Ali
LLMSV
60
0
0
06 May 2025
Incentivizing Inclusive Contributions in Model Sharing Markets
Incentivizing Inclusive Contributions in Model Sharing Markets
Enpei Zhang
Jingyi Chai
Rui Ye
Yanfeng Wang
Siheng Chen
TDI
FedML
134
0
0
05 May 2025
Demystifying optimized prompts in language models
Demystifying optimized prompts in language models
Rimon Melamed
Lucas H. McCabe
H. H. Huang
39
0
0
04 May 2025
ReCIT: Reconstructing Full Private Data from Gradient in Parameter-Efficient Fine-Tuning of Large Language Models
ReCIT: Reconstructing Full Private Data from Gradient in Parameter-Efficient Fine-Tuning of Large Language Models
Jin Xie
Ruishi He
Songze Li
Xiaojun Jia
Shouling Ji
SILM
AAML
66
0
0
29 Apr 2025
Multimodal Large Language Models for Medicine: A Comprehensive Survey
Multimodal Large Language Models for Medicine: A Comprehensive Survey
Jiarui Ye
Hao Tang
LM&MA
89
0
0
29 Apr 2025
Llama-3.1-FoundationAI-SecurityLLM-Base-8B Technical Report
Llama-3.1-FoundationAI-SecurityLLM-Base-8B Technical Report
Paul Kassianik
Baturay Saglam
Alexander Chen
Blaine Nelson
Anu Vellore
...
Hyrum Anderson
Kojin Oshiba
Omar Santos
Yaron Singer
Amin Karbasi
PILM
61
0
0
28 Apr 2025
Modes of Sequence Models and Learning Coefficients
Modes of Sequence Models and Learning Coefficients
Zhongtian Chen
Daniel Murfet
82
1
0
25 Apr 2025
Studying Small Language Models with Susceptibilities
Studying Small Language Models with Susceptibilities
Garrett Baker
George Wang
Jesse Hoogland
Daniel Murfet
AAML
75
1
0
25 Apr 2025
AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine
AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine
Carlo Siebenschuh
Kyle Hippe
Ozan Gokdemir
Alexander Brace
A. Khan
...
V. Vishwanath
R. Stevens
Arvind Ramanathan
Ian Foster
Robert Underwood
MoE
44
0
0
23 Apr 2025
QuaDMix: Quality-Diversity Balanced Data Selection for Efficient LLM Pretraining
QuaDMix: Quality-Diversity Balanced Data Selection for Efficient LLM Pretraining
Fengze Liu
Weidong Zhou
Binbin Liu
Zhimiao Yu
Yifan Zhang
...
Yifeng Yu
Bingni Zhang
Xiaohuan Zhou
Taifeng Wang
Yong Cao
55
0
0
23 Apr 2025
Do It For Me vs. Do It With Me: Investigating User Perceptions of Different Paradigms of Automation in Copilots for Feature-Rich Software
Do It For Me vs. Do It With Me: Investigating User Perceptions of Different Paradigms of Automation in Copilots for Feature-Rich Software
Anjali Khurana
Xiaotian Su
April Yi Wang
Parmit K. Chilana
33
0
0
22 Apr 2025
RepliBench: Evaluating the Autonomous Replication Capabilities of Language Model Agents
RepliBench: Evaluating the Autonomous Replication Capabilities of Language Model Agents
Sid Black
Asa Cooper Stickland
Jake Pencharz
Oliver Sourbut
Michael Schmatz
Jay Bailey
Ollie Matthews
Ben Millwood
Alex Remedios
Alan Cooney
ELM
134
0
0
21 Apr 2025
Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens
Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens
Kaihang Pan
Wang Lin
Zhongqi Yue
Tenglong Ao
Liyu Jia
Wei Zhao
Juncheng Billy Li
Siliang Tang
Hanwang Zhang
46
2
0
20 Apr 2025
Quantization Error Propagation: Revisiting Layer-Wise Post-Training Quantization
Quantization Error Propagation: Revisiting Layer-Wise Post-Training Quantization
Yamato Arai
Yuma Ichikawa
MQ
29
0
0
13 Apr 2025
Not All Data Are Unlearned Equally
Not All Data Are Unlearned Equally
Aravind Krishnan
Siva Reddy
Marius Mosbach
MU
136
0
0
07 Apr 2025
Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models
Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models
Ruikang Liu
Yuxuan Sun
Manyi Zhang
Haoli Bai
Xianzhi Yu
Tiezheng Yu
C. Yuan
Lu Hou
MQ
LRM
31
5
0
07 Apr 2025
A Perplexity and Menger Curvature-Based Approach for Similarity Evaluation of Large Language Models
A Perplexity and Menger Curvature-Based Approach for Similarity Evaluation of Large Language Models
Yuantao Zhang
Zhankui Yang
AAML
35
0
0
05 Apr 2025
ToReMi: Topic-Aware Data Reweighting for Dynamic Pre-Training Data Selection
ToReMi: Topic-Aware Data Reweighting for Dynamic Pre-Training Data Selection
Xiaoxuan Zhu
Zhouhong Gu
Baiqian Wu
Suhang Zheng
Tao Wang
Tianyu Li
Hongwei Feng
Yanghua Xiao
40
0
0
01 Apr 2025
Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models
Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models
Hung-Yueh Chiang
Chi-chih Chang
N. Frumkin
Kai-Chiang Wu
Mohamed S. Abdelfattah
Diana Marculescu
MQ
131
0
0
28 Mar 2025
Adaptive Rank Allocation: Speeding Up Modern Transformers with RaNA Adapters
Adaptive Rank Allocation: Speeding Up Modern Transformers with RaNA Adapters
Roberto Garcia
Jerry Liu
Daniel Sorvisto
Sabri Eyuboglu
90
0
0
23 Mar 2025
The Lucie-7B LLM and the Lucie Training Dataset: Open resources for multilingual language generation
The Lucie-7B LLM and the Lucie Training Dataset: Open resources for multilingual language generation
Olivier Gouvert
Julie Hunter
Jérôme Louradour
Christophe Cerisara
Evan Dufraisse
Yaya Sy
Laura Rivière
Jean-Pierre Lorré
OpenLLM-France community
150
0
0
15 Mar 2025
I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data?
I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data?
Yuhang Liu
Dong Gong
Erdun Gao
Zhen Zhang
Biwei Huang
Mingming Gong
Anton van den Hengel
Javen Qinfeng Shi
J. Shi
148
0
0
12 Mar 2025
From Idea to Implementation: Evaluating the Influence of Large Language Models in Software Development -- An Opinion Paper
From Idea to Implementation: Evaluating the Influence of Large Language Models in Software Development -- An Opinion Paper
Sargam Yadav
Asifa Mehmood Qureshi
Abhishek Kaushik
Shubham Sharma
Roisin Loughran
...
. Nikhil Singh
Padraic O'Hara
Pranay Jaiswal
Roshan Chandru
David Lillis
56
1
0
10 Mar 2025
InfoSEM: A Deep Generative Model with Informative Priors for Gene Regulatory Network Inference
Tianyu Cui
Song-Jun Xu
Artem Moskalev
Shuwei Li
Tommaso Mansi
Mangal Prakash
Rui Liao
BDL
71
1
0
06 Mar 2025
Position: Model Collapse Does Not Mean What You Think
Position: Model Collapse Does Not Mean What You Think
Rylan Schaeffer
Joshua Kazdan
Alvan Caleb Arulandu
Sanmi Koyejo
60
0
0
05 Mar 2025
M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance
M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance
Qingpei Guo
Kaiyou Song
Zipeng Feng
Ziping Ma
Qinglong Zhang
...
Yunxiao Sun
Tai-WeiChang
Jingdong Chen
Ming Yang
Jun Zhou
MLLM
VLM
82
3
0
26 Feb 2025
FADE: Why Bad Descriptions Happen to Good Features
FADE: Why Bad Descriptions Happen to Good Features
Bruno Puri
Aakriti Jain
Elena Golimblevskaia
Patrick Kahardipraja
Thomas Wiegand
Wojciech Samek
Sebastian Lapuschkin
130
0
0
24 Feb 2025
Unveiling Reasoning Thresholds in Language Models: Scaling, Fine-Tuning, and Interpretability through Attention Maps
Unveiling Reasoning Thresholds in Language Models: Scaling, Fine-Tuning, and Interpretability through Attention Maps
Yen-Che Hsiao
Abhishek Dutta
LRM
ReLM
ELM
61
0
0
24 Feb 2025
Tokenization is Sensitive to Language Variation
Tokenization is Sensitive to Language Variation
Anna Wegmann
Dong Nguyen
David Jurgens
77
1
0
24 Feb 2025
LongAttn: Selecting Long-context Training Data via Token-level Attention
LongAttn: Selecting Long-context Training Data via Token-level Attention
Longyun Wu
Dawei Zhu
Guangxiang Zhao
Zhuocheng Yu
Junfeng Ran
Xiangyu Wong
Lin Sun
Sujian Li
41
0
0
24 Feb 2025
Beyond Release: Access Considerations for Generative AI Systems
Beyond Release: Access Considerations for Generative AI Systems
Irene Solaiman
Rishi Bommasani
Dan Hendrycks
Ariel Herbert-Voss
Yacine Jernite
Aviya Skowron
Andrew Trask
60
1
0
23 Feb 2025
Multilingual Language Model Pretraining using Machine-translated Data
Multilingual Language Model Pretraining using Machine-translated Data
Jiayi Wang
Yao Lu
Maurice Weber
Max Ryabinin
David Ifeoluwa Adelani
Yihong Chen
Raphael Tang
Pontus Stenetorp
LRM
75
2
0
20 Feb 2025
SearchRAG: Can Search Engines Be Helpful for LLM-based Medical Question Answering?
SearchRAG: Can Search Engines Be Helpful for LLM-based Medical Question Answering?
Yucheng Shi
Tianze Yang
Canyu Chen
Quanzheng Li
Tianming Liu
X. Li
Ninghao Liu
MedIm
46
2
0
18 Feb 2025
TinyEmo: Scaling down Emotional Reasoning via Metric Projection
TinyEmo: Scaling down Emotional Reasoning via Metric Projection
Cristian Gutierrez
LRM
62
0
0
17 Feb 2025
Prediction hubs are context-informed frequent tokens in LLMs
Prediction hubs are context-informed frequent tokens in LLMs
Beatrix M. G. Nielsen
Iuri Macocco
Marco Baroni
126
1
0
17 Feb 2025
Associative Recurrent Memory Transformer
Associative Recurrent Memory Transformer
Ivan Rodkin
Yuri Kuratov
Aydar Bulatov
Mikhail Burtsev
68
2
0
17 Feb 2025
MUDDFormer: Breaking Residual Bottlenecks in Transformers via Multiway Dynamic Dense Connections
MUDDFormer: Breaking Residual Bottlenecks in Transformers via Multiway Dynamic Dense Connections
Da Xiao
Qingye Meng
Shengping Li
Xingyuan Yuan
MoE
AI4CE
58
1
0
13 Feb 2025
Do we really have to filter out random noise in pre-training data for language models?
Do we really have to filter out random noise in pre-training data for language models?
Jinghan Ru
Yuxin Xie
Xianwei Zhuang
Yuguo Yin
Zhihui Guo
Zhiming Liu
Qianli Ren
Yuexian Zou
83
2
0
10 Feb 2025
Uni-Retrieval: A Multi-Style Retrieval Framework for STEM's Education
Uni-Retrieval: A Multi-Style Retrieval Framework for STEM's Education
Yanhao Jia
Xinyi Wu
Hao Li
Qinglin Zhang
Yuxiao Hu
Shuai Zhao
Wenqi Fan
46
2
0
09 Feb 2025
Fine-Tuned LLMs are "Time Capsules" for Tracking Societal Bias Through Books
Fine-Tuned LLMs are "Time Capsules" for Tracking Societal Bias Through Books
Sangmitra Madhusudan
Robert D Morabito
Skye Reid
Nikta Gohari Sadr
Ali Emami
56
0
0
07 Feb 2025
Evaluating Small Language Models for News Summarization: Implications and Factors Influencing Performance
Evaluating Small Language Models for News Summarization: Implications and Factors Influencing Performance
Borui Xu
Yao Chen
Zeyi Wen
Weiguo Liu
Bingsheng He
73
1
0
02 Feb 2025
Mamba-Shedder: Post-Transformer Compression for Efficient Selective Structured State Space Models
Mamba-Shedder: Post-Transformer Compression for Efficient Selective Structured State Space Models
J. P. Muñoz
Jinjie Yuan
Nilesh Jain
Mamba
70
1
0
28 Jan 2025
A Survey of Large Language Models for Healthcare: from Data, Technology, and Applications to Accountability and Ethics
A Survey of Large Language Models for Healthcare: from Data, Technology, and Applications to Accountability and Ethics
Kai He
Rui Mao
Qika Lin
Yucheng Ruan
Xiang Lan
Mengling Feng
Erik Cambria
LM&MA
AILaw
93
153
0
28 Jan 2025
SLoPe: Double-Pruned Sparse Plus Lazy Low-Rank Adapter Pretraining of LLMs
SLoPe: Double-Pruned Sparse Plus Lazy Low-Rank Adapter Pretraining of LLMs
Mohammad Mozaffari
Amir Yazdanbakhsh
Zhao Zhang
M. Dehnavi
78
5
0
28 Jan 2025
Merino: Entropy-driven Design for Generative Language Models on IoT Devices
Merino: Entropy-driven Design for Generative Language Models on IoT Devices
Youpeng Zhao
Ming Lin
Huadong Tang
Qiang Wu
Jun Wang
75
0
0
28 Jan 2025
LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion
LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion
Zhan Ling
Kang Liu
Kai Yan
Y. Yang
Weijian Lin
Ting-Han Fan
Lingfeng Shen
Zhengyin Du
Jiecao Chen
ReLM
ELM
LRM
44
3
0
25 Jan 2025
Synthetic Data Can Mislead Evaluations: Membership Inference as Machine Text Detection
Synthetic Data Can Mislead Evaluations: Membership Inference as Machine Text Detection
Ali Naseh
Niloofar Mireshghallah
53
0
0
20 Jan 2025
The Geometry of Tokens in Internal Representations of Large Language Models
The Geometry of Tokens in Internal Representations of Large Language Models
Karthik Viswanathan
Yuri Gardinazzi
Giada Panerai
Alberto Cazzaniga
Matteo Biagetti
AIFin
88
4
0
17 Jan 2025
12345678
Next