Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2306.01116
Cited By
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
1 June 2023
Guilherme Penedo
Quentin Malartic
Daniel Hesslow
Ruxandra-Aimée Cojocaru
Alessandro Cappelli
Hamza Alobeidli
B. Pannier
Ebtesam Almazrouei
Julien Launay
Re-assign community
ArXiv
PDF
HTML
Papers citing
"The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only"
50 / 587 papers shown
Title
Not All Documents Are What You Need for Extracting Instruction Tuning Data
Chi Zhang
Huaping Zhong
Hongtao Li
Chengliang Chai
Jiawei Hong
...
Jiantao Qiu
Ye Yuan
Guoren Wang
Zeang Sheng
Lei Cao
SyDa
12
0
0
18 May 2025
SubGCache: Accelerating Graph-based RAG with Subgraph-level KV Cache
Qiuyu Zhu
Liang Zhang
Qianxiong Xu
Cheng Long
Jie Zhang
22
0
0
16 May 2025
MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining
Xiaomi LLM-Core Team
Bingquan Xia
B. S.
Cici
Dawei Zhu
...
Yishuo Wang
Yue Yu
Zhenru Lin
Zhichao Song
Zihao Yue
MoE
ReLM
LRM
AI4CE
48
1
0
12 May 2025
Ultra-FineWeb: Efficient Data Filtering and Verification for High-Quality LLM Training Data
Yishuo Wang
Z. Fu
Jie Cai
Peijun Tang
Hongya Lyu
...
Jie Zhou
Guoyang Zeng
Chaojun Xiao
Xu Han
Zhiyuan Liu
49
0
0
08 May 2025
Reliably Bounding False Positives: A Zero-Shot Machine-Generated Text Detection Framework via Multiscaled Conformal Prediction
Xiaowei Zhu
Yubing Ren
Yanan Cao
Xixun Lin
Fang Fang
Yangxi Li
45
0
0
08 May 2025
Aligning Large Language Models with Healthcare Stakeholders: A Pathway to Trustworthy AI Integration
Kexin Ding
Mu Zhou
Akshay Chaudhari
Shaoting Zhang
Dimitris N. Metaxas
LM&MA
43
0
0
02 May 2025
Advancing Embodied Agent Security: From Safety Benchmarks to Input Moderation
Ning Wang
Zihan Yan
W. Li
Chuan Ma
H. Chen
Tao Xiang
AAML
45
0
0
22 Apr 2025
Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models
Xinlin Zhuang
Jiahui Peng
Ren Ma
Yucheng Wang
Tianyi Bai
Xingjian Wei
Jiantao Qiu
Chi Zhang
Ying Qian
Conghui He
53
0
0
19 Apr 2025
MAIN: Mutual Alignment Is Necessary for instruction tuning
Fanyi Yang
Jianfeng Liu
Xinsong Zhang
Haoyu Liu
Xixin Cao
Yuefeng Zhan
H. Sun
Weiwei Deng
Feng Sun
Qi Zhang
ALM
27
0
0
17 Apr 2025
ViClaim: A Multilingual Multilabel Dataset for Automatic Claim Detection in Videos
Patrick Giedemann
Pius von Daniken
Jan Deriu
Álvaro Rodrigo
Anselmo Peñas
Mark Cieliebak
37
0
0
17 Apr 2025
Why and How LLMs Hallucinate: Connecting the Dots with Subsequence Associations
Yiyou Sun
Y. Gai
Lijie Chen
Abhilasha Ravichander
Yejin Choi
D. Song
HILM
57
0
0
17 Apr 2025
Watermarking Needs Input Repetition Masking
David Khachaturov
Robert D. Mullins
Ilia Shumailov
Sumanth Dathathri
WaLM
60
0
0
16 Apr 2025
DataDecide: How to Predict Best Pretraining Data with Small Experiments
Ian H. Magnusson
Nguyen Tai
Ben Bogin
David Heineman
Jena D. Hwang
...
Dirk Groeneveld
Oyvind Tafjord
Noah A. Smith
Pang Wei Koh
Jesse Dodge
ALM
37
0
0
15 Apr 2025
Can Performant LLMs Be Ethical? Quantifying the Impact of Web Crawling Opt-Outs
Dongyang Fan
Vinko Sabolčec
Matin Ansaripour
Ayush Kumar Tarun
Martin Jaggi
Antoine Bosselut
Imanol Schlag
39
0
0
08 Apr 2025
The H-Elena Trojan Virus to Infect Model Weights: A Wake-Up Call on the Security Risks of Malicious Fine-Tuning
Virilo Tejedor
Cristina Zuheros
Carlos Peláez-González
David Herrera-Poyatos
Andrés Herrera-Poyatos
F. Herrera
29
0
0
04 Apr 2025
MegaMath: Pushing the Limits of Open Math Corpora
Fan Zhou
Zengzhi Wang
Nikhil Ranjan
Zhoujun Cheng
Liping Tang
Guowei He
Zhengzhong Liu
Eric P. Xing
LRM
51
1
0
03 Apr 2025
Robustly identifying concepts introduced during chat fine-tuning using crosscoders
Julian Minder
Clement Dumas
Caden Juang
Bilal Chugtai
Neel Nanda
29
0
0
03 Apr 2025
TiC-LM: A Web-Scale Benchmark for Time-Continual LLM Pretraining
Jeffrey Li
Mohammadreza Armandpour
Iman Mirzadeh
Sachin Mehta
Vaishaal Shankar
...
Samy Bengio
Oncel Tuzel
Mehrdad Farajtabar
Hadi Pouransari
Fartash Faghri
CLL
KELM
61
0
0
02 Apr 2025
ToReMi: Topic-Aware Data Reweighting for Dynamic Pre-Training Data Selection
Xiaoxuan Zhu
Zhouhong Gu
Baiqian Wu
Suhang Zheng
Tao Wang
Tianyu Li
Hongwei Feng
Yanghua Xiao
42
0
0
01 Apr 2025
LLMs for Explainable AI: A Comprehensive Survey
Ahsan Bilal
David Ebert
Beiyu Lin
72
1
0
31 Mar 2025
The Lucie-7B LLM and the Lucie Training Dataset: Open resources for multilingual language generation
Olivier Gouvert
Julie Hunter
Jérôme Louradour
Christophe Cerisara
Evan Dufraisse
Yaya Sy
Laura Rivière
Jean-Pierre Lorré
OpenLLM-France community
185
0
0
15 Mar 2025
SCE: Scalable Consistency Ensembles Make Blackbox Large Language Model Generation More Reliable
Jiaxin Zhang
Zechao Li
Wendi Cui
Kamalika Das
Bradley Malin
Sricharan Kumar
46
0
0
13 Mar 2025
Datasets, Documents, and Repetitions: The Practicalities of Unequal Data Quality
Alex Fang
Hadi Pouransari
Matt Jordan
Alexander Toshev
Vaishaal Shankar
Ludwig Schmidt
Tom Gunter
74
0
0
10 Mar 2025
Towards Data-Efficient Language Models: A Child-Inspired Approach to Language Learning
Mohammad Amin Ghanizadeh
Mohammad Javad Dousti
48
0
0
06 Mar 2025
Robust Data Watermarking in Language Models by Injecting Fictitious Knowledge
Xinyue Cui
Johnny Tian-Zheng Wei
Swabha Swayamdipta
Robin Jia
WaLM
91
1
0
06 Mar 2025
ALinFiK: Learning to Approximate Linearized Future Influence Kernel for Scalable Third-Party LLM Data Valuation
Yanzhou Pan
Huawei Lin
Yide Ran
Jiamin Chen
Xiaodong Yu
Weijie Zhao
Denghui Zhang
Zhaozhuo Xu
40
1
0
02 Mar 2025
Chitranuvad: Adapting Multi-Lingual LLMs for Multimodal Translation
Shaharukh Khan
Ayush Tarun
Ali Faraz
Palash Kamble
Vivek Dahiya
Praveen Kumar Pokala
Ashish Kulkarni
Chandra Khatri
Abhinav Ravi
Shubham Agarwal
163
0
0
27 Feb 2025
(Mis)Fitting: A Survey of Scaling Laws
Margaret Li
Sneha Kudugunta
Luke Zettlemoyer
69
2
0
26 Feb 2025
olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models
Jake Poznanski
Jon Borchardt
Jason Dunkelberger
Regan Huff
Daniel Lin
Aman Rangapur
Christopher Wilhelm
Kyle Lo
Luca Soldaini
97
0
0
25 Feb 2025
Unsupervised Topic Models are Data Mixers for Pre-training Language Models
Jiahui Peng
Xinlin Zhuang
Qiu Jiantao
Ren Ma
Jing Yu
Tianyi Bai
Zeang Sheng
41
1
0
24 Feb 2025
LongAttn: Selecting Long-context Training Data via Token-level Attention
Longyun Wu
Dawei Zhu
Guangxiang Zhao
Zhuocheng Yu
Junfeng Ran
Xiangyu Wong
Lin Sun
Sujian Li
48
0
0
24 Feb 2025
UrduLLaMA 1.0: Dataset Curation, Preprocessing, and Evaluation in Low-Resource Settings
Layba Fiaz
Munief Hassan Tahir
Sana Shams
Sarmad Hussain
51
0
0
24 Feb 2025
TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking
Shahriar Kabir Nahin
R. N. Nandi
Sagor Sarker
Quazi Sarwar Muhtaseem
Md. Kowsher
Apu Chandraw Shill
Md Ibrahim
Mehadi Hasan Menon
Tareq Al Muntasir
Firoj Alam
68
0
0
24 Feb 2025
Unveiling Reasoning Thresholds in Language Models: Scaling, Fine-Tuning, and Interpretability through Attention Maps
Yen-Che Hsiao
Abhishek Dutta
LRM
ReLM
ELM
66
0
0
24 Feb 2025
BA-LoRA: Bias-Alleviating Low-Rank Adaptation to Mitigate Catastrophic Inheritance in Large Language Models
Yupeng Chang
Yi-Ju Chang
Yuan Wu
AI4CE
ALM
95
0
0
24 Feb 2025
Chitrarth: Bridging Vision and Language for a Billion People
Shaharukh Khan
Ayush Tarun
Abhinav Ravi
Ali Faraz
Akshat Patidar
Praveen Kumar Pokala
Anagha Bhangare
Raja Kolla
Chandra Khatri
Shubham Agarwal
VLM
126
1
0
21 Feb 2025
GneissWeb: Preparing High Quality Data for LLMs at Scale
Hajar Emami-Gohari
S. Kadhe
Syed Yousaf Shah. Constantin Adam
Abdulhamid A. Adebayo
Praneet Adusumilli
...
Issei Yoshida
Syed Zawad
Petros Zerfos
Yi Zhou
Bishwaranjan Bhattacharjee
52
1
0
19 Feb 2025
TinyEmo: Scaling down Emotional Reasoning via Metric Projection
Cristian Gutierrez
LRM
69
0
0
17 Feb 2025
What Are They Filtering Out? A Survey of Filtering Strategies for Harm Reduction in Pretraining Datasets
Marco Antonio Stranisci
Christian Hardmeier
60
0
0
17 Feb 2025
Valuable Hallucinations: Realizable Non-realistic Propositions
Qiucheng Chen
Bo Wang
LRM
59
0
0
16 Feb 2025
Smoothing Out Hallucinations: Mitigating LLM Hallucination with Smoothed Knowledge Distillation
Hieu Nguyen
Zihao He
Shoumik Atul Gandre
Ujjwal Pasupulety
Sharanya Kumari Shivakumar
Kristina Lerman
HILM
59
1
0
16 Feb 2025
Matina: A Large-Scale 73B Token Persian Text Corpus
Sara Bourbour Hosseinbeigi
Fatemeh Taherinezhad
Heshaam Faili
Hamed Baghbani
Fatemeh Nadi
Mostafa Amiri
76
0
0
13 Feb 2025
Dynamic Loss-Based Sample Reweighting for Improved Large Language Model Pretraining
Daouda Sow
Herbert Woisetschläger
Saikiran Bulusu
Shiqiang Wang
Hans-Arno Jacobsen
Yingbin Liang
59
0
0
10 Feb 2025
Quantifying Correlations of Machine Learning Models
Yuanyuan Li
Neeraj Sarna
Yang Lin
85
0
0
06 Feb 2025
Mass-Editing Memory with Attention in Transformers: A cross-lingual exploration of knowledge
Daniel Tamayo
Aitor Gonzalez-Agirre
Javier Hernando
Marta Villegas
KELM
93
3
0
04 Feb 2025
CE-LoRA: Computation-Efficient LoRA Fine-Tuning for Language Models
Guanduo Chen
Yutong He
Yipeng Hu
Kun Yuan
Binhang Yuan
54
0
0
03 Feb 2025
A Survey of Large Language Models for Healthcare: from Data, Technology, and Applications to Accountability and Ethics
Kai He
Rui Mao
Qika Lin
Yucheng Ruan
Xiang Lan
Mengling Feng
Min Zhang
LM&MA
AILaw
93
154
0
28 Jan 2025
LLM4DistReconfig: A Fine-tuned Large Language Model for Power Distribution Network Reconfiguration
Panayiotis Christou
Md. Zahidul Islam
Yuzhang Lin
Jingwei Xiong
35
0
0
24 Jan 2025
Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models
Junyu Chen
Han Cai
Junsong Chen
E. Xie
Shang Yang
Haotian Tang
Muyang Li
Yunfan LU
Song Han
DiffM
69
36
0
20 Jan 2025
Rethinking Post-Training Quantization: Introducing a Statistical Pre-Calibration Approach
Alireza Ghaffari
Sharareh Younesian
Boxing Chen
Vahid Partovi Nia
M. Asgharian
MQ
63
0
0
17 Jan 2025
1
2
3
4
...
10
11
12
Next