ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2306.01116
  4. Cited By
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora
  with Web Data, and Web Data Only

The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

1 June 2023
Guilherme Penedo
Quentin Malartic
Daniel Hesslow
Ruxandra-Aimée Cojocaru
Alessandro Cappelli
Hamza Alobeidli
B. Pannier
Ebtesam Almazrouei
Julien Launay
ArXiv (abs)PDFHTML

Papers citing "The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only"

50 / 131 papers shown
Title
Essential-Web v1.0: 24T tokens of organized web data
Essential-Web v1.0: 24T tokens of organized web data
Essential AI
Andrew Hojel
Michael Pust
Tim Romanski
Yash Vanjani
...
Platon Mazarakis
Saad Jamal
Saurabh Srivastava
Somanshu Singla
Ashish Vaswani
15
0
0
17 Jun 2025
Improved Supervised Fine-Tuning for Large Language Models to Mitigate Catastrophic Forgetting
Improved Supervised Fine-Tuning for Large Language Models to Mitigate Catastrophic Forgetting
Fei Ding
Baiqiao Wang
CLL
91
0
0
11 Jun 2025
GigaChat Family: Efficient Russian Language Modeling Through Mixture of Experts Architecture
GigaChat Family: Efficient Russian Language Modeling Through Mixture of Experts Architecture
GigaChat team
Mamedov Valentin
Evgenii Kosarev
Gregory Leleytner
Ilya Shchuckin
...
Ruslan Gaitukiev
Arkadiy Shatenov
Alena Fenogenova
Nikita Savushkin
Fedor Minkin
85
0
0
11 Jun 2025
dots.llm1 Technical Report
dots.llm1 Technical Report
Bi Huo
Bin Tu
Cheng Qin
Da Zheng
Debing Zhang
...
Yuqiu Ji
Ze Wen
Zhenhai Liu
Zichao Li
Zilong Liao
MoE
47
0
0
06 Jun 2025
Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models
Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models
Thao Nguyen
Yang Li
O. Yu. Golovneva
Luke Zettlemoyer
Sewoong Oh
Ludwig Schmidt
Xian Li
OnRL
146
0
0
05 Jun 2025
AstroMLab 4: Benchmark-Topping Performance in Astronomy Q&A with a 70B-Parameter Domain-Specialized Reasoning Model
Tijmen de Haan
Yuan-Sen Ting
Tirthankar Ghosal
Tuan Dung Nguyen
Alberto Accomazzi
Emily Herron
Vanessa Lama
Boyao Wang
Azton Wells
Nesar Ramachandra
ALMELMAI4MHLRM
104
0
0
23 May 2025
SubGCache: Accelerating Graph-based RAG with Subgraph-level KV Cache
SubGCache: Accelerating Graph-based RAG with Subgraph-level KV Cache
Qiuyu Zhu
Liang Zhang
Qianxiong Xu
Cheng Long
Jie Zhang
99
0
0
16 May 2025
MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining
MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining
Xiaomi LLM-Core Team
Bingquan Xia
Bo Shen
Cici
Dawei Zhu
...
Yun Wang
Yue Yu
Zhenru Lin
Zhichao Song
Zihao Yue
MoEReLMLRMAI4CE
169
7
0
12 May 2025
Reliably Bounding False Positives: A Zero-Shot Machine-Generated Text Detection Framework via Multiscaled Conformal Prediction
Reliably Bounding False Positives: A Zero-Shot Machine-Generated Text Detection Framework via Multiscaled Conformal Prediction
Xiaowei Zhu
Yubing Ren
Yanan Cao
Xixun Lin
Fang Fang
Yangxi Li
190
0
0
08 May 2025
Advancing Embodied Agent Security: From Safety Benchmarks to Input Moderation
Advancing Embodied Agent Security: From Safety Benchmarks to Input Moderation
Ning Wang
Zihan Yan
W. Li
Chuan Ma
H. Chen
Tao Xiang
AAML
154
0
0
22 Apr 2025
Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models
Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models
Xinlin Zhuang
Jiahui Peng
Ren Ma
Yucheng Wang
Tianyi Bai
Xingjian Wei
Jiantao Qiu
Chi Zhang
Ying Qian
Conghui He
151
0
0
19 Apr 2025
Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning
Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning
Julian Minder
Clement Dumas
Caden Juang
Bilal Chugtai
Neel Nanda
172
1
0
03 Apr 2025
TiC-LM: A Web-Scale Benchmark for Time-Continual LLM Pretraining
TiC-LM: A Web-Scale Benchmark for Time-Continual LLM Pretraining
Jeffrey Li
Mohammadreza Armandpour
Iman Mirzadeh
Sachin Mehta
Vaishaal Shankar
...
Samy Bengio
Oncel Tuzel
Mehrdad Farajtabar
Hadi Pouransari
Fartash Faghri
CLLKELM
150
0
0
02 Apr 2025
ToReMi: Topic-Aware Data Reweighting for Dynamic Pre-Training Data Selection
ToReMi: Topic-Aware Data Reweighting for Dynamic Pre-Training Data Selection
Xiaoxuan Zhu
Zhouhong Gu
Baiqian Wu
Suhang Zheng
Tao Wang
Tianyu Li
Hongwei Feng
Yanghua Xiao
227
0
0
01 Apr 2025
The Lucie-7B LLM and the Lucie Training Dataset: Open resources for multilingual language generation
The Lucie-7B LLM and the Lucie Training Dataset: Open resources for multilingual language generation
Olivier Gouvert
Julie Hunter
Jérôme Louradour
Christophe Cerisara
Evan Dufraisse
Yaya Sy
Laura Rivière
Jean-Pierre Lorré
OpenLLM-France community
454
0
0
15 Mar 2025
ALinFiK: Learning to Approximate Linearized Future Influence Kernel for Scalable Third-Party LLM Data Valuation
ALinFiK: Learning to Approximate Linearized Future Influence Kernel for Scalable Third-Party LLM Data Valuation
Yanzhou Pan
Huawei Lin
Yide Ran
Jiamin Chen
Xiaodong Yu
Weijie Zhao
Denghui Zhang
Zhaozhuo Xu
116
1
0
02 Mar 2025
Chitranuvad: Adapting Multi-Lingual LLMs for Multimodal Translation
Chitranuvad: Adapting Multi-Lingual LLMs for Multimodal Translation
Shaharukh Khan
Ayush Tarun
Ali Faraz
Palash Kamble
Vivek Dahiya
Praveen Kumar Pokala
Ashish Kulkarni
Chandra Khatri
Abhinav Ravi
Shubham Agarwal
439
1
0
27 Feb 2025
olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models
olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models
Jake Poznanski
Aman Rangapur
Jon Borchardt
Jason Dunkelberger
Regan Huff
Daniel Lin
Aman Rangapur
Christopher Wilhelm
Kyle Lo
Luca Soldaini
174
7
0
25 Feb 2025
LongAttn: Selecting Long-context Training Data via Token-level Attention
LongAttn: Selecting Long-context Training Data via Token-level Attention
Longyun Wu
Dawei Zhu
Guangxiang Zhao
Zhuocheng Yu
Junfeng Ran
Xiangyu Wong
Lin Sun
Sujian Li
108
2
0
24 Feb 2025
UrduLLaMA 1.0: Dataset Curation, Preprocessing, and Evaluation in Low-Resource Settings
UrduLLaMA 1.0: Dataset Curation, Preprocessing, and Evaluation in Low-Resource Settings
Layba Fiaz
Munief Hassan Tahir
Sana Shams
Sarmad Hussain
95
0
0
24 Feb 2025
Unveiling Reasoning Thresholds in Language Models: Scaling, Fine-Tuning, and Interpretability through Attention Maps
Unveiling Reasoning Thresholds in Language Models: Scaling, Fine-Tuning, and Interpretability through Attention Maps
Yen-Che Hsiao
Abhishek Dutta
LRMReLMELM
116
0
0
24 Feb 2025
Unsupervised Topic Models are Data Mixers for Pre-training Language Models
Unsupervised Topic Models are Data Mixers for Pre-training Language Models
Jiahui Peng
Xinlin Zhuang
Qiu Jiantao
Ren Ma
Jing Yu
Tianyi Bai
Zeang Sheng
96
2
0
24 Feb 2025
GneissWeb: Preparing High Quality Data for LLMs at Scale
GneissWeb: Preparing High Quality Data for LLMs at Scale
Hajar Emami-Gohari
S. Kadhe
Syed Yousaf Shah. Constantin Adam
Abdulhamid A. Adebayo
Praneet Adusumilli
...
Issei Yoshida
Syed Zawad
Petros Zerfos
Yi Zhou
Bishwaranjan Bhattacharjee
66
1
0
19 Feb 2025
TinyEmo: Scaling down Emotional Reasoning via Metric Projection
TinyEmo: Scaling down Emotional Reasoning via Metric Projection
Cristian Gutierrez
LRM
265
0
0
17 Feb 2025
Smoothing Out Hallucinations: Mitigating LLM Hallucination with Smoothed Knowledge Distillation
Smoothing Out Hallucinations: Mitigating LLM Hallucination with Smoothed Knowledge Distillation
Hieu Nguyen
Zihao He
Shoumik Atul Gandre
Ujjwal Pasupulety
Sharanya Kumari Shivakumar
Kristina Lerman
HILM
128
2
0
16 Feb 2025
TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking
TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking
Shahriar Kabir Nahin
R. N. Nandi
Sagor Sarker
Quazi Sarwar Muhtaseem
Md. Kowsher
Apu Chandraw Shill
Md Ibrahim
Mehadi Hasan Menon
Tareq Al Muntasir
Firoj Alam
185
0
0
16 Feb 2025
Valuable Hallucinations: Realizable Non-realistic Propositions
Valuable Hallucinations: Realizable Non-realistic Propositions
Qiucheng Chen
Bo Wang
LRM
138
0
0
16 Feb 2025
Matina: A Large-Scale 73B Token Persian Text Corpus
Matina: A Large-Scale 73B Token Persian Text Corpus
Sara Bourbour Hosseinbeigi
Fatemeh Taherinezhad
Heshaam Faili
Hamed Baghbani
Fatemeh Nadi
Mostafa Amiri
164
0
0
13 Feb 2025
Quantifying Correlations of Machine Learning Models
Quantifying Correlations of Machine Learning Models
Yuanyuan Li
Neeraj Sarna
Yang Lin
177
0
0
06 Feb 2025
Mass-Editing Memory with Attention in Transformers: A cross-lingual exploration of knowledge
Mass-Editing Memory with Attention in Transformers: A cross-lingual exploration of knowledge
Daniel Tamayo
Aitor Gonzalez-Agirre
Javier Hernando
Marta Villegas
KELM
160
5
0
04 Feb 2025
A Survey of Large Language Models for Healthcare: from Data, Technology, and Applications to Accountability and Ethics
A Survey of Large Language Models for Healthcare: from Data, Technology, and Applications to Accountability and Ethics
Kai He
Rui Mao
Qika Lin
Yucheng Ruan
Xiang Lan
Mengling Feng
Min Zhang
LM&MAAILaw
228
176
0
28 Jan 2025
LLM4DistReconfig: A Fine-tuned Large Language Model for Power Distribution Network Reconfiguration
LLM4DistReconfig: A Fine-tuned Large Language Model for Power Distribution Network Reconfiguration
Panayiotis Christou
Md. Zahidul Islam
Yuzhang Lin
Jingwei Xiong
44
0
0
24 Jan 2025
Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models
Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models
Junyu Chen
Han Cai
Junsong Chen
Enze Xie
Shang Yang
Haotian Tang
Zhekai Zhang
Yaojie Lu
Song Han
DiffM
171
7
0
20 Jan 2025
Integrating LLMs with ITS: Recent Advances, Potentials, Challenges, and Future Directions
Integrating LLMs with ITS: Recent Advances, Potentials, Challenges, and Future Directions
Doaa Mahmud
Hadeel Hajmohamed
Shamma Almentheri
Shamma Alqaydi
Lameya Aldhaheri
R. A. Khalil
Nasir Saeed
AI4TS
99
12
0
08 Jan 2025
HuRef: HUman-REadable Fingerprint for Large Language Models
HuRef: HUman-REadable Fingerprint for Large Language Models
Boyi Zeng
Cheng Zhou
Yuncong Hu
Yi Xu
Chenghu Zhou
Xiang Wang
Yu Yu
Zhouhan Lin
137
12
0
08 Jan 2025
Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum
Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum
Hadi Pouransari
Chun-Liang Li
Jen-Hao Rick Chang
Pavan Kumar Anasosalu Vasu
Cem Koc
Vaishaal Shankar
Oncel Tuzel
93
11
0
08 Jan 2025
Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models
Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models
Yulei Qin
Yuncheng Yang
Pengcheng Guo
Gang Li
Hang Shao
Yuchen Shi
Zihan Xu
Yun Gu
Ke Li
Xing Sun
ALM
201
13
0
31 Dec 2024
Paint Outside the Box: Synthesizing and Selecting Training Data for Visual Grounding
Paint Outside the Box: Synthesizing and Selecting Training Data for Visual Grounding
Zilin Du
Haoxin Li
Jianfei Yu
Boyang Li
490
0
0
01 Dec 2024
The Well: a Large-Scale Collection of Diverse Physics Simulations for Machine Learning
The Well: a Large-Scale Collection of Diverse Physics Simulations for Machine Learning
Ruben Ohana
Michael McCabe
Lucas Meyer
Rudy Morel
Fruzsina J. Agocs
...
François Rozet
Liam Parker
M. Cranmer
S. Ho
Shirley Ho
PINNAI4CE
191
23
1
30 Nov 2024
FineWeb-zhtw: Scalable Curation of Traditional Chinese Text Data from
  the Web
FineWeb-zhtw: Scalable Curation of Traditional Chinese Text Data from the Web
Cheng-Wei Lin
Wan-Hsuan Hsieh
Kai-Xin Guan
Chan-Jan Hsu
Chia-Chen Kuo
Chuan-Lin Lai
Chung-Wei Chung
Ming-Jen Wang
Da-shan Shiu
74
1
0
25 Nov 2024
Training Bilingual LMs with Data Constraints in the Targeted Language
Training Bilingual LMs with Data Constraints in the Targeted Language
Skyler Seto
Maartje ter Hoeve
He Bai
Natalie Schluter
David Grangier
197
1
0
20 Nov 2024
Unlearning as multi-task optimization: A normalized gradient difference approach with an adaptive learning rate
Unlearning as multi-task optimization: A normalized gradient difference approach with an adaptive learning rate
Zhiqi Bu
Xiaomeng Jin
Bhanukiran Vinzamuri
Anil Ramakrishna
Kai-Wei Chang
Volkan Cevher
Mingyi Hong
MU
157
14
0
29 Oct 2024
Reverse Modeling in Large Language Models
Reverse Modeling in Large Language Models
S. Yu
Yuanchen Xu
Cunxiao Du
Yanying Zhou
Minghui Qiu
Q. Sun
Hao Zhang
Jiawei Wu
157
2
0
13 Oct 2024
Data Processing for the OpenGPT-X Model Family
Data Processing for the OpenGPT-X Model Family
Nicolo' Brandizzi
Hammam Abdelwahab
Anirban Bhowmick
Lennard Helmer
Benny Jörg Stein
...
Georg Rehm
Dennis Wegener
Nicolas Flores-Herr
Joachim Kohler
Johannes Leveling
VLM
138
2
0
11 Oct 2024
Margin Matching Preference Optimization: Enhanced Model Alignment with Granular Feedback
Margin Matching Preference Optimization: Enhanced Model Alignment with Granular Feedback
Kyuyoung Kim
Ah Jeong Seo
Hao Liu
Jinwoo Shin
Kimin Lee
48
5
0
04 Oct 2024
Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling
Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling
David Grangier
Simin Fan
Skyler Seto
Pierre Ablin
205
5
0
30 Sep 2024
Scaling Optimal LR Across Token Horizons
Scaling Optimal LR Across Token Horizons
Johan Bjorck
Alon Benhaim
Vishrav Chaudhary
Furu Wei
Xia Song
144
8
0
30 Sep 2024
MIO: A Foundation Model on Multimodal Tokens
MIO: A Foundation Model on Multimodal Tokens
Zekun Wang
King Zhu
Chunpu Xu
Wangchunshu Zhou
Jiaheng Liu
...
Yuanxing Zhang
Ge Zhang
Ke Xu
Jie Fu
Wenhao Huang
MLLMAuLLM
167
12
0
26 Sep 2024
Harnessing Diversity for Important Data Selection in Pretraining Large
  Language Models
Harnessing Diversity for Important Data Selection in Pretraining Large Language Models
Chi Zhang
Huaping Zhong
Kuan Zhang
Chengliang Chai
Rui Wang
...
Lei Cao
Ju Fan
Ye Yuan
Guoren Wang
Conghui He
TDI
100
10
0
25 Sep 2024
Small Language Models: Survey, Measurements, and Insights
Small Language Models: Survey, Measurements, and Insights
Zhenyan Lu
Xiang Li
Dongqi Cai
Rongjie Yi
Fangming Liu
Xiwen Zhang
Nicholas D. Lane
Mengwei Xu
ObjDLRM
157
58
0
24 Sep 2024
123
Next