The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

1 June 2023
Guilherme Penedo
Quentin Malartic
Daniel Hesslow
Ruxandra-Aimée Cojocaru
Alessandro Cappelli
Hamza Alobeidli
B. Pannier
Ebtesam Almazrouei
Julien Launay
arXiv: 2306.01116

Papers citing "The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only"

50 / 587 papers shown
NumeroLogic: Number Encoding for Enhanced LLMs' Numerical Reasoning
Eli Schwartz
Leshem Choshen
J. Shtok
Sivan Doveh
Leonid Karlinsky
Assaf Arbelle
28
13
0
30 Mar 2024
Aurora-M: The First Open Source Multilingual Language Model Red-teamed according to the U.S. Executive Order
Taishi Nakamura
Mayank Mishra
Simone Tedeschi
Yekun Chai
Jason T Stillerman
...
Virendra Mehta
Matthew Blumberg
Victor May
Huu Nguyen
S. Pyysalo
LRM
45
7
0
30 Mar 2024
A Review of Multi-Modal Large Language and Vision Models
Kilian Carolan
Laura Fennelly
Alan F. Smeaton
VLM
22
23
0
28 Mar 2024
Dataverse: Open-Source ETL (Extract, Transform, Load) Pipeline for Large Language Models
Hyunbyung Park
Sukyung Lee
Gyoungjin Gim
Yungi Kim
Dahyun Kim
Chanjun Park
VLM
42
0
0
28 Mar 2024
Rejection Improves Reliability: Training LLMs to Refuse Unknown Questions Using RL from Knowledge Feedback
Hongshen Xu
Zichen Zhu
Situo Zhang
Da Ma
Shuai Fan
Lu Chen
Kai Yu
HILM
39
35
0
27 Mar 2024
Juru: Legal Brazilian Large Language Model from Reputable Sources
Roseval Malaquias Junior
Ramon Pires
R. Romero
R. Nogueira
ELM
AILaw
34
0
0
26 Mar 2024
ILLUMINER: Instruction-tuned Large Language Models as Few-shot Intent Classifier and Slot Filler
Paramita Mirza
Viju Sudhi
S. Sahoo
Sinchana Ramakanth Bhat
25
4
0
26 Mar 2024
Decoding the Digital Fine Print: Navigating the potholes in Terms of service/ use of GenAI tools against the emerging need for Transparent and Trustworthy Tech Futures
Sundaraparipurnan Narayanan
34
0
0
26 Mar 2024
Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance
Jiasheng Ye
Peiju Liu
Tianxiang Sun
Yunhua Zhou
Jun Zhan
Xipeng Qiu
57
64
0
25 Mar 2024
Ultra Low-Cost Two-Stage Multimodal System for Non-Normative Behavior Detection
Albert Lu
Stephen Cranefield
32
0
0
24 Mar 2024
A Little Leak Will Sink a Great Ship: Survey of Transparency for Large Language Models from Start to Finish
Masahiro Kaneko
Timothy Baldwin
PILM
34
3
0
24 Mar 2024
Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention
Bin Gao
Zhuomin He
Puru Sharma
Qingxuan Kang
Djordje Jevdjic
Junbo Deng
Xingkun Yang
Zhou Yu
Pengfei Zuo
71
45
0
23 Mar 2024
Evidence-Driven Retrieval Augmented Response Generation for Online Misinformation
Zhenrui Yue
Huimin Zeng
Yimeng Lu
Lanyu Shang
Yang Zhang
Dong Wang
RALM
OffRL
36
19
0
22 Mar 2024
Chain-of-Interaction: Enhancing Large Language Models for Psychiatric Behavior Understanding by Dyadic Contexts
Guangzeng Han
Weisi Liu
Xiaolei Huang
Brian Borsari
36
21
0
20 Mar 2024
BadEdit: Backdooring large language models by model editing
Yanzhou Li
Tianlin Li
Kangjie Chen
Jian Zhang
Shangqing Liu
Wenhan Wang
Tianwei Zhang
Yang Liu
SyDa
AAML
KELM
56
53
0
20 Mar 2024
Dated Data: Tracing Knowledge Cutoffs in Large Language Models
Jeffrey Cheng
Marc Marone
Orion Weller
Dawn J Lawrie
Daniel Khashabi
Benjamin Van Durme
67
13
0
19 Mar 2024
RankPrompt: Step-by-Step Comparisons Make Language Models Better Reasoners
Chi Hu
Yuan Ge
Xiangnan Ma
Hang Cao
Qiang Li
Yonghua Yang
Tong Xiao
Jingbo Zhu
ReLM
ELM
LRM
ALM
45
9
0
19 Mar 2024
Loops On Retrieval Augmented Generation (LoRAG)
Ayush Thakur
Rashmi Vashisth
19
1
0
18 Mar 2024
MT-PATCHER: Selective and Extendable Knowledge Distillation from Large Language Models for Machine Translation
Jiahuan Li
Shanbo Cheng
Shujian Huang
Jiajun Chen
35
7
0
14 Mar 2024
Language models scale reliably with over-training and on downstream tasks
S. Gadre
Georgios Smyrnis
Vaishaal Shankar
Suchin Gururangan
Mitchell Wortsman
...
Y. Carmon
Achal Dave
Reinhard Heckel
Niklas Muennighoff
Ludwig Schmidt
ALM
ELM
LRM
108
40
0
13 Mar 2024
ORPO: Monolithic Preference Optimization without Reference Model
Jiwoo Hong
Noah Lee
James Thorne
OSLM
42
209
0
12 Mar 2024
Materials science in the era of large language models: a perspective
Ge Lei
Ronan Docherty
Samuel J. Cooper
45
18
0
11 Mar 2024
Development of a Reliable and Accessible Caregiving Language Model (CaLM)
B. Parmanto
Bayu Aryoyudanta
Wilbert Soekinto
Agus Setiawan
Yuhan Wang
Haomin Hu
Andi Saptono
Yong K Choi
32
0
0
11 Mar 2024
ConspEmoLLM: Conspiracy Theory Detection Using an Emotion-Based Large Language Model
Zhiwei Liu
Boyang Liu
Paul Thompson
Kailai Yang
Sophia Ananiadou
40
3
0
11 Mar 2024
Unsupervised Real-Time Hallucination Detection based on the Internal States of Large Language Models
Weihang Su
Changyue Wang
Qingyao Ai
Hu Yiran
Zhijing Wu
Yujia Zhou
Yiqun Liu
HILM
47
28
0
11 Mar 2024
From Instructions to Constraints: Language Model Alignment with Automatic Constraint Verification
Fei Wang
Chao Shang
Sarthak Jain
Shuai Wang
Qiang Ning
Bonan Min
Vittorio Castelli
Yassine Benajiba
Dan Roth
ALM
22
8
0
10 Mar 2024
FLAP: Flow-Adhering Planning with Constrained Decoding in LLMs
Shamik Roy
Sailik Sengupta
Daniele Bonadiman
Saab Mansour
Arshit Gupta
29
5
0
09 Mar 2024
SaulLM-7B: A pioneering Large Language Model for Law
Pierre Colombo
T. Pires
Malik Boudiaf
Dominic Culver
Rui Melo
...
Andre F. T. Martins
Fabrizio Esposito
Vera Lúcia Raposo
Sofia Morgado
Michael Desa
ELM
AILaw
52
66
0
06 Mar 2024
MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets
Hossein Aboutalebi
Hwanjun Song
Yusheng Xie
Arshit Gupta
Justin Sun
Hang Su
Igor Shalyminov
Nikolaos Pappas
Siffi Singh
Saab Mansour
DiffM
EGVM
48
4
0
05 Mar 2024
Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding
Zhenyu Zhang
Runjin Chen
Shiwei Liu
Zhewei Yao
Olatunji Ruwase
Beidi Chen
Xiaoxia Wu
Zhangyang Wang
34
26
0
05 Mar 2024
Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use
Imad Eddine Toubal
Aditya Avinash
N. Alldrin
Jan Dlabal
Wenlei Zhou
...
Chun-Ta Lu
Howard Zhou
Ranjay Krishna
Ariel Fuxman
Tom Duerig
VLM
75
7
0
05 Mar 2024
Alpaca against Vicuna: Using LLMs to Uncover Memorization of LLMs
Aly M. Kassem
Omar Mahmoud
Niloofar Mireshghallah
Hyunwoo J. Kim
Yulia Tsvetkov
Yejin Choi
Sherif Saad
Santu Rana
50
19
0
05 Mar 2024
NiNformer: A Network in Network Transformer with Token Mixing Generated Gating Function
Abdullah Nazhat Abdullah
Tarkan Aydin
39
0
0
04 Mar 2024
ParallelPARC: A Scalable Pipeline for Generating Natural-Language Analogies
Oren Sultan
Yonatan Bitton
Ron Yosef
Dafna Shahaf
34
9
0
02 Mar 2024
Direct Alignment of Draft Model for Speculative Decoding with Chat-Fine-Tuned LLMs
Raghavv Goel
Mukul Gagrani
Wonseok Jeon
Junyoung Park
Mingu Lee
Christopher Lott
ALM
34
5
0
29 Feb 2024
WanJuan-CC: A Safe and High-Quality Open-sourced English Webtext Dataset
Jiantao Qiu
Haijun Lv
Zhenjiang Jin
Rui Wang
Wenchang Ning
...
Zhongying Tu
Lin Dahua
Yu Qiao
Hang Yan
Conghui He
36
6
0
29 Feb 2024
MIKO: Multimodal Intention Knowledge Distillation from Large Language Models for Social-Media Commonsense Discovery
Feihong Lu
Weiqi Wang
Yangyifei Luo
Ziqin Zhu
Qingyun Sun
...
Haochen Shi
Shiqi Gao
Qian Li
Yangqiu Song
Jianxin Li
VLM
42
2
0
28 Feb 2024
KoDialogBench: Evaluating Conversational Understanding of Language Models with Korean Dialogue Benchmark
Seongbo Jang
Seonghyeon Lee
Hwanjo Yu
ELM
29
0
0
27 Feb 2024
MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT
Omkar Thawakar
Ashmal Vayani
Salman Khan
Hisham Cholakal
Rao M. Anwer
M. Felsberg
Timothy Baldwin
Eric P. Xing
Fahad Shahbaz Khan
48
31
0
26 Feb 2024
SelectIT: Selective Instruction Tuning for LLMs via Uncertainty-Aware Self-Reflection
Liangxin Liu
Xuebo Liu
Derek F. Wong
Dongfang Li
Ziyi Wang
Baotian Hu
Min Zhang
53
17
0
26 Feb 2024
ChatMusician: Understanding and Generating Music Intrinsically with LLM
Ti-Fen Pan
Hanfeng Lin
Yi Wang
Zeyue Tian
Shangda Wu
...
Gus Xia
Roger Dannenberg
Wei Xue
Shiyin Kang
Yike Guo
101
35
0
25 Feb 2024
Cleaner Pretraining Corpus Curation with Neural Web Scraping
Zhipeng Xu
Zhenghao Liu
Yukun Yan
Zhiyuan Liu
Ge Yu
Chenyan Xiong
CLIP
OnRL
27
4
0
22 Feb 2024
LLMs with Industrial Lens: Deciphering the Challenges and Prospects -- A Survey
Ashok Urlana
Charaka Vinayak Kumar
Ajeet Kumar Singh
B. Garlapati
S. Chalamala
Rahul Mishra
35
5
0
22 Feb 2024
On the Tip of the Tongue: Analyzing Conceptual Representation in Large Language Models with Reverse-Dictionary Probe
Ningyu Xu
Qi Zhang
Menghan Zhang
Peng Qian
Xuanjing Huang
LRM
67
3
0
22 Feb 2024
Eagle: Ethical Dataset Given from Real Interactions
Masahiro Kaneko
Danushka Bollegala
Timothy Baldwin
44
3
0
22 Feb 2024
Kuaiji: the First Chinese Accounting Large Language Model
Jiayuan Luo
Songhua Yang
Xiaoling Qiu
Panyu Chen
Yufei Nai
Wenxuan Zeng
Wentao Zhang
Xinke Jiang
RALM
ALM
38
1
0
21 Feb 2024
$\infty$Bench: Extending Long Context Evaluation Beyond 100K Tokens
Xinrong Zhang
Yingfa Chen
Shengding Hu
Zihang Xu
Junhao Chen
...
Xu Han
Zhen Leng Thai
Shuo Wang
Zhiyuan Liu
Maosong Sun
RALM
LRM
50
148
0
21 Feb 2024
LongWanjuan: Towards Systematic Measurement for Long Text Quality
Kai Lv
Xiaoran Liu
Qipeng Guo
Hang Yan
Conghui He
Xipeng Qiu
Dahua Lin
33
4
0
21 Feb 2024
PIRB: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods
Slawomir Dadas
Michal Perelkiewicz
Rafal Poswiata
49
3
0
20 Feb 2024
ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic
Fajri Koto
Haonan Li
Sara Shatnawi
Jad Doughman
Abdelrahman Boda Sadallah
...
Neha Sengupta
Shady Shehata
Nizar Habash
Preslav Nakov
Timothy Baldwin
ELM
LRM
80
31
0
20 Feb 2024