2411.04368
Cited By
Measuring short-form factuality in large language models
7 November 2024
Jason W. Wei
Nguyen Karina
Hyung Won Chung
Yunxin Joy Jiao
Spencer Papay
Amelia Glaese
John Schulman
W. Fedus
ELM
KELM
HILM
Github (3566★)
Papers citing
"Measuring short-form factuality in large language models"
35 / 35 papers shown
Title
CC-LEARN: Cohort-based Consistency Learning
Xiao Ye
Shaswat Shrivastava
Zhaonan Li
Jacob Dineen
Shijie Lu
Avneet Ahuja
Ming Shen
Zhikun Xu
Ben Zhou
OffRL
LRM
50
0
0
18 Jun 2025
Textual Bayes: Quantifying Uncertainty in LLM-Based Systems
Brendan Leigh Ross
Noël Vouitsis
Atiyeh Ashari Ghomi
Rasa Hosseinzadeh
Ji Xin
...
Yi Sui
Shiyi Hou
Kin Kwan Leung
Gabriel Loaiza-Ganem
Jesse C. Cresswell
74
0
0
11 Jun 2025
The Geometries of Truth Are Orthogonal Across Tasks
Waiss Azizian
Michael Kirchhof
Eugène Ndiaye
Louis Béthune
Michal Klein
Pierre Ablin
Marco Cuturi
34
0
0
10 Jun 2025
AssertBench: A Benchmark for Evaluating Self-Assertion in Large Language Models
Jaeho Lee
Atharv Chowdhary
HILM
40
0
0
08 Jun 2025
ConfQA: Answer Only If You Are Confident
Yin Huang
Yifan Ethan Xu
Kai Sun
Vera Yan
Alicia Sun
...
Yue Liu
Aaron Colak
Anuj Kumar
Wen-tau Yih
Xin Luna Dong
HILM
28
0
0
08 Jun 2025
Quantifying Cross-Modality Memorization in Vision-Language Models
Yuxin Wen
Yangsibo Huang
Tom Goldstein
Ravi Kumar
Badih Ghazi
Chiyuan Zhang
120
0
0
05 Jun 2025
SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models
Thinh Pham
Nguyen Nguyen
Pratibha Zunjare
Weiyuan Chen
Yu-Min Tseng
Tu Vu
RALM
ReLM
ELM
ALM
LRM
104
0
0
01 Jun 2025
Inter-Passage Verification for Multi-evidence Multi-answer QA
Bingsen Chen
Shengjie Wang
Xi Ye
Chen Zhao
RALM
39
0
0
31 May 2025
Pangu DeepDiver: Adaptive Search Intensity Scaling via Open-Web Reinforcement Learning
Wenxuan Shi
Haochen Tan
Chuqiao Kuang
Xiaoguang Li
Xiaozhe Ren
...
Hanting Chen
Yasheng Wang
Lifeng Shang
Fisher Yu
Yunhe Wang
RALM
38
0
0
30 May 2025
MetaFaith: Faithful Natural Language Uncertainty Expression in LLMs
Gabrielle Kaili-May Liu
Gal Yona
Avi Caciularu
Idan Szpektor
Tim G. J. Rudner
Arman Cohan
46
0
0
30 May 2025
Read Your Own Mind: Reasoning Helps Surface Self-Confidence Signals in LLMs
Jakub Podolak
Rajeev Verma
ReLM
LRM
27
0
0
28 May 2025
Faithfulness-Aware Uncertainty Quantification for Fact-Checking the Output of Retrieval Augmented Generation
Ekaterina Fadeeva
Aleksandr Rubashevskii
Roman Vashurin
Shehzaad Dhuliawala
Artem Shelmanov
Timothy Baldwin
Preslav Nakov
Mrinmaya Sachan
Maxim Panov
HILM
77
0
0
27 May 2025
Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering
Bowen Jiang
Runchuan Zhu
Jiang Wu
Zinco Jiang
Yifan He
...
Haote Yang
Songyang Zhang
Dahua Lin
Lijun Wu
Conghui He
ELM
56
0
0
22 May 2025
Date Fragments: A Hidden Bottleneck of Tokenization for Temporal Reasoning
Gagan Bhatia
Maxime Peyrard
Wei Zhao
65
0
0
22 May 2025
An Empirical Study on Reinforcement Learning for Reasoning-Search Interleaved LLM Agents
Bowen Jin
Jinsung Yoon
Priyanka Kargupta
Sercan O. Arik
Jiawei Han
LRM
153
2
0
21 May 2025
Pre-training Large Memory Language Models with Internal and External Knowledge
Linxi Zhao
Sofian Zalouk
Christian K. Belardi
Justin Lovelace
Jin Peng Zhou
Kilian Q. Weinberger
Yoav Artzi
Jennifer J. Sun
KELM
HILM
107
0
0
21 May 2025
Adaptive Plan-Execute Framework for Smart Contract Security Auditing
Zhiyuan Wei
Jing Sun
Zijian Zhang
Zhe Hou
Zixiao Zhao
196
0
0
21 May 2025
BARREL: Boundary-Aware Reasoning for Factual and Reliable LRMs
Junxiao Yang
Jinzhe Tu
Haoran Liu
Xiaoce Wang
Chujie Zheng
...
Caishun Chen
Tiantian He
Hongning Wang
Yew-Soon Ong
Minlie Huang
LRM
107
0
0
18 May 2025
IRLBench: A Multi-modal, Culturally Grounded, Parallel Irish-English Benchmark for Open-Ended LLM Reasoning Evaluation
Khanh-Tung Tran
Barry O'Sullivan
Hoang D. Nguyen
ELM
LRM
120
0
0
16 May 2025
Llama See, Llama Do: A Mechanistic Perspective on Contextual Entrainment and Distraction in LLMs
Jingcheng Niu
Xingdi Yuan
Tong Wang
Hamidreza Saghir
Amir H. Abdi
79
0
0
14 May 2025
HalluLens: LLM Hallucination Benchmark
Yejin Bang
Ziwei Ji
Alan Schelten
Anthony Hartshorn
Tara Fowler
Cheng Zhang
Nicola Cancedda
Pascale Fung
HILM
132
5
0
24 Apr 2025
aiXamine: Simplified LLM Safety and Security
Fatih Deniz
Dorde Popovic
Yazan Boshmaf
Euisuh Jeong
M. Ahmad
Sanjay Chawla
Issa M. Khalil
ELM
341
0
0
21 Apr 2025
Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark
Enxin Song
Wenhao Chai
Weili Xu
Jianwen Xie
Yuxuan Liu
Gaoang Wang
127
6
0
20 Apr 2025
Revisiting Uncertainty Quantification Evaluation in Language Models: Spurious Interactions with Response Length Bias Results
Andrea Santilli
Adam Goliński
Michael Kirchhof
Federico Danieli
Arno Blaas
Miao Xiong
Luca Zappella
Sinead Williamson
71
3
0
18 Apr 2025
Unity RL Playground: A Versatile Reinforcement Learning Framework for Mobile Robots
Linqi Ye
Rankun Li
Xiaowen Hu
Jiayi Li
Boyang Xing
Yan Peng
Bin Liang
113
0
0
07 Mar 2025
BOSE: A Systematic Evaluation Method Optimized for Base Models
Hongzhi Luan
Changxin Tian
Zhaoxin Huan
Xiaolu Zhang
Kunlong Chen
Qing Cui
Zhiqiang Zhang
93
1
0
02 Mar 2025
A Survey of Uncertainty Estimation Methods on Large Language Models
Zhiqiu Xia
Jinxuan Xu
Yuqian Zhang
Hang Liu
109
3
0
28 Feb 2025
Conformal Linguistic Calibration: Trading-off between Factuality and Specificity
Zhengping Jiang
Anqi Liu
Benjamin Van Durme
178
3
0
26 Feb 2025
Optimizing Model Selection for Compound AI Systems
Lingjiao Chen
Jared Quincy Davis
Boris Hanin
Peter Bailis
Matei A. Zaharia
James Zou
Ion Stoica
126
4
0
20 Feb 2025
Unbiased Evaluation of Large Language Models from a Causal Perspective
Meilin Chen
Jian Tian
Liang Ma
Di Xie
Weijie Chen
Jiang Zhu
ALM
ELM
168
0
0
10 Feb 2025
STAIR: Improving Safety Alignment with Introspective Reasoning
Yuanhang Zhang
Siyuan Zhang
Yao Huang
Zeyu Xia
Zhengwei Fang
Xiao Yang
Ranjie Duan
Dong Yan
Yinpeng Dong
Jun Zhu
LRM
LLMSV
166
7
0
04 Feb 2025
Trading Inference-Time Compute for Adversarial Robustness
Wojciech Zaremba
Evgenia Nitishinskaya
Boaz Barak
Stephanie Lin
Sam Toyer
...
Rachel Dias
Eric Wallace
Kai Y. Xiao
Johannes Heidecke
Amelia Glaese
LRM
AAML
171
26
0
31 Jan 2025
Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation
Satyapriya Krishna
Kalpesh Krishna
Anhad Mohananey
Steven Schwarcz
Adam Stambler
Shyam Upadhyay
Manaal Faruqui
ReLM
3DV
LRM
RALM
103
30
0
28 Jan 2025
The FACTS Grounding Leaderboard: Benchmarking LLMs' Ability to Ground Responses to Long-Form Input
Alon Jacovi
Andrew Wang
Chris Alberti
Connie Tao
Jon Lipovetz
...
Rachana Fellinger
Rui Wang
Zizhao Zhang
Sasha Goldshtein
Dipanjan Das
HILM
ALM
198
17
0
06 Jan 2025
Decoding Knowledge in Large Language Models: A Framework for Categorization and Comprehension
Yanbo Fang
Ruixiang Tang
ELM
83
0
0
03 Jan 2025