arXiv:2206.05862 (v7, latest)
X-Risk Analysis for AI Research
13 June 2022
Dan Hendrycks
Mantas Mazeika
Papers citing "X-Risk Analysis for AI Research"
49 / 49 papers shown
Title
The Lock-in Hypothesis: Stagnation by Algorithm
Tianyi Qiu
Zhonghao He
Tejasveer Chugh
Max Kleiman-Weiner
59
0
0
06 Jun 2025
From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs
Stanley Yu
Vaidehi Bulusu
Oscar Yasunaga
Clayton Lau
Cole Blondin
Sean O'Brien
Kevin Zhu
Vasu Sharma
74
0
0
27 May 2025
The Traitors: Deception and Trust in Multi-Agent Language Model Simulations
Pedro M. P. Curvo
LLMAG
72
0
0
19 May 2025
Establishing Reliability Metrics for Reward Models in Large Language Models
Yizhou Chen
Yawen Liu
Xuesi Wang
Qingtao Yu
Guangda Huzhang
Anxiang Zeng
Han Yu
Zhiming Zhou
88
0
0
21 Apr 2025
Functional Abstraction of Knowledge Recall in Large Language Models
Zijian Wang
Chang Xu
KELM
68
1
0
20 Apr 2025
LLM-Safety Evaluations Lack Robustness
Tim Beyer
Sophie Xhonneux
Simon Geisler
Gauthier Gidel
Leo Schwinn
Stephan Günnemann
ALM
ELM
488
2
0
04 Mar 2025
Unraveling Token Prediction Refinement and Identifying Essential Layers in Language Models
Jaturong Kongmanee
89
1
0
25 Jan 2025
Two Types of AI Existential Risk: Decisive and Accumulative
Atoosa Kasirzadeh
152
18
0
20 Jan 2025
Beyond the Safety Bundle: Auditing the Helpful and Harmless Dataset
Khaoula Chehbouni
Jonathan Colaço-Carr
Yash More
Jackie CK Cheung
G. Farnadi
177
1
0
12 Nov 2024
Data-Aware Training Quality Monitoring and Certification for Reliable Deep Learning
Farhang Yeganegi
Arian Eamaz
Mojtaba Soltanalian
53
0
0
14 Oct 2024
SAGE: Scalable Ground Truth Evaluations for Large Sparse Autoencoders
Constantin Venhoff
Anisoara Calinescu
Philip Torr
Christian Schroeder de Witt
74
0
0
09 Oct 2024
Speculations on Uncertainty and Humane Algorithms
Nicholas Gray
90
0
0
13 Aug 2024
Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?
Richard Ren
Steven Basart
Adam Khoja
Alice Gatti
Long Phan
...
Alexander Pan
Gabriel Mukobi
Ryan H. Kim
Stephen Fitz
Dan Hendrycks
ELM
87
25
0
31 Jul 2024
Generalized Out-of-Distribution Detection and Beyond in Vision Language Model Era: A Survey
Atsuyuki Miyai
Jingkang Yang
Jingyang Zhang
Yifei Ming
Sisir Dhakal
...
Yixuan Li
Hai "Helen" Li
Ziwei Liu
Toshihiko Yamasaki
Kiyoharu Aizawa
141
13
0
31 Jul 2024
Right Now, Wrong Then: Non-Stationary Direct Preference Optimization under Preference Drift
Seongho Son
William Bankes
Sayak Ray Chowdhury
Brooks Paige
Ilija Bogunovic
131
4
0
26 Jul 2024
FlowCon: Out-of-Distribution Detection using Flow-Based Contrastive Learning
Saandeep Aathreya
Shaun J. Canavan
OODD
77
0
0
03 Jul 2024
WARP: On the Benefits of Weight Averaged Rewarded Policies
Alexandre Ramé
Johan Ferret
Nino Vieillard
Robert Dadashi
Léonard Hussenot
Pierre-Louis Cedoz
Pier Giuseppe Sessa
Sertan Girgin
Arthur Douillard
Olivier Bachem
136
23
0
24 Jun 2024
MANO: Exploiting Matrix Norm for Unsupervised Accuracy Estimation Under Distribution Shifts
Renchunzi Xie
Ambroise Odonnat
Vasilii Feofanov
Weijian Deng
Jianfeng Zhang
Bo An
82
2
0
29 May 2024
Mechanistic Interpretability for AI Safety -- A Review
Leonard Bereska
E. Gavves
AI4CE
139
158
0
22 Apr 2024
Holistic Safety and Responsibility Evaluations of Advanced AI Models
Laura Weidinger
Joslyn Barnhart
Jenny Brennan
Christina Butterfield
Susie Young
...
Sebastian Farquhar
Lewis Ho
Iason Gabriel
Allan Dafoe
William S. Isaac
ELM
90
9
0
22 Apr 2024
SOTOPIA-π: Interactive Learning of Socially Intelligent Language Agents
Ruiyi Wang
Haofei Yu
W. Zhang
Zhengyang Qi
Maarten Sap
Graham Neubig
Yonatan Bisk
Hao Zhu
LLMAG
121
44
0
13 Mar 2024
WARM: On the Benefits of Weight Averaged Reward Models
Alexandre Ramé
Nino Vieillard
Léonard Hussenot
Robert Dadashi
Geoffrey Cideron
Olivier Bachem
Johan Ferret
181
104
0
22 Jan 2024
Hazards from Increasingly Accessible Fine-Tuning of Downloadable Foundation Models
Alan Chan
Ben Bucknall
Herbie Bradley
David M. Krueger
66
6
0
22 Dec 2023
Survey on AI Ethics: A Socio-technical Perspective
Dave Mbiazi
Meghana Bhange
Maryam Babaei
Ivaxi Sheth
Patrik Kenfack
99
5
0
28 Nov 2023
Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models
Michael Lan
Phillip H. S. Torr
Fazl Barez
LRM
68
3
0
07 Nov 2023
ChatGPT v Bard v Bing v Claude 2 v Aria v human-expert. How good are AI chatbots at scientific writing?
Edisa Lozić
Benjamin Štular
91
35
0
14 Sep 2023
Uncovering the Hidden Cost of Model Compression
Diganta Misra
Muawiz Chaudhary
Agam Goyal
Bharat Runwal
Pin-Yu Chen
VLM
95
0
0
29 Aug 2023
How Good Are LLMs at Out-of-Distribution Detection?
Bo Liu
Li-Ming Zhan
Zexin Lu
Yu Feng
Lei Xue
Xiao-Ming Wu
OODD
79
9
0
20 Aug 2023
Frontier AI Regulation: Managing Emerging Risks to Public Safety
Markus Anderljung
Joslyn Barnhart
Anton Korinek
Jade Leung
Cullen O'Keefe
...
Jonas Schuett
Yonadav Shavit
Divya Siddarth
Robert F. Trager
Kevin J. Wolf
SILM
154
125
0
06 Jul 2023
An Overview of Catastrophic AI Risks
Dan Hendrycks
Mantas Mazeika
Thomas Woodside
SILM
82
186
0
21 Jun 2023
Evaluating Superhuman Models with Consistency Checks
Lukas Fluri
Daniel Paleka
Florian Tramèr
ELM
124
44
0
16 Jun 2023
Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards
Alexandre Ramé
Guillaume Couairon
Mustafa Shukor
Corentin Dancette
Jean-Baptiste Gaya
Laure Soulier
Matthieu Cord
MoMe
123
158
0
07 Jun 2023
Towards Automated Circuit Discovery for Mechanistic Interpretability
Arthur Conmy
Augustine N. Mavor-Parker
Aengus Lynch
Stefan Heimersheim
Adrià Garriga-Alonso
68
319
0
28 Apr 2023
Can GPT-4 Perform Neural Architecture Search?
Mingkai Zheng
Xiu Su
Shan You
Fei Wang
Chao Qian
Chang Xu
Samuel Albanie
ELM
126
59
0
21 Apr 2023
Fairness in AI and Its Long-Term Implications on Society
Ondrej Bohdal
Timothy M. Hospedales
Philip Torr
Fazl Barez
104
4
0
16 Apr 2023
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
Alexander Pan
Chan Jun Shern
Andy Zou
Nathaniel Li
Steven Basart
Thomas Woodside
Jonathan Ng
Hanlin Zhang
Scott Emmons
Dan Hendrycks
117
137
0
06 Apr 2023
Machine Psychology: Investigating Emergent Capabilities and Behavior in Large Language Models Using Psychological Methods
Thilo Hagendorff
LLMAG
135
72
0
24 Mar 2023
Artificial Influence: An Analysis Of AI-Driven Persuasion
Matthew Burtell
T. Woodside
67
36
0
15 Mar 2023
Ignore Previous Prompt: Attack Techniques For Language Models
Fábio Perez
Ian Ribeiro
SILM
108
452
0
17 Nov 2022
The Expertise Problem: Learning from Specialized Feedback
Oliver Daniels-Koch
Rachel Freedman
OffRL
70
18
0
12 Nov 2022
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang
Alexandre Variengien
Arthur Conmy
Buck Shlegeris
Jacob Steinhardt
323
563
0
01 Nov 2022
Broken Neural Scaling Laws
Ethan Caballero
Kshitij Gupta
Irina Rish
David M. Krueger
153
76
0
26 Oct 2022
Robots-Dont-Cry: Understanding Falsely Anthropomorphic Utterances in Dialog Systems
David Gros
Yu Li
Zhou Yu
90
11
0
22 Oct 2022
How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios
Mantas Mazeika
Eric Tang
Andy Zou
Steven Basart
Jun Shern Chan
Dawn Song
David A. Forsyth
Jacob Steinhardt
Dan Hendrycks
87
10
0
18 Oct 2022
OpenOOD: Benchmarking Generalized Out-of-Distribution Detection
Jingkang Yang
Pengyun Wang
Dejian Zou
Zitang Zhou
Kun Ding
...
Kaiyang Zhou
Wayne Zhang
Dan Hendrycks
Yixuan Li
Ziwei Liu
OODD
95
245
0
13 Oct 2022
The Alignment Problem from a Deep Learning Perspective
Richard Ngo
Lawrence Chan
Sören Mindermann
141
192
0
30 Aug 2022
Forecasting Future World Events with Neural Networks
Andy Zou
Tristan Xiao
Ryan Jia
Joe Kwon
Mantas Mazeika
Richard Li
Dawn Song
Jacob Steinhardt
Owain Evans
Dan Hendrycks
108
27
0
30 Jun 2022
A Unified Survey on Anomaly, Novelty, Open-Set, and Out-of-Distribution Detection: Solutions and Future Challenges
Mohammadreza Salehi
Hossein Mirzaei
Dan Hendrycks
Yixuan Li
M. Rohban
Mohammad Sabokrou
OOD
169
199
0
26 Oct 2021
Generalized Out-of-Distribution Detection: A Survey
Jingkang Yang
Kaiyang Zhou
Yixuan Li
Ziwei Liu
311
956
0
21 Oct 2021