ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2206.05862
  4. Cited By
X-Risk Analysis for AI Research
v1v2v3v4v5v6v7 (latest)

X-Risk Analysis for AI Research

13 June 2022
Dan Hendrycks
Mantas Mazeika
ArXiv (abs)PDFHTML

Papers citing "X-Risk Analysis for AI Research"

49 / 49 papers shown
Title
The Lock-in Hypothesis: Stagnation by Algorithm
The Lock-in Hypothesis: Stagnation by Algorithm
Tianyi Qiu
Zhonghao He
Tejasveer Chugh
Max Kleiman-Weiner
59
0
0
06 Jun 2025
From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs
From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs
Stanley Yu
Vaidehi Bulusu
Oscar Yasunaga
Clayton Lau
Cole Blondin
Sean O'Brien
Kevin Zhu
Vasu Sharma
74
0
0
27 May 2025
The Traitors: Deception and Trust in Multi-Agent Language Model Simulations
The Traitors: Deception and Trust in Multi-Agent Language Model Simulations
Pedro M. P. Curvo
LLMAG
72
0
0
19 May 2025
Establishing Reliability Metrics for Reward Models in Large Language Models
Establishing Reliability Metrics for Reward Models in Large Language Models
Yizhou Chen
Yawen Liu
Xuesi Wang
Qingtao Yu
Guangda Huzhang
Anxiang Zeng
Han Yu
Zhiming Zhou
88
0
0
21 Apr 2025
Functional Abstraction of Knowledge Recall in Large Language Models
Functional Abstraction of Knowledge Recall in Large Language Models
Zijian Wang
Chang Xu
KELM
68
1
0
20 Apr 2025
LLM-Safety Evaluations Lack Robustness
Tim Beyer
Sophie Xhonneux
Simon Geisler
Gauthier Gidel
Leo Schwinn
Stephan Günnemann
ALMELM
488
2
0
04 Mar 2025
Unraveling Token Prediction Refinement and Identifying Essential Layers in Language Models
Unraveling Token Prediction Refinement and Identifying Essential Layers in Language Models
Jaturong Kongmanee
89
1
0
25 Jan 2025
Two Types of AI Existential Risk: Decisive and Accumulative
Two Types of AI Existential Risk: Decisive and Accumulative
Atoosa Kasirzadeh
152
18
0
20 Jan 2025
Beyond the Safety Bundle: Auditing the Helpful and Harmless Dataset
Beyond the Safety Bundle: Auditing the Helpful and Harmless Dataset
Khaoula Chehbouni
Jonathan Colaço-Carr
Yash More
Jackie CK Cheung
G. Farnadi
177
1
0
12 Nov 2024
Data-Aware Training Quality Monitoring and Certification for Reliable
  Deep Learning
Data-Aware Training Quality Monitoring and Certification for Reliable Deep Learning
Farhang Yeganegi
Arian Eamaz
Mojtaba Soltanalian
53
0
0
14 Oct 2024
SAGE: Scalable Ground Truth Evaluations for Large Sparse Autoencoders
SAGE: Scalable Ground Truth Evaluations for Large Sparse Autoencoders
Constantin Venhoff
Anisoara Calinescu
Philip Torr
Christian Schroeder de Witt
74
0
0
09 Oct 2024
Speculations on Uncertainty and Humane Algorithms
Speculations on Uncertainty and Humane Algorithms
Nicholas Gray
90
0
0
13 Aug 2024
Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?
Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?
Richard Ren
Steven Basart
Adam Khoja
Alice Gatti
Long Phan
...
Alexander Pan
Gabriel Mukobi
Ryan H. Kim
Stephen Fitz
Dan Hendrycks
ELM
87
25
0
31 Jul 2024
Generalized Out-of-Distribution Detection and Beyond in Vision Language Model Era: A Survey
Generalized Out-of-Distribution Detection and Beyond in Vision Language Model Era: A Survey
Atsuyuki Miyai
Jingkang Yang
Jingyang Zhang
Yifei Ming
Sisir Dhakal
...
Yixuan Li
Hai "Helen" Li
Ziwei Liu
Toshihiko Yamasaki
Kiyoharu Aizawa
141
13
0
31 Jul 2024
Right Now, Wrong Then: Non-Stationary Direct Preference Optimization under Preference Drift
Right Now, Wrong Then: Non-Stationary Direct Preference Optimization under Preference Drift
Seongho Son
William Bankes
Sayak Ray Chowdhury
Brooks Paige
Ilija Bogunovic
131
4
0
26 Jul 2024
FlowCon: Out-of-Distribution Detection using Flow-Based Contrastive
  Learning
FlowCon: Out-of-Distribution Detection using Flow-Based Contrastive Learning
Saandeep Aathreya
Shaun J. Canavan
OODD
77
0
0
03 Jul 2024
WARP: On the Benefits of Weight Averaged Rewarded Policies
WARP: On the Benefits of Weight Averaged Rewarded Policies
Alexandre Ramé
Johan Ferret
Nino Vieillard
Robert Dadashi
Léonard Hussenot
Pierre-Louis Cedoz
Pier Giuseppe Sessa
Sertan Girgin
Arthur Douillard
Olivier Bachem
136
23
0
24 Jun 2024
MANO: Exploiting Matrix Norm for Unsupervised Accuracy Estimation Under
  Distribution Shifts
MANO: Exploiting Matrix Norm for Unsupervised Accuracy Estimation Under Distribution Shifts
Renchunzi Xie
Ambroise Odonnat
Vasilii Feofanov
Weijian Deng
Jianfeng Zhang
Bo An
82
2
0
29 May 2024
Mechanistic Interpretability for AI Safety -- A Review
Mechanistic Interpretability for AI Safety -- A Review
Leonard Bereska
E. Gavves
AI4CE
139
158
0
22 Apr 2024
Holistic Safety and Responsibility Evaluations of Advanced AI Models
Holistic Safety and Responsibility Evaluations of Advanced AI Models
Laura Weidinger
Joslyn Barnhart
Jenny Brennan
Christina Butterfield
Susie Young
...
Sebastian Farquhar
Lewis Ho
Iason Gabriel
Allan Dafoe
William S. Isaac
ELM
90
9
0
22 Apr 2024
SOTOPIA-$π$: Interactive Learning of Socially Intelligent Language
  Agents
SOTOPIA-πππ: Interactive Learning of Socially Intelligent Language Agents
Ruiyi Wang
Haofei Yu
W. Zhang
Zhengyang Qi
Maarten Sap
Graham Neubig
Yonatan Bisk
Hao Zhu
LLMAG
121
44
0
13 Mar 2024
WARM: On the Benefits of Weight Averaged Reward Models
WARM: On the Benefits of Weight Averaged Reward Models
Alexandre Ramé
Nino Vieillard
Léonard Hussenot
Robert Dadashi
Geoffrey Cideron
Olivier Bachem
Johan Ferret
181
104
0
22 Jan 2024
Hazards from Increasingly Accessible Fine-Tuning of Downloadable
  Foundation Models
Hazards from Increasingly Accessible Fine-Tuning of Downloadable Foundation Models
Alan Chan
Ben Bucknall
Herbie Bradley
David M. Krueger
66
6
0
22 Dec 2023
Survey on AI Ethics: A Socio-technical Perspective
Survey on AI Ethics: A Socio-technical Perspective
Dave Mbiazi
Meghana Bhange
Maryam Babaei
Ivaxi Sheth
Patrik Kenfack
99
5
0
28 Nov 2023
Towards Interpretable Sequence Continuation: Analyzing Shared Circuits
  in Large Language Models
Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models
Michael Lan
Phillip H. S. Torr
Fazl Barez
LRM
68
3
0
07 Nov 2023
ChatGPT v Bard v Bing v Claude 2 v Aria v human-expert. How good are AI
  chatbots at scientific writing?
ChatGPT v Bard v Bing v Claude 2 v Aria v human-expert. How good are AI chatbots at scientific writing?
Edisa Lozić
Benjamin Štular
91
35
0
14 Sep 2023
Uncovering the Hidden Cost of Model Compression
Uncovering the Hidden Cost of Model Compression
Diganta Misra
Muawiz Chaudhary
Agam Goyal
Bharat Runwal
Pin-Yu Chen
VLM
95
0
0
29 Aug 2023
How Good Are LLMs at Out-of-Distribution Detection?
How Good Are LLMs at Out-of-Distribution Detection?
Bo Liu
Li-Ming Zhan
Zexin Lu
Yu Feng
Lei Xue
Xiao-Ming Wu
OODD
79
9
0
20 Aug 2023
Frontier AI Regulation: Managing Emerging Risks to Public Safety
Frontier AI Regulation: Managing Emerging Risks to Public Safety
Markus Anderljung
Joslyn Barnhart
Anton Korinek
Jade Leung
Cullen O'Keefe
...
Jonas Schuett
Yonadav Shavit
Divya Siddarth
Robert F. Trager
Kevin J. Wolf
SILM
154
125
0
06 Jul 2023
An Overview of Catastrophic AI Risks
An Overview of Catastrophic AI Risks
Dan Hendrycks
Mantas Mazeika
Thomas Woodside
SILM
82
186
0
21 Jun 2023
Evaluating Superhuman Models with Consistency Checks
Evaluating Superhuman Models with Consistency Checks
Lukas Fluri
Daniel Paleka
Florian Tramèr
ELM
124
44
0
16 Jun 2023
Rewarded soups: towards Pareto-optimal alignment by interpolating
  weights fine-tuned on diverse rewards
Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards
Alexandre Ramé
Guillaume Couairon
Mustafa Shukor
Corentin Dancette
Jean-Baptiste Gaya
Laure Soulier
Matthieu Cord
MoMe
123
158
0
07 Jun 2023
Towards Automated Circuit Discovery for Mechanistic Interpretability
Towards Automated Circuit Discovery for Mechanistic Interpretability
Arthur Conmy
Augustine N. Mavor-Parker
Aengus Lynch
Stefan Heimersheim
Adrià Garriga-Alonso
68
319
0
28 Apr 2023
Can GPT-4 Perform Neural Architecture Search?
Can GPT-4 Perform Neural Architecture Search?
Mingkai Zheng
Xiu Su
Shan You
Fei Wang
Chao Qian
Chang Xu
Samuel Albanie
ELM
126
59
0
21 Apr 2023
Fairness in AI and Its Long-Term Implications on Society
Fairness in AI and Its Long-Term Implications on Society
Ondrej Bohdal
Timothy M. Hospedales
Philip Torr
Fazl Barez
104
4
0
16 Apr 2023
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards
  and Ethical Behavior in the MACHIAVELLI Benchmark
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
Alexander Pan
Chan Jun Shern
Andy Zou
Nathaniel Li
Steven Basart
Thomas Woodside
Jonathan Ng
Hanlin Zhang
Scott Emmons
Dan Hendrycks
117
137
0
06 Apr 2023
Machine Psychology: Investigating Emergent Capabilities and Behavior in
  Large Language Models Using Psychological Methods
Machine Psychology: Investigating Emergent Capabilities and Behavior in Large Language Models Using Psychological Methods
Thilo Hagendorff
LLMAG
135
72
0
24 Mar 2023
Artificial Influence: An Analysis Of AI-Driven Persuasion
Artificial Influence: An Analysis Of AI-Driven Persuasion
Matthew Burtell
T. Woodside
67
36
0
15 Mar 2023
Ignore Previous Prompt: Attack Techniques For Language Models
Ignore Previous Prompt: Attack Techniques For Language Models
Fábio Perez
Ian Ribeiro
SILM
108
452
0
17 Nov 2022
The Expertise Problem: Learning from Specialized Feedback
The Expertise Problem: Learning from Specialized Feedback
Oliver Daniels-Koch
Rachel Freedman
OffRL
70
18
0
12 Nov 2022
Interpretability in the Wild: a Circuit for Indirect Object
  Identification in GPT-2 small
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang
Alexandre Variengien
Arthur Conmy
Buck Shlegeris
Jacob Steinhardt
323
563
0
01 Nov 2022
Broken Neural Scaling Laws
Broken Neural Scaling Laws
Ethan Caballero
Kshitij Gupta
Irina Rish
David M. Krueger
153
76
0
26 Oct 2022
Robots-Dont-Cry: Understanding Falsely Anthropomorphic Utterances in
  Dialog Systems
Robots-Dont-Cry: Understanding Falsely Anthropomorphic Utterances in Dialog Systems
David Gros
Yu Li
Zhou Yu
90
11
0
22 Oct 2022
How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios
How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios
Mantas Mazeika
Eric Tang
Andy Zou
Steven Basart
Jun Shern Chan
Dawn Song
David A. Forsyth
Jacob Steinhardt
Dan Hendrycks
87
10
0
18 Oct 2022
OpenOOD: Benchmarking Generalized Out-of-Distribution Detection
OpenOOD: Benchmarking Generalized Out-of-Distribution Detection
Jingkang Yang
Pengyun Wang
Dejian Zou
Zitang Zhou
Kun Ding
...
Kaiyang Zhou
Wayne Zhang
Dan Hendrycks
Yixuan Li
Ziwei Liu
OODD
95
245
0
13 Oct 2022
The Alignment Problem from a Deep Learning Perspective
The Alignment Problem from a Deep Learning Perspective
Richard Ngo
Lawrence Chan
Sören Mindermann
141
192
0
30 Aug 2022
Forecasting Future World Events with Neural Networks
Forecasting Future World Events with Neural Networks
Andy Zou
Tristan Xiao
Ryan Jia
Joe Kwon
Mantas Mazeika
Richard Li
Dawn Song
Jacob Steinhardt
Owain Evans
Dan Hendrycks
108
27
0
30 Jun 2022
A Unified Survey on Anomaly, Novelty, Open-Set, and Out-of-Distribution
  Detection: Solutions and Future Challenges
A Unified Survey on Anomaly, Novelty, Open-Set, and Out-of-Distribution Detection: Solutions and Future Challenges
Mohammadreza Salehi
Hossein Mirzaei
Dan Hendrycks
Yixuan Li
M. Rohban
Mohammad Sabokrou
OOD
169
199
0
26 Oct 2021
Generalized Out-of-Distribution Detection: A Survey
Generalized Out-of-Distribution Detection: A Survey
Jingkang Yang
Kaiyang Zhou
Yixuan Li
Ziwei Liu
311
956
0
21 Oct 2021
1