An Overview of Catastrophic AI Risks
arXiv:2306.12001 · 21 June 2023
Dan Hendrycks, Mantas Mazeika, Thomas Woodside
[SILM]
Papers citing "An Overview of Catastrophic AI Risks" (31 of 31 shown)
1. What Is AI Safety? What Do We Want It to Be?
   Jacqueline Harding, Cameron Domenico Kirk-Giannini (05 May 2025)

2. AI Awareness
   Xianrui Li, Haoyuan Shi, Rongwu Xu, Wei Xu (25 Apr 2025)

3. Redefining Superalignment: From Weak-to-Strong Alignment to Human-AI Co-Alignment to Sustainable Symbiotic Society
   Feifei Zhao, Yufei Wang, Enmeng Lu, Dongcheng Zhao, Bing Han, ..., Chao Liu, Yaodong Yang, Yi Zeng, Boyuan Chen, Jinyu Fan (24 Apr 2025)

4. Superintelligence Strategy: Expert Version
   Dan Hendrycks, Eric Schmidt, Alexandr Wang (07 Mar 2025)

5. Forecasting Frontier Language Model Agent Capabilities [LLMAG, ELM]
   Govind Pimpale, Axel Højmark, Jérémy Scheurer, Marius Hobbhahn (21 Feb 2025)

6. On Adversarial Robustness of Language Models in Transfer Learning [AAML]
   Bohdan Turbal, Anastasiia Mazur, Jiaxu Zhao, Mykola Pechenizkiy (03 Jan 2025)

7. Neural Interactive Proofs [AAML]
   Lewis Hammond, Sam Adam-Day (12 Dec 2024)

8. Quantifying Feature Space Universality Across Large Language Models via Sparse Autoencoders
   Michael Lan, Philip Torr, Austin Meek, Ashkan Khakzar, David M. Krueger, Fazl Barez (09 Oct 2024)

9. Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation
   Xinpeng Wang, Chengzhi Hu, Paul Röttger, Barbara Plank (04 Oct 2024)

10. Language Models Learn to Mislead Humans via RLHF
    Jiaxin Wen, Ruiqi Zhong, Akbir Khan, Ethan Perez, Jacob Steinhardt, Minlie Huang, Samuel R. Bowman, He He, Shi Feng (19 Sep 2024)

11. AI-LieDar: Examine the Trade-off Between Utility and Truthfulness in LLM Agents [LLMAG]
    Zhe Su, Xuhui Zhou, Sanketh Rangreji, Anubha Kabra, Julia Mendelsohn, Faeze Brahman, Maarten Sap (13 Sep 2024)

12. Personality Alignment of Large Language Models [ALM]
    Minjun Zhu, Linyi Yang, Yue Zhang (21 Aug 2024)

13. "Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak [AAML]
    Lingrui Mei, Shenghua Liu, Yiwei Wang, Baolong Bi, Jiayi Mao, Xueqi Cheng (17 Jun 2024)

14. The Dual Imperative: Innovation and Regulation in the AI Era
    Paulo Carvao (23 May 2024)

15. Societal Adaptation to Advanced AI
    Jamie Bernardi, Gabriel Mukobi, Hilary Greaves, Lennart Heim, Markus Anderljung (16 May 2024)

16. When LLMs Meet Cybersecurity: A Systematic Literature Review
    Jie Zhang, Haoyu Bu, Hui Wen, Yu Chen, Lun Li, Hongsong Zhu (06 May 2024)

17. Responsible Reporting for Frontier AI Development
    Noam Kolt, Markus Anderljung, Joslyn Barnhart, Asher Brass, K. Esvelt, Gillian K. Hadfield, Lennart Heim, Mikel Rodriguez, Jonas B. Sandbrink, Thomas Woodside (03 Apr 2024)

18. Is this the real life? Is this just fantasy? The Misleading Success of Simulating Social Interactions With LLMs
    Xuhui Zhou, Zhe Su, Tiwalayo Eisape, Hyunwoo J. Kim, Maarten Sap (08 Mar 2024)

19. The Philosopher's Stone: Trojaning Plugins of Large Language Models [AAML]
    Tian Dong, Minhui Xue, Guoxing Chen, Rayne Holland, Shaofeng Li, Yan Meng, Zhen Liu, Haojin Zhu (01 Dec 2023)

20. A Review of the Evidence for Existential Risk from AI via Misaligned Power-Seeking
    Rose Hadshar (27 Oct 2023)

21. Interpretable Diffusion via Information Decomposition
    Xianghao Kong, Ollie Liu, Han Li, Dani Yogatama, Greg Ver Steeg (12 Oct 2023)

22. Language Models Represent Space and Time
    Wes Gurnee, Max Tegmark (03 Oct 2023)

23. Beyond Reverse KL: Generalizing Direct Preference Optimization with Diverse Divergence Constraints
    Chaoqi Wang, Yibo Jiang, Yuguang Yang, Han Liu, Yuxin Chen (28 Sep 2023)

24. Sparse Autoencoders Find Highly Interpretable Features in Language Models [MILM]
    Hoagy Cunningham, Aidan Ewart, Logan Riggs, R. Huben, Lee Sharkey (15 Sep 2023)

25. Deception Abilities Emerged in Large Language Models [LLMAG]
    Thilo Hagendorff (31 Jul 2023)

26. Sparks of Artificial General Intelligence: Early experiments with GPT-4 [ELM, AI4MH, AI4CE, ALM]
    Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, J. Gehrke, Eric Horvitz, ..., Scott M. Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, Yi Zhang (22 Mar 2023)

27. Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
    Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt (01 Nov 2022)

28. In-context Learning and Induction Heads
    Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova Dassarma, ..., Tom B. Brown, Jack Clark, Jared Kaplan, Sam McCandlish, C. Olah (24 Sep 2022)

29. The Alignment Problem from a Deep Learning Perspective
    Richard Ngo, Lawrence Chan, Sören Mindermann (30 Aug 2022)

30. Unsolved Problems in ML Safety
    Dan Hendrycks, Nicholas Carlini, John Schulman, Jacob Steinhardt (28 Sep 2021)

31. Fine-Tuning Language Models from Human Preferences [ALM]
    Daniel M. Ziegler, Nisan Stiennon, Jeff Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, G. Irving (18 Sep 2019)