An overview of 11 proposals for building safe advanced AI
Evan Hubinger
4 December 2020
arXiv:2012.07532 [AAML]

Papers citing "An overview of 11 proposals for building safe advanced AI"
17 of 17 papers shown

Evaluating the Paperclip Maximizer: Are RL-Based Language Models More Likely to Pursue Instrumental Goals?
Yufei He, Yuexin Li, Jiaying Wu, Yuan Sui, Yulin Chen, Bryan Hooi. 16 Feb 2025. [ALM]

An Attempt to Unraveling Token Prediction Refinement and Identifying Essential Layers of Large Language Models
Jaturong Kongmanee. 28 Jan 2025.

FairMindSim: Alignment of Behavior, Emotion, and Belief in Humans and LLM Agents Amid Ethical Dilemmas
Yu Lei, Hao Liu, Chengxing Xie, Songjia Liu, Zhiyu Yin, Canyu Chen, G. Li, Philip H. S. Torr, Zhen Wu. 14 Oct 2024.

Mechanistic Interpretability for AI Safety -- A Review
Leonard Bereska, E. Gavves. 22 Apr 2024. [AI4CE]

Incentive Compatibility for AI Alignment in Sociotechnical Systems: Positions and Prospects
Zhaowei Zhang, Fengshuo Bai, Mingzhi Wang, Haoyang Ye, Chengdong Ma, Yaodong Yang. 20 Feb 2024.

Black-Box Access is Insufficient for Rigorous AI Audits
Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, ..., Michael Gerovitch, David Bau, Max Tegmark, David M. Krueger, Dylan Hadfield-Menell. 25 Jan 2024. [AAML]

Reinforcement Learning from LLM Feedback to Counteract Goal Misgeneralization
Houda Nait El Barj, Théophile Sautory. 14 Jan 2024.

Attribution Patching Outperforms Automated Circuit Discovery
Aaquib Syed, Can Rager, Arthur Conmy. 16 Oct 2023.

Large Language Model Alignment: A Survey
Tianhao Shen, Renren Jin, Yufei Huang, Chuang Liu, Weilong Dong, Zishan Guo, Xinwei Wu, Yan Liu, Deyi Xiong. 26 Sep 2023. [LM&MA]

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Stephen Casper, Xander Davies, Claudia Shi, T. Gilbert, Jérémy Scheurer, ..., Erdem Biyik, Anca Dragan, David M. Krueger, Dorsa Sadigh, Dylan Hadfield-Menell. 27 Jul 2023. [ALM, OffRL]

Towards Automated Circuit Discovery for Mechanistic Interpretability
Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso. 28 Apr 2023.

Conditioning Predictive Models: Risks and Strategies
Evan Hubinger, Adam Jermyn, Johannes Treutlein, Rubi Hudson, Kate Woolverton. 02 Feb 2023.

Circumventing interpretability: How to defeat mind-readers
Lee D. Sharkey. 21 Dec 2022.

Diagnostics for Deep Neural Networks with Automated Copy/Paste Attacks
Stephen Casper, K. Hariharan, Dylan Hadfield-Menell. 18 Nov 2022. [AAML]

Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks
Tilman Räuker, A. Ho, Stephen Casper, Dylan Hadfield-Menell. 27 Jul 2022. [AAML, AI4CE]

Robust Feature-Level Adversaries are Interpretability Tools
Stephen Casper, Max Nadeau, Dylan Hadfield-Menell, Gabriel Kreiman. 07 Oct 2021. [AAML]

AI safety via debate
G. Irving, Paul Christiano, Dario Amodei. 02 May 2018.