An overview of 11 proposals for building safe advanced AI
Evan Hubinger
4 December 2020
arXiv:2012.07532 [AAML]

Papers citing "An overview of 11 proposals for building safe advanced AI"
17 of 17 papers shown

Evaluating the Paperclip Maximizer: Are RL-Based Language Models More Likely to Pursue Instrumental Goals?
Yufei He, Yuexin Li, Jiaying Wu, Yuan Sui, Yulin Chen, Bryan Hooi. 16 Feb 2025. [ALM]

An Attempt to Unraveling Token Prediction Refinement and Identifying Essential Layers of Large Language Models
Jaturong Kongmanee. 28 Jan 2025.

FairMindSim: Alignment of Behavior, Emotion, and Belief in Humans and LLM Agents Amid Ethical Dilemmas
Yu Lei, Hao Liu, Chengxing Xie, Songjia Liu, Zhiyu Yin, Canyu Chen, G. Li, Philip H. S. Torr, Zhen Wu. 14 Oct 2024.

Mechanistic Interpretability for AI Safety -- A Review
Leonard Bereska, E. Gavves. 22 Apr 2024. [AI4CE]

Incentive Compatibility for AI Alignment in Sociotechnical Systems: Positions and Prospects
Zhaowei Zhang, Fengshuo Bai, Mingzhi Wang, Haoyang Ye, Chengdong Ma, Yaodong Yang. 20 Feb 2024.

Black-Box Access is Insufficient for Rigorous AI Audits
Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, ..., Michael Gerovitch, David Bau, Max Tegmark, David M. Krueger, Dylan Hadfield-Menell. 25 Jan 2024. [AAML]

Reinforcement Learning from LLM Feedback to Counteract Goal Misgeneralization
Houda Nait El Barj, Théophile Sautory. 14 Jan 2024.

Attribution Patching Outperforms Automated Circuit Discovery
Aaquib Syed, Can Rager, Arthur Conmy. 16 Oct 2023.

Large Language Model Alignment: A Survey
Tianhao Shen, Renren Jin, Yufei Huang, Chuang Liu, Weilong Dong, Zishan Guo, Xinwei Wu, Yan Liu, Deyi Xiong. 26 Sep 2023. [LM&MA]

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Stephen Casper, Xander Davies, Claudia Shi, T. Gilbert, Jérémy Scheurer, ..., Erdem Biyik, Anca Dragan, David M. Krueger, Dorsa Sadigh, Dylan Hadfield-Menell. 27 Jul 2023. [ALM, OffRL]

Towards Automated Circuit Discovery for Mechanistic Interpretability
Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso. 28 Apr 2023.

Conditioning Predictive Models: Risks and Strategies
Evan Hubinger, Adam Jermyn, Johannes Treutlein, Rubi Hudson, Kate Woolverton. 02 Feb 2023.

Circumventing interpretability: How to defeat mind-readers
Lee D. Sharkey. 21 Dec 2022.

Diagnostics for Deep Neural Networks with Automated Copy/Paste Attacks
Stephen Casper, K. Hariharan, Dylan Hadfield-Menell. 18 Nov 2022. [AAML]

Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks
Tilman Räuker, A. Ho, Stephen Casper, Dylan Hadfield-Menell. 27 Jul 2022. [AAML, AI4CE]

Robust Feature-Level Adversaries are Interpretability Tools
Stephen Casper, Max Nadeau, Dylan Hadfield-Menell, Gabriel Kreiman. 07 Oct 2021. [AAML]

AI safety via debate
G. Irving, Paul Christiano, Dario Amodei. 02 May 2018.