arXiv:2209.00626 · Cited By
The Alignment Problem from a Deep Learning Perspective
30 August 2022
Richard Ngo, Lawrence Chan, Sören Mindermann
Papers citing "The Alignment Problem from a Deep Learning Perspective" (50 / 131 papers shown)
An alignment safety case sketch based on debate
Marie Davidsen Buhl, Jacob Pfau, Benjamin Hilton, Geoffrey Irving
38 · 0 · 0 · 06 May 2025

What Is AI Safety? What Do We Want It to Be?
Jacqueline Harding, Cameron Domenico Kirk-Giannini
68 · 0 · 0 · 05 May 2025

Real-World Gaps in AI Governance Research
Ilan Strauss, Isobel Moure, Tim O'Reilly, Sruly Rosenblat
63 · 0 · 0 · 30 Apr 2025
Beyond Chains of Thought: Benchmarking Latent-Space Reasoning Abilities in Large Language Models [ReLM, ELM, LRM]
Thilo Hagendorff, Sarah Fabi
45 · 0 · 0 · 14 Apr 2025

An Evaluation of Cultural Value Alignment in LLM
Nicholas Sukiennik, Chen Gao, Fengli Xu, Yongqian Li
29 · 0 · 0 · 11 Apr 2025

On the Robustness of GUI Grounding Models Against Image Attacks [AAML]
Haoren Zhao, Tianyi Chen, Zhen Wang
36 · 1 · 0 · 07 Apr 2025
Representation Bending for Large Language Model Safety [AAML, ALM, KELM]
Ashkan Yousefpour, Taeheon Kim, Ryan S. Kwon, Seungbeen Lee, Wonje Jeung, Seungju Han, Alvin Wan, Harrison Ngan, Youngjae Yu, Jonghyun Choi
54 · 0 · 0 · 02 Apr 2025

AI threats to national security can be countered through an incident regime
Alejandro Ortega
43 · 0 · 0 · 25 Mar 2025
Mixture of Experts Made Intrinsically Interpretable [MoE]
Xingyi Yang, Constantin Venhoff, Ashkan Khakzar, Christian Schroeder de Witt, P. Dokania, Adel Bibi, Philip H. S. Torr
49 · 0 · 0 · 05 Mar 2025

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs [AAML]
Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, Owain Evans
80 · 9 · 0 · 24 Feb 2025

Episodic memory in AI agents poses risks that should be studied and mitigated
Chad DeChant
64 · 2 · 0 · 20 Jan 2025
Learning to Assist Humans without Inferring Rewards
Vivek Myers, Evan Ellis, Sergey Levine, Benjamin Eysenbach, Anca Dragan
40 · 2 · 0 · 17 Jan 2025

Measuring Error Alignment for Decision-Making Systems
Binxia Xu, Antonis Bikakis, Daniel Onah, A. Vlachidis, Luke Dickens
41 · 0 · 0 · 03 Jan 2025

Measuring Goal-Directedness
Matt MacDermott, James Fox, Francesco Belardinelli, Tom Everitt
93 · 1 · 0 · 06 Dec 2024
The Two-Hop Curse: LLMs trained on A→B, B→C fail to learn A→C [ReLM, LRM]
Mikita Balesni, Tomek Korbak, Owain Evans
79 · 0 · 0 · 25 Nov 2024

Can an AI Agent Safely Run a Government? Existence of Probably Approximately Aligned Policies
Frédéric Berdoz, Roger Wattenhofer
98 · 0 · 0 · 21 Nov 2024
Safety case template for frontier AI: A cyber inability argument
Arthur Goemans, Marie Davidsen Buhl, Jonas Schuett, Tomek Korbak, Jessica Wang, Benjamin Hilton, Geoffrey Irving
58 · 15 · 0 · 12 Nov 2024

AI Ethics by Design: Implementing Customizable Guardrails for Responsible AI Development
Kristina Šekrst, Jeremy McHugh, Jonathan Rodriguez Cefalu
67 · 0 · 0 · 05 Nov 2024

Constrained Human-AI Cooperation: An Inclusive Embodied Social Intelligence Challenge
Weihua Du, Qiushi Lyu, Jiaming Shan, Zhenting Qi, Hongxin Zhang, ..., Andi Peng, Tianmin Shu, Kwonjoon Lee, Behzad Dariush, Chuang Gan
40 · 1 · 0 · 04 Nov 2024
Towards evaluations-based safety cases for AI scheming [ELM]
Mikita Balesni, Marius Hobbhahn, David Lindner, Alexander Meinke, Tomek Korbak, ..., Dan Braun, Bilal Chughtai, Owain Evans, Daniel Kokotajlo, Lucius Bushnaq
47 · 9 · 0 · 29 Oct 2024

Fast Best-of-N Decoding via Speculative Rejection [BDL]
Hanshi Sun, Momin Haider, Ruiqi Zhang, Huitao Yang, Jiahao Qiu, Ming Yin, Mengdi Wang, Peter L. Bartlett, Andrea Zanette
45 · 28 · 0 · 26 Oct 2024

AI, Global Governance, and Digital Sovereignty
Swati Srivastava, Justin Bullock
37 · 0 · 0 · 23 Oct 2024
Looking Inward: Language Models Can Learn About Themselves by Introspection [KELM, AIFin, LRM]
Felix J Binder, James Chua, Tomek Korbak, Henry Sleight, John Hughes, Robert Long, Ethan Perez, Miles Turpin, Owain Evans
35 · 12 · 0 · 17 Oct 2024

FairMindSim: Alignment of Behavior, Emotion, and Belief in Humans and LLM Agents Amid Ethical Dilemmas
Yu Lei, Hao Liu, Chengxing Xie, Songjia Liu, Zhiyu Yin, Canyu Chen, Bernard Ghanem, Philip H. S. Torr, Zhen Wu
33 · 3 · 0 · 14 Oct 2024
The Geometry of Concepts: Sparse Autoencoder Feature Structure
Yuxiao Li, Eric J. Michaud, David D. Baek, Joshua Engels, Xiaoqing Sun, Max Tegmark
52 · 7 · 0 · 10 Oct 2024

Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models
Michael Lan, Philip H. S. Torr, Austin Meek, Ashkan Khakzar, David M. Krueger, Fazl Barez
43 · 10 · 0 · 09 Oct 2024

Towards Measuring Goal-Directedness in AI Systems
Dylan Xu, Juan-Pablo Rivera
24 · 3 · 0 · 07 Oct 2024
Moral Alignment for LLM Agents
Elizaveta Tennant, Stephen Hailes, Mirco Musolesi
45 · 1 · 0 · 02 Oct 2024

TracrBench: Generating Interpretability Testbeds with Large Language Models
Hannes Thurnherr, Jérémy Scheurer
46 · 3 · 0 · 07 Sep 2024

On the Generalization of Preference Learning with DPO
Shawn Im, Yixuan Li
49 · 1 · 0 · 06 Aug 2024
The Sociolinguistic Foundations of Language Modeling
Jack Grieve, Sara Bartl, Matteo Fuoli, Jason Grafmiller, Weihang Huang, A. Jawerbaum, Akira Murakami, Marcus Perlman, Dana Roemling, Bodo Winter
41 · 7 · 0 · 12 Jul 2024

AI Safety in Generative AI Large Language Models: A Survey [LM&MA]
Jaymari Chua, Yun Yvonna Li, Shiyi Yang, Chen Wang, Lina Yao
36 · 12 · 0 · 06 Jul 2024
On scalable oversight with weak LLMs judging strong LLMs [ELM]
Zachary Kenton, Noah Y. Siegel, János Kramár, Jonah Brown-Cohen, Samuel Albanie, ..., Rishabh Agarwal, David Lindner, Yunhao Tang, Noah D. Goodman, Rohin Shah
43 · 29 · 0 · 05 Jul 2024

ProductAgent: Benchmarking Conversational Product Search Agent with Asking Clarification Questions
Jingheng Ye, Yong Jiang, Xiaobin Wang, Hai-Tao Zheng, Yangning Li, Hai-Tao Zheng, Pengjun Xie, Fei Huang
40 · 2 · 0 · 01 Jul 2024
Towards shutdownable agents via stochastic choice
Elliott Thornley, Alexander Roman, Christos Ziakas, Leyton Ho, Louis Thomson
38 · 0 · 0 · 30 Jun 2024

Aligning Model Properties via Conformal Risk Control
William Overman, Jacqueline Jil Vallon, Mohsen Bayati
33 · 2 · 0 · 26 Jun 2024

WARP: On the Benefits of Weight Averaged Rewarded Policies
Alexandre Ramé, Johan Ferret, Nino Vieillard, Robert Dadashi, Léonard Hussenot, Pierre-Louis Cedoz, Pier Giuseppe Sessa, Sertan Girgin, Arthur Douillard, Olivier Bachem
59 · 14 · 0 · 24 Jun 2024
It Takes Two: On the Seamlessness between Reward and Policy Model in RLHF
Taiming Lu, Lingfeng Shen, Xinyu Yang, Weiting Tan, Beidi Chen, Huaxiu Yao
61 · 2 · 0 · 12 Jun 2024

AI Sandbagging: Language Models can Strategically Underperform on Evaluations [ELM]
Teun van der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F. Brown, Francis Rhys Ward
47 · 23 · 0 · 11 Jun 2024

Aligning Large Language Models with Representation Editing: A Control Perspective
Lingkai Kong, Haorui Wang, Wenhao Mu, Yuanqi Du, Yuchen Zhuang, Yifei Zhou, Yue Song, Rongzhi Zhang, Kai Wang, Chao Zhang
30 · 22 · 0 · 10 Jun 2024
Towards the Transferability of Rewards Recovered via Regularized Inverse Reinforcement Learning [OffRL]
Andreas Schlaginhaufen, Maryam Kamgarpour
23 · 1 · 0 · 03 Jun 2024

Stress-Testing Capability Elicitation With Password-Locked Models
Ryan Greenblatt, Fabien Roger, Dmitrii Krasheninnikov, David M. Krueger
38 · 14 · 0 · 29 May 2024

The Dual Imperative: Innovation and Regulation in the AI Era
Paulo Carvao
31 · 0 · 0 · 23 May 2024

LIRE: listwise reward enhancement for preference alignment
Mingye Zhu, Yi Liu, Lei Zhang, Junbo Guo, Zhendong Mao
26 · 7 · 0 · 22 May 2024
Wav-KAN: Wavelet Kolmogorov-Arnold Networks
Zavareh Bozorgasl, Hao Chen
30 · 96 · 0 · 21 May 2024

Can Language Models Explain Their Own Classification Behavior?
Dane Sherburn, Bilal Chughtai, Owain Evans
47 · 1 · 0 · 13 May 2024

People cannot distinguish GPT-4 from a human in a Turing test [ELM, DeLMO]
Cameron R. Jones, Benjamin K. Bergen
42 · 31 · 0 · 09 May 2024
Mechanistic Interpretability for AI Safety -- A Review [AI4CE]
Leonard Bereska, E. Gavves
40 · 114 · 0 · 22 Apr 2024

RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs
Shreyas Chaudhari, Pranjal Aggarwal, Vishvak Murahari, Tanmay Rajpurohit, A. Kalyan, Karthik Narasimhan, A. Deshpande, Bruno Castro da Silva
26 · 34 · 0 · 12 Apr 2024

The Probabilities Also Matter: A More Faithful Metric for Faithfulness of Free-Text Explanations in Large Language Models
Noah Y. Siegel, Oana-Maria Camburu, N. Heess, Maria Perez-Ortiz
23 · 8 · 0 · 04 Apr 2024