The Alignment Problem from a Deep Learning Perspective

30 August 2022
Richard Ngo, Lawrence Chan, Sören Mindermann

Papers citing "The Alignment Problem from a Deep Learning Perspective"

Showing 50 of 131 citing papers.
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, Aaron Mueller
28 Mar 2024

Understanding the Learning Dynamics of Alignment with Human Feedback
Shawn Im, Yixuan Li
Tags: ALM
27 Mar 2024

A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment
Tianhe Wu, Kede Ma, Jie-Kai Liang, Yujiu Yang, Lei Zhang
16 Mar 2024

Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking
Cassidy Laidlaw, Shivam Singhal, Anca Dragan
Tags: AAML
05 Mar 2024

Incentive Compatibility for AI Alignment in Sociotechnical Systems: Positions and Prospects
Zhaowei Zhang, Fengshuo Bai, Mingzhi Wang, Haoyang Ye, Chengdong Ma, Yaodong Yang
20 Feb 2024

Mapping the Ethics of Generative AI: A Comprehensive Scoping Review
Thilo Hagendorff
13 Feb 2024

AI-Augmented Predictions: LLM Assistants Improve Human Forecasting Accuracy
P. Schoenegger, Peter S. Park, Ezra Karger, P. Tetlock
12 Feb 2024

Secret Collusion among Generative AI Agents: Multi-Agent Deception via Steganography
S. Motwani, Mikhail Baranchuk, Martin Strohmeier, Vijay Bolina, Philip Torr, Lewis Hammond, Christian Schroeder de Witt
12 Feb 2024

Limitations of Agents Simulated by Predictive Models
Raymond Douglas, Jacek Karwowski, Chan Bae, Andis Draguns, Victoria Krakovna
08 Feb 2024

Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
Andy Zhou, Bo Li, Haohan Wang
Tags: AAML
30 Jan 2024

Tradeoffs Between Alignment and Helpfulness in Language Models with Representation Engineering
Yotam Wolf, Noam Wies, Dorin Shteyman, Binyamin Rothberg, Yoav Levine, Amnon Shashua
Tags: LLMSV
29 Jan 2024

Black-Box Access is Insufficient for Rigorous AI Audits
Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, ..., Michael Gerovitch, David Bau, Max Tegmark, David M. Krueger, Dylan Hadfield-Menell
Tags: AAML
25 Jan 2024

ARGS: Alignment as Reward-Guided Search
Maxim Khanov, Jirayu Burapacheep, Yixuan Li
23 Jan 2024

Visibility into AI Agents
Alan Chan, Carson Ezell, Max Kaufmann, K. Wei, Lewis Hammond, ..., Nitarshan Rajkumar, David M. Krueger, Noam Kolt, Lennart Heim, Markus Anderljung
23 Jan 2024

WARM: On the Benefits of Weight Averaged Reward Models
Alexandre Ramé, Nino Vieillard, Léonard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, Johan Ferret
22 Jan 2024

Universal Neurons in GPT2 Language Models
Wes Gurnee, Theo Horsley, Zifan Carl Guo, Tara Rezaei Kheirkhah, Qinyi Sun, Will Hathaway, Neel Nanda, Dimitris Bertsimas
Tags: MILM
22 Jan 2024

Interpretable Concept Bottlenecks to Align Reinforcement Learning Agents
Quentin Delfosse, Sebastian Sztwiertnia, M. Rothermel, Wolfgang Stammer, Kristian Kersting
11 Jan 2024

Quantifying stability of non-power-seeking in artificial agents
Evan Ryan Gunter, Yevgeny Liokumovich, Victoria Krakovna
07 Jan 2024

Data-Centric Foundation Models in Computational Healthcare: A Survey
Yunkun Zhang, Jin Gao, Zheling Tan, Lingfeng Zhou, Kexin Ding, Mu Zhou, Shaoting Zhang, Dequan Wang
Tags: AI4CE
04 Jan 2024

Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, ..., Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, Jeff Wu
Tags: ELM
14 Dec 2023

Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models
Alexandre Variengien, Eric Winsor
Tags: LRM, ReLM
13 Dec 2023

Tell, don't show: Declarative facts influence how LLMs generalize
Alexander Meinke, Owain Evans
12 Dec 2023

AI Control: Improving Safety Despite Intentional Subversion
Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, Fabien Roger
12 Dec 2023

What Causes Polysemanticity? An Alternative Origin Story of Mixed Selectivity from Incidental Causes
Victor Lecomte, Kushal Thaman, Rylan Schaeffer, Naomi Bashkansky, Trevor Chow, Sanmi Koyejo
Tags: AAML, MILM
05 Dec 2023

Building Trustworthy NeuroSymbolic AI Systems: Consistency, Reliability, Explainability, and Safety
Manas Gaur, Amit P. Sheth
05 Dec 2023

Eliciting Latent Knowledge from Quirky Language Models
Alex Troy Mallen, Madeline Brumley, Julia Kharchenko, Nora Belrose
Tags: HILM, RALM, KELM
02 Dec 2023

Hashmarks: Privacy-Preserving Benchmarks for High-Stakes AI Evaluation
P. Bricman
01 Dec 2023

The Case for Scalable, Data-Driven Theory: A Paradigm for Scientific Progress in NLP
Julian Michael
01 Dec 2023

Does GPT-4 pass the Turing test?
Cameron R. Jones, Benjamin K. Bergen
Tags: ELM
31 Oct 2023

A Review of the Evidence for Existential Risk from AI via Misaligned Power-Seeking
Rose Hadshar
27 Oct 2023

Managing extreme AI risks amid rapid progress
Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, ..., Philip Torr, Stuart J. Russell, Daniel Kahneman, J. Brauner, Sören Mindermann
26 Oct 2023

Unpacking the Ethical Value Alignment in Big Models
Xiaoyuan Yi, Jing Yao, Xiting Wang, Xing Xie
26 Oct 2023

Implicit meta-learning may lead language models to trust more reliable sources
Dmitrii Krasheninnikov, Egor Krasheninnikov, Bruno Mlodozeniec, Tegan Maharaj, David M. Krueger
23 Oct 2023

Improving Generalization of Alignment with Human Preferences through Group Invariant Learning
Rui Zheng, Wei Shen, Yuan Hua, Wenbin Lai, Shihan Dou, ..., Xiao Wang, Haoran Huang, Tao Gui, Qi Zhang, Xuanjing Huang
18 Oct 2023

Large Language Model Prediction Capabilities: Evidence from a Real-World Forecasting Tournament
P. Schoenegger, Peter S. Park
Tags: ELM, AI4TS
17 Oct 2023

Denevil: Towards Deciphering and Navigating the Ethical Values of Large Language Models via Instruction Learning
Shitong Duan, Xiaoyuan Yi, Peng Zhang, T. Lu, Xing Xie, Ning Gu
17 Oct 2023

Multinational AGI Consortium (MAGIC): A Proposal for International Coordination on AI
Jason Hausenloy, Andrea Miotti, Claire Dennis
13 Oct 2023

Understanding and Controlling a Maze-Solving Policy Network
Ulisse Mini, Peli Grietzer, Mrinank Sharma, Austin Meek, M. MacDiarmid, Alexander Matt Turner
12 Oct 2023

Language Models Represent Space and Time
Wes Gurnee, Max Tegmark
03 Oct 2023

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions
Lorenzo Pacchiardi, A. J. Chan, Sören Mindermann, Ilan Moscovitz, Alexa Y. Pan, Y. Gal, Owain Evans, J. Brauner
Tags: LLMAG, HILM
26 Sep 2023

Large Language Model Alignment: A Survey
Tianhao Shen, Renren Jin, Yufei Huang, Chuang Liu, Weilong Dong, Zishan Guo, Xinwei Wu, Yan Liu, Deyi Xiong
Tags: LM&MA
26 Sep 2023

Sparse Autoencoders Find Highly Interpretable Features in Language Models
Hoagy Cunningham, Aidan Ewart, Logan Riggs, R. Huben, Lee Sharkey
Tags: MILM
15 Sep 2023

ChatGPT v Bard v Bing v Claude 2 v Aria v human-expert. How good are AI chatbots at scientific writing?
Edisa Lozić, Benjamin Štular
14 Sep 2023

OpinionGPT: Modelling Explicit Biases in Instruction-Tuned LLMs
Patrick Haller, Ansar Aynetdinov, Alan Akbik
07 Sep 2023

Taken out of context: On measuring situational awareness in LLMs
Lukas Berglund, Asa Cooper Stickland, Mikita Balesni, Max Kaufmann, Meg Tong, Tomasz Korbak, Daniel Kokotajlo, Owain Evans
Tags: LLMAG, LRM
01 Sep 2023

International Governance of Civilian AI: A Jurisdictional Certification Approach
Robert F. Trager, Benjamin Harack, Anka Reuel, A. Carnegie, Lennart Heim, ..., R. Lall, Owen Larter, Seán Ó hÉigeartaigh, Simon Staffell, José Jaime Villalobos
29 Aug 2023

PARL: A Unified Framework for Policy Alignment in Reinforcement Learning from Human Feedback
Souradip Chakraborty, Amrit Singh Bedi, Alec Koppel, Dinesh Manocha, Huazheng Wang, Mengdi Wang, Furong Huang
03 Aug 2023

VisAlign: Dataset for Measuring the Degree of Alignment between AI and Humans in Visual Perception
Jiyoung Lee, Seung Wook Kim, Seunghyun Won, Joonseok Lee, Marzyeh Ghassemi, James Thorne, Jaeseok Choi, O.-Kil Kwon, Edward Choi
03 Aug 2023

Deception Abilities Emerged in Large Language Models
Thilo Hagendorff
Tags: LLMAG
31 Jul 2023

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Stephen Casper, Xander Davies, Claudia Shi, T. Gilbert, Jérémy Scheurer, ..., Erdem Biyik, Anca Dragan, David M. Krueger, Dorsa Sadigh, Dylan Hadfield-Menell
Tags: ALM, OffRL
27 Jul 2023