AI safety via debate
G. Irving, Paul Christiano, Dario Amodei (2 May 2018)

Papers citing "AI safety via debate"

42 papers shown

A Hashgraph-Inspired Consensus Mechanism for Reliable Multi-Model Reasoning
Kolawole E. Ogunsina, Morayo A. Ogunsina (06 May 2025)

An alignment safety case sketch based on debate
Marie Davidsen Buhl, Jacob Pfau, Benjamin Hilton, Geoffrey Irving (06 May 2025)

What Is AI Safety? What Do We Want It to Be?
Jacqueline Harding, Cameron Domenico Kirk-Giannini (05 May 2025)

Scaling Laws For Scalable Oversight
Joshua Engels, David D. Baek, Subhash Kantamneni, Max Tegmark (25 Apr 2025) [ELM]

Redefining Superalignment: From Weak-to-Strong Alignment to Human-AI Co-Alignment to Sustainable Symbiotic Society
Feifei Zhao, Y. Wang, Enmeng Lu, Dongcheng Zhao, Bing Han, ..., Chao Liu, Yaodong Yang, Yi Zeng, Boyuan Chen, Jinyu Fan (24 Apr 2025)

Societal Alignment Frameworks Can Improve LLM Alignment
Karolina Stańczak, Nicholas Meade, Mehar Bhatia, Hattie Zhou, Konstantin Böttinger, ..., Timothy P. Lillicrap, Ana Marasović, Sylvie Delacroix, Gillian K. Hadfield, Siva Reddy (27 Feb 2025)

KARMA: Leveraging Multi-Agent LLMs for Automated Knowledge Graph Enrichment
Yuxing Lu, Jinzhuo Wang (10 Feb 2025)

Neural Interactive Proofs
Lewis Hammond, Sam Adam-Day (12 Dec 2024) [AAML]

WARP: On the Benefits of Weight Averaged Rewarded Policies
Alexandre Ramé, Johan Ferret, Nino Vieillard, Robert Dadashi, Léonard Hussenot, Pierre-Louis Cedoz, Pier Giuseppe Sessa, Sertan Girgin, Arthur Douillard, Olivier Bachem (24 Jun 2024)

Stress-Testing Capability Elicitation With Password-Locked Models
Ryan Greenblatt, Fabien Roger, Dmitrii Krasheninnikov, David M. Krueger (29 May 2024)

Designing for Human-Agent Alignment: Understanding what humans want from their agents
Nitesh Goyal, Minsuk Chang, Michael Terry (04 Apr 2024)

AI Control: Improving Safety Despite Intentional Subversion
Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, Fabien Roger (12 Dec 2023)

Playing Large Games with Oracles and AI Debate
Xinyi Chen, Angelica Chen, Dean Foster, Elad Hazan (08 Dec 2023)

Towards Understanding Sycophancy in Language Models
Mrinank Sharma, Meg Tong, Tomasz Korbak, D. Duvenaud, Amanda Askell, ..., Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez (20 Oct 2023)

Beyond Reverse KL: Generalizing Direct Preference Optimization with Diverse Divergence Constraints
Chaoqi Wang, Yibo Jiang, Yuguang Yang, Han Liu, Yuxin Chen (28 Sep 2023)

Cognitive Architectures for Language Agents
T. Sumers, Shunyu Yao, Karthik Narasimhan, Thomas L. Griffiths (05 Sep 2023) [LLMAG, LM&Ro]

GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher
Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Pinjia He, Shuming Shi, Zhaopeng Tu (12 Aug 2023) [SILM]

Improving Factuality and Reasoning in Language Models through Multiagent Debate
Yilun Du, Shuang Li, Antonio Torralba, J. Tenenbaum, Igor Mordatch (23 May 2023) [LLMAG, LRM]

RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, Tong Zhang (13 Apr 2023) [ALM]

Discovering Latent Knowledge in Language Models Without Supervision
Collin Burns, Haotian Ye, Dan Klein, Jacob Steinhardt (07 Dec 2022)

Measuring Progress on Scalable Oversight for Large Language Models
Sam Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, ..., Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Benjamin Mann, Jared Kaplan (04 Nov 2022) [ALM, ELM]

Will we run out of data? Limits of LLM scaling based on human-generated data
Pablo Villalobos, A. Ho, J. Sevilla, T. Besiroglu, Lennart Heim, Marius Hobbhahn (26 Oct 2022) [ALM]

Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals
Rohin Shah, Vikrant Varma, Ramana Kumar, Mary Phuong, Victoria Krakovna, J. Uesato, Zachary Kenton (04 Oct 2022)

Improving alignment of dialogue agents via targeted human judgements
Amelia Glaese, Nat McAleese, Maja Trębacz, John Aslanides, Vlad Firoiu, ..., John F. J. Mellor, Demis Hassabis, Koray Kavukcuoglu, Lisa Anne Hendricks, G. Irving (28 Sep 2022) [ALM, AAML]

The Alignment Problem from a Deep Learning Perspective
Richard Ngo, Lawrence Chan, Sören Mindermann (30 Aug 2022)

Hierarchical Symbolic Reasoning in Hyperbolic Space for Deep Discriminative Models
Ainkaran Santhirasekaram, Avinash Kori, A. Rockall, Mathias Winkler, Francesca Toni, Ben Glocker (05 Jul 2022) [FAtt]

Forecasting Future World Events with Neural Networks
Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks (30 Jun 2022)

Self-critiquing models for assisting human evaluators
William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Ouyang Long, Jonathan Ward, Jan Leike (12 Jun 2022) [ALM, ELM]

Adversarial Training for High-Stakes Reliability
Daniel M. Ziegler, Seraphina Nix, Lawrence Chan, Tim Bauman, Peter Schmidt-Nielsen, ..., Noa Nabeshima, Benjamin Weinstein-Raun, D. Haas, Buck Shlegeris, Nate Thomas (03 May 2022) [AAML]

Single-Turn Debate Does Not Help Humans Answer Hard Reading-Comprehension Questions
Alicia Parrish, H. Trivedi, Ethan Perez, Angelica Chen, Nikita Nangia, Jason Phang, Sam Bowman (11 Apr 2022)

Safe Deep RL in 3D Environments using Human Feedback
Matthew Rahtz, Vikrant Varma, Ramana Kumar, Zachary Kenton, Shane Legg, Jan Leike (20 Jan 2022)

WebGPT: Browser-assisted question-answering with human feedback
Reiichiro Nakano, Jacob Hilton, S. Balaji, Jeff Wu, Ouyang Long, ..., Gretchen Krueger, Kevin Button, Matthew Knight, B. Chess, John Schulman (17 Dec 2021) [ALM, RALM]

Recursively Summarizing Books with Human Feedback
Jeff Wu, Long Ouyang, Daniel M. Ziegler, Nissan Stiennon, Ryan J. Lowe, Jan Leike, Paul Christiano (22 Sep 2021) [ALM]

Impossibility Results in AI: A Survey
Mario Brčič, Roman V. Yampolskiy (01 Sep 2021)

Intelligence and Unambitiousness Using Algorithmic Information Theory
Michael K. Cohen, Badri N. Vellambi, Marcus Hutter (13 May 2021)

An overview of 11 proposals for building safe advanced AI
Evan Hubinger (04 Dec 2020) [AAML]

Avoiding Tampering Incentives in Deep RL via Decoupled Approval
J. Uesato, Ramana Kumar, Victoria Krakovna, Tom Everitt, Richard Ngo, Shane Legg (17 Nov 2020)

AI safety: state of the field through quantitative lens
Mislav Juric, A. Sandic, Mario Brčič (12 Feb 2020)

Advocacy Learning: Learning through Competition and Class-Conditional Representations
Ian Fox, Jenna Wiens (07 Aug 2019) [SSL]

The Role of Cooperation in Responsible AI Development
Amanda Askell, Miles Brundage, Gillian Hadfield (10 Jul 2019)

Scalable agent alignment via reward modeling: a research direction
Jan Leike, David M. Krueger, Tom Everitt, Miljan Martic, Vishal Maini, Shane Legg (19 Nov 2018)

AGI Safety Literature Review
Tom Everitt, G. Lea, Marcus Hutter (03 May 2018) [AI4CE]