Super Co-alignment of Human and AI for Sustainable Symbiotic Society
arXiv:2504.17404 · 24 April 2025
Yi Zeng, Yijiao Wang, Enmeng Lu, Dongcheng Zhao, Bing Han, Haibo Tong, Yao Liang, Dongqi Liang, Kang Sun, Lei Wang, Yitao Liang, Chao Liu, Yaodong Yang, Yi Zeng, Boyuan Chen, Jinyu Fan

Papers citing "Super Co-alignment of Human and AI for Sustainable Symbiotic Society" (34 of 34 papers shown)

AssistanceZero: Scalably Solving Assistance Games
Cassidy Laidlaw, Eli Bronstein, Timothy Guo, Dylan Feng, Lukas Berglund, Justin Svegliato, Stuart J. Russell, Anca Dragan
09 Apr 2025

Weak-to-Strong Generalization Through the Data-Centric Lens
Changho Shin, John Cooper, Frederic Sala
05 Dec 2024

Building Altruistic and Moral AI Agent with Brain-inspired Affective Empathy Mechanisms
Feifei Zhao, Hui Feng, Haibo Tong, Zhengqiang Han, Enmeng Lu, Yinqian Sun, Yi Zeng
29 Oct 2024

Weak-to-Strong Preference Optimization: Stealing Reward from Weak Aligned Model
Wenhong Zhu, Zhiwei He, Xiaofeng Wang, Pengfei Liu, Rui Wang
24 Oct 2024

MACPO: Weak-to-Strong Alignment via Multi-Agent Contrastive Preference Optimization
Yougang Lyu, Lingyong Yan, Zihan Wang, Dawei Yin, Pengjie Ren, Maarten de Rijke, Zhaochun Ren
10 Oct 2024

Your Weak LLM is Secretly a Strong Teacher for Alignment
Leitian Tao, Yixuan Li
13 Sep 2024

Prover-Verifier Games improve legibility of LLM outputs
Jan Hendrik Kirchner, Yining Chen, Harri Edwards, Jan Leike, Nat McAleese, Yuri Burda
18 Jul 2024

On scalable oversight with weak LLMs judging strong LLMs
Zachary Kenton, Noah Y. Siegel, János Kramár, Jonah Brown-Cohen, Samuel Albanie, ..., Rishabh Agarwal, David Lindner, Yunhao Tang, Noah D. Goodman, Rohin Shah
05 Jul 2024

ProgressGym: Alignment with a Millennium of Moral Progress
Tianyi Qiu, Yang Zhang, Xuchuan Huang, Jasmine Xinze Li, Yalan Qin, Yaodong Yang
28 Jun 2024

Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization
Wenkai Yang, Shiqi Shen, Guangyao Shen, Zhi Gong, Yankai Lin, Ji-Rong Wen
17 Jun 2024

Aligner: Efficient Alignment by Learning to Correct
Jiaming Ji, Boyuan Chen, Hantao Lou, Chongye Guo, Borong Zhang, Xuehai Pan, Juntao Dai, Tianyi Qiu, Yaodong Yang
04 Feb 2024

Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, Quanquan Gu
02 Jan 2024

Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, ..., Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, Jeff Wu
14 Dec 2023

Managing extreme AI risks amid rapid progress
Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, ..., Philip Torr, Stuart J. Russell, Daniel Kahneman, J. Brauner, Sören Mindermann
26 Oct 2023

Towards Understanding Sycophancy in Language Models
Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, ..., Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez
20 Oct 2023

The Rise and Potential of Large Language Model Based Agents: A Survey
Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, ..., Wenjuan Qin, Yongyan Zheng, Xipeng Qiu, Xuanjing Huang, Tao Gui
14 Sep 2023

AI Deception: A Survey of Examples, Risks, and Potential Solutions
Peter S. Park, Simon Goldstein, Aidan O'Gara, Michael Chen, Dan Hendrycks
28 Aug 2023

Deception Abilities Emerged in Large Language Models
Thilo Hagendorff
31 Jul 2023

An Overview of Catastrophic AI Risks
Dan Hendrycks, Mantas Mazeika, Thomas Woodside
21 Jun 2023

Improving Factuality and Reasoning in Language Models through Multiagent Debate
Yilun Du, Shuang Li, Antonio Torralba, J. Tenenbaum, Igor Mordatch
23 May 2023

CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing
Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, Weizhu Chen
19 May 2023

Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, John Kernion, ..., Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom B. Brown, Jared Kaplan
15 Dec 2022

Measuring Progress on Scalable Oversight for Large Language Models
Sam Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, ..., Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Benjamin Mann, Jared Kaplan
04 Nov 2022

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, ..., Jack Clark, Sam McCandlish, C. Olah, Benjamin Mann, Jared Kaplan
12 Apr 2022

Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
04 Mar 2022

Red Teaming Language Models with Language Models
Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, G. Irving
07 Feb 2022

The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models
Alexander Pan, Kush S. Bhatia, Jacob Steinhardt
10 Jan 2022

Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective
Tom Everitt, Marcus Hutter, Ramana Kumar, Victoria Krakovna
13 Aug 2019

Scalable agent alignment via reward modeling: a research direction
Jan Leike, David M. Krueger, Tom Everitt, Miljan Martic, Vishal Maini, Shane Legg
19 Nov 2018

Supervising strong learners by amplifying weak experts
Paul Christiano, Buck Shlegeris, Dario Amodei
19 Oct 2018

AI safety via debate
G. Irving, Paul Christiano, Dario Amodei
02 May 2018

Inverse Reward Design
Dylan Hadfield-Menell, S. Milli, Pieter Abbeel, Stuart J. Russell, Anca Dragan
08 Nov 2017

Concrete Problems in AI Safety
Dario Amodei, C. Olah, Jacob Steinhardt, Paul Christiano, John Schulman, Dandelion Mané
21 Jun 2016

Cooperative Inverse Reinforcement Learning
Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, Stuart J. Russell
09 Jun 2016