Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2307.10569
Cited By
v1
v2 (latest)
Deceptive Alignment Monitoring
20 July 2023
Andres Carranza
Dhruv Pai
Rylan Schaeffer
Arnuv Tandon
Oluwasanmi Koyejo
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"Deceptive Alignment Monitoring"
13 / 13 papers shown
Title
Personality Alignment of Large Language Models
Minjun Zhu
Linyi Yang
Yue Zhang
Yue Zhang
ALM
119
8
0
21 Aug 2024
FACADE: A Framework for Adversarial Circuit Anomaly Detection and Evaluation
Dhruv Pai
Andres Carranza
Rylan Schaeffer
Arnuv Tandon
Oluwasanmi Koyejo
AAML
64
2
0
20 Jul 2023
Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision
Zhiqing Sun
Songlin Yang
Qinhong Zhou
Hongxin Zhang
Zhenfang Chen
David D. Cox
Yiming Yang
Chuang Gan
SyDa
ALM
110
337
0
04 May 2023
Aligning Text-to-Image Models using Human Feedback
Kimin Lee
Hao Liu
Moonkyung Ryu
Olivia Watkins
Yuqing Du
Craig Boutilier
Pieter Abbeel
Mohammad Ghavamzadeh
S. Gu
EGVM
115
285
0
23 Feb 2023
Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai
Saurav Kadavath
Sandipan Kundu
Amanda Askell
John Kernion
...
Dario Amodei
Nicholas Joseph
Sam McCandlish
Tom B. Brown
Jared Kaplan
SyDa
MoMe
216
1,646
0
15 Dec 2022
Mass-Editing Memory in a Transformer
Kevin Meng
Arnab Sen Sharma
A. Andonian
Yonatan Belinkov
David Bau
KELM
VLM
154
599
0
13 Oct 2022
The Alignment Problem from a Deep Learning Perspective
Richard Ngo
Lawrence Chan
Sören Mindermann
137
192
0
30 Aug 2022
Emergent Abilities of Large Language Models
Jason W. Wei
Yi Tay
Rishi Bommasani
Colin Raffel
Barret Zoph
...
Tatsunori Hashimoto
Oriol Vinyals
Percy Liang
J. Dean
W. Fedus
ELM
ReLM
LRM
295
2,521
0
15 Jun 2022
Memory-Based Model Editing at Scale
E. Mitchell
Charles Lin
Antoine Bosselut
Christopher D. Manning
Chelsea Finn
KELM
116
362
0
13 Jun 2022
A Modern Self-Referential Weight Matrix That Learns to Modify Itself
Kazuki Irie
Imanol Schlag
Róbert Csordás
Jürgen Schmidhuber
50
28
0
11 Feb 2022
Locating and Editing Factual Associations in GPT
Kevin Meng
David Bau
A. Andonian
Yonatan Belinkov
KELM
253
1,390
0
10 Feb 2022
Unsolved Problems in ML Safety
Dan Hendrycks
Nicholas Carlini
John Schulman
Jacob Steinhardt
264
294
0
28 Sep 2021
Risks from Learned Optimization in Advanced Machine Learning Systems
Evan Hubinger
Chris van Merwijk
Vladimir Mikulik
Joar Skalse
Scott Garrabrant
99
154
0
05 Jun 2019
1