Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

14 March 2025

Aleksander Mądry

Wojciech Zaremba

ArXiv (abs)PDF HTML

Papers citing "Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation"

14 / 64 papers shown

Title
Training language models to follow instructions with human feedback Long Ouyang Jeff Wu Xu Jiang Diogo Almeida Carroll L. Wainwright ... Amanda Askell Peter Welinder Paul Christiano Jan Leike Ryan J. Lowe OSLM ALM 888 13,207 0 04 Mar 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models Jason W. Wei Xuezhi Wang Dale Schuurmans Maarten Bosma Brian Ichter F. Xia Ed H. Chi Quoc Le Denny Zhou LM&Ro LRM AI4CE ReLM 850 9,714 0 28 Jan 2022
The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models Alexander Pan Kush S. Bhatia Jacob Steinhardt 98 182 0 10 Jan 2022
WebGPT: Browser-assisted question-answering with human feedback Reiichiro Nakano Jacob Hilton S. Balaji Jeff Wu Ouyang Long ... Gretchen Krueger Kevin Button Matthew Knight B. Chess John Schulman ALM RALM 196 1,294 0 17 Dec 2021
Show Your Work: Scratchpads for Intermediate Computation with Language Models Maxwell Nye Anders Andreassen Guy Gur-Ari Henryk Michalewski Jacob Austin ... Aitor Lewkowycz Maarten Bosma D. Luan Charles Sutton Augustus Odena ReLM LRM 183 756 0 30 Nov 2021
Training Verifiers to Solve Math Word Problems K. Cobbe V. Kosaraju Mohammad Bavarian Mark Chen Heewoo Jun ... Jerry Tworek Jacob Hilton Reiichiro Nakano Christopher Hesse John Schulman ReLM OffRL LRM 350 4,596 0 27 Oct 2021
Internet-Augmented Dialogue Generation M. Komeili Kurt Shuster Jason Weston RALM 300 289 0 15 Jul 2021
Language Models are Few-Shot Learners Tom B. Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared Kaplan ... Christopher Berner Sam McCandlish Alec Radford Ilya Sutskever Dario Amodei BDL 889 42,463 0 28 May 2020
Towards Faithfully Interpretable NLP Systems: How should we define and evaluate faithfulness? Alon Jacovi Yoav Goldberg XAI 134 600 0 07 Apr 2020
Emergent Tool Use From Multi-Agent Autocurricula Bowen Baker I. Kanitscheider Todor Markov Yi Wu Glenn Powell Bob McGrew Igor Mordatch LRM 68 658 0 17 Sep 2019
A Deep Reinforced Model for Abstractive Summarization Romain Paulus Caiming Xiong R. Socher AI4TS 208 1,559 0 11 May 2017
Program Induction by Rationale Generation : Learning to Solve and Explain Algebraic Word Problems Wang Ling Dani Yogatama Chris Dyer Phil Blunsom AIMat 109 737 0 11 May 2017
Data-efficient Deep Reinforcement Learning for Dexterous Manipulation I. Popov N. Heess Timothy Lillicrap Roland Hafner Gabriel Barth-Maron Matej Vecerík Thomas Lampe Yuval Tassa Tom Erez Martin Riedmiller OffRL 90 265 0 10 Apr 2017
Concrete Problems in AI Safety Dario Amodei C. Olah Jacob Steinhardt Paul Christiano John Schulman Dandelion Mané 253 2,405 0 21 Jun 2016