ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2503.11926
  4. Cited By
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

14 March 2025
Bowen Baker
Joost Huizinga
Leo Gao
Zehao Dou
M. Guan
Aleksander Mądry
Wojciech Zaremba
J. Pachocki
David Farhi
    LRM
ArXiv (abs)PDFHTML

Papers citing "Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation"

14 / 64 papers shown
Title
Training language models to follow instructions with human feedback
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLMALM
888
13,207
0
04 Mar 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason W. Wei
Xuezhi Wang
Dale Schuurmans
Maarten Bosma
Brian Ichter
F. Xia
Ed H. Chi
Quoc Le
Denny Zhou
LM&RoLRMAI4CEReLM
850
9,714
0
28 Jan 2022
The Effects of Reward Misspecification: Mapping and Mitigating
  Misaligned Models
The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models
Alexander Pan
Kush S. Bhatia
Jacob Steinhardt
98
182
0
10 Jan 2022
WebGPT: Browser-assisted question-answering with human feedback
WebGPT: Browser-assisted question-answering with human feedback
Reiichiro Nakano
Jacob Hilton
S. Balaji
Jeff Wu
Ouyang Long
...
Gretchen Krueger
Kevin Button
Matthew Knight
B. Chess
John Schulman
ALMRALM
196
1,294
0
17 Dec 2021
Show Your Work: Scratchpads for Intermediate Computation with Language
  Models
Show Your Work: Scratchpads for Intermediate Computation with Language Models
Maxwell Nye
Anders Andreassen
Guy Gur-Ari
Henryk Michalewski
Jacob Austin
...
Aitor Lewkowycz
Maarten Bosma
D. Luan
Charles Sutton
Augustus Odena
ReLMLRM
183
756
0
30 Nov 2021
Training Verifiers to Solve Math Word Problems
Training Verifiers to Solve Math Word Problems
K. Cobbe
V. Kosaraju
Mohammad Bavarian
Mark Chen
Heewoo Jun
...
Jerry Tworek
Jacob Hilton
Reiichiro Nakano
Christopher Hesse
John Schulman
ReLMOffRLLRM
350
4,596
0
27 Oct 2021
Internet-Augmented Dialogue Generation
Internet-Augmented Dialogue Generation
M. Komeili
Kurt Shuster
Jason Weston
RALM
300
289
0
15 Jul 2021
Language Models are Few-Shot Learners
Language Models are Few-Shot Learners
Tom B. Brown
Benjamin Mann
Nick Ryder
Melanie Subbiah
Jared Kaplan
...
Christopher Berner
Sam McCandlish
Alec Radford
Ilya Sutskever
Dario Amodei
BDL
889
42,463
0
28 May 2020
Towards Faithfully Interpretable NLP Systems: How should we define and
  evaluate faithfulness?
Towards Faithfully Interpretable NLP Systems: How should we define and evaluate faithfulness?
Alon Jacovi
Yoav Goldberg
XAI
134
600
0
07 Apr 2020
Emergent Tool Use From Multi-Agent Autocurricula
Emergent Tool Use From Multi-Agent Autocurricula
Bowen Baker
I. Kanitscheider
Todor Markov
Yi Wu
Glenn Powell
Bob McGrew
Igor Mordatch
LRM
68
658
0
17 Sep 2019
A Deep Reinforced Model for Abstractive Summarization
A Deep Reinforced Model for Abstractive Summarization
Romain Paulus
Caiming Xiong
R. Socher
AI4TS
208
1,559
0
11 May 2017
Program Induction by Rationale Generation : Learning to Solve and
  Explain Algebraic Word Problems
Program Induction by Rationale Generation : Learning to Solve and Explain Algebraic Word Problems
Wang Ling
Dani Yogatama
Chris Dyer
Phil Blunsom
AIMat
109
737
0
11 May 2017
Data-efficient Deep Reinforcement Learning for Dexterous Manipulation
Data-efficient Deep Reinforcement Learning for Dexterous Manipulation
I. Popov
N. Heess
Timothy Lillicrap
Roland Hafner
Gabriel Barth-Maron
Matej Vecerík
Thomas Lampe
Yuval Tassa
Tom Erez
Martin Riedmiller
OffRL
90
265
0
10 Apr 2017
Concrete Problems in AI Safety
Concrete Problems in AI Safety
Dario Amodei
C. Olah
Jacob Steinhardt
Paul Christiano
John Schulman
Dandelion Mané
253
2,405
0
21 Jun 2016
Previous
12