Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

    LRM

Papers citing "Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation"

50 / 64 papers shown
Title
Language Models Learn to Mislead Humans via RLHF
Language Models Learn to Mislead Humans via RLHF
Jiaxin Wen
Ruiqi Zhong
Akbir Khan
Ethan Perez
Jacob Steinhardt
Minlie Huang
Samuel R. Bowman
He He
Shi Feng
81
43
0
19 Sep 2024

We use cookies and other tracking technologies to improve your browsing experience on our website, to show you personalized content and targeted ads, to analyze our website traffic, and to understand where our visitors are coming from. See our policy.