RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors?
Rohan Gupta, Erik Jenner
17 June 2025 · arXiv:2506.14261

Papers citing "RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors?" (6 papers)

  • Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
    Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, M. Guan, Aleksander Mądry, Wojciech Zaremba, J. Pachocki, David Farhi
    LRM · 14 Mar 2025

  • Are Sparse Autoencoders Useful? A Case Study in Sparse Probing
    Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, Neel Nanda
    23 Feb 2025

  • DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Jun-Mei Song, ..., Haowei Zhang, Mingchuan Zhang, Yiming Li, Yu-Huan Wu, Daya Guo
    ReLM, LRM · 05 Feb 2024

  • Universal and Transferable Adversarial Attacks on Aligned Language Models
    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson
    27 Jul 2023

  • Adversarial Policies: Attacking Deep Reinforcement Learning
    Adam Gleave, Michael Dennis, Cody Wild, Neel Kant, Sergey Levine, Stuart J. Russell
    AAML · 25 May 2019

  • Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples
    Anish Athalye, Nicholas Carlini, D. Wagner
    AAML · 01 Feb 2018