RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors?
Rohan Gupta, Erik Jenner
arXiv:2506.14261, 17 June 2025
Papers citing "RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors?" (7 papers):
Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, M. Guan, Aleksander Mądry, Wojciech Zaremba, J. Pachocki, David Farhi. "Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation." 14 Mar 2025.
Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, Neel Nanda. "Are Sparse Autoencoders Useful? A Case Study in Sparse Probing." 23 Feb 2025.
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Jun-Mei Song, ..., Haowei Zhang, Mingchuan Zhang, Yiming Li, Yu-Huan Wu, Daya Guo. "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models." 05 Feb 2024.
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson. "Universal and Transferable Adversarial Attacks on Aligned Language Models." 27 Jul 2023.
J. E. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen. "LoRA: Low-Rank Adaptation of Large Language Models." 17 Jun 2021.
Adam Gleave, Michael Dennis, Cody Wild, Neel Kant, Sergey Levine, Stuart J. Russell. "Adversarial Policies: Attacking Deep Reinforcement Learning." 25 May 2019.
Anish Athalye, Nicholas Carlini, D. Wagner. "Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples." 01 Feb 2018.