Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2502.12206
Cited By
Evaluating the Paperclip Maximizer: Are RL-Based Language Models More Likely to Pursue Instrumental Goals?
16 February 2025
Yufei He
Yuexin Li
Jiaying Wu
Yuan Sui
Yulin Chen
Bryan Hooi
ALM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Evaluating the Paperclip Maximizer: Are RL-Based Language Models More Likely to Pursue Instrumental Goals?"
5 / 5 papers shown
Title
Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction
Y. Chen
Haoran Li
Yuan Sui
Y. Liu
Yufei He
Y. Song
Bryan Hooi
AAML
SILM
63
0
0
29 Apr 2025
AI Awareness
X. Li
Haoyuan Shi
Rongwu Xu
Wei Xu
54
0
0
25 Apr 2025
Safety in Large Reasoning Models: A Survey
Cheng Wang
Y. Liu
B. Li
Duzhen Zhang
Z. Li
Junfeng Fang
Bryan Hooi
LRM
145
1
0
24 Apr 2025
Efficient Inference for Large Reasoning Models: A Survey
Y. Liu
Jiaying Wu
Yufei He
Hongcheng Gao
Hongyu Chen
Baolong Bi
Jiaheng Zhang
Zhiqi Huang
Bryan Hooi
LLMAG
LRM
67
7
0
29 Mar 2025
BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger
Yulin Chen
Haoran Li
Zihao Zheng
Zihao Zheng
Yangqiu Song
Bryan Hooi
45
6
0
17 Aug 2024
1