ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2502.12206
  4. Cited By
Evaluating the Paperclip Maximizer: Are RL-Based Language Models More Likely to Pursue Instrumental Goals?

Evaluating the Paperclip Maximizer: Are RL-Based Language Models More Likely to Pursue Instrumental Goals?

16 February 2025
Yufei He
Yuexin Li
Jiaying Wu
Yuan Sui
Yulin Chen
Bryan Hooi
    ALM
ArXivPDFHTML

Papers citing "Evaluating the Paperclip Maximizer: Are RL-Based Language Models More Likely to Pursue Instrumental Goals?"

5 / 5 papers shown
Title
Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction
Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction
Y. Chen
Haoran Li
Yuan Sui
Y. Liu
Yufei He
Y. Song
Bryan Hooi
AAML
SILM
63
0
0
29 Apr 2025
AI Awareness
AI Awareness
X. Li
Haoyuan Shi
Rongwu Xu
Wei Xu
54
0
0
25 Apr 2025
Safety in Large Reasoning Models: A Survey
Safety in Large Reasoning Models: A Survey
Cheng Wang
Y. Liu
B. Li
Duzhen Zhang
Z. Li
Junfeng Fang
Bryan Hooi
LRM
145
1
0
24 Apr 2025
Efficient Inference for Large Reasoning Models: A Survey
Efficient Inference for Large Reasoning Models: A Survey
Y. Liu
Jiaying Wu
Yufei He
Hongcheng Gao
Hongyu Chen
Baolong Bi
Jiaheng Zhang
Zhiqi Huang
Bryan Hooi
LLMAG
LRM
67
7
0
29 Mar 2025
BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger
BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger
Yulin Chen
Haoran Li
Zihao Zheng
Zihao Zheng
Yangqiu Song
Bryan Hooi
45
6
0
17 Aug 2024
1