Wolf Hidden in Sheep's Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models
arXiv 2505.17601 · 23 May 2025 · AAML
Jiawei Kong, Hao Fang, Xiaochen Yang, Kuofeng Gao, Bin Chen, Shu-Tao Xia, Yaowei Wang, Min Zhang

Papers citing "Wolf Hidden in Sheep's Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models" (16 of 16 papers shown)

Injecting Universal Jailbreak Backdoors into LLMs in Minutes
Zhuowei Chen, Qiannan Zhang, Shichao Pei
09 Feb 2025

DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails
Yihe Deng, Yu Yang, Junkai Zhang, Wei Wang, B. Li
07 Feb 2025 · OffRL

ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools
Team GLM: Aohan Zeng, Bin Xu, Bowen Wang, ..., Zhaoyu Wang, Zhen Yang, Zhengxiao Du, Zhenyu Hou, Zihan Wang
18 Jun 2024 · ALM

Safety Alignment Should Be Made More Than Just a Few Tokens Deep
Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, Peter Henderson
10 Jun 2024

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion
02 Apr 2024 · AAML

BadEdit: Backdooring large language models by model editing
Yanzhou Li, Tianlin Li, Kangjie Chen, Jian Zhang, Shangqing Liu, Wenhan Wang, Tianwei Zhang, Yang Liu
20 Mar 2024 · SyDa, AAML, KELM

Multi-Trigger Backdoor Attacks: More Triggers, More Threats
Yige Li, Xingjun Ma, Jiabo He, Hanxun Huang, Yu-Gang Jiang
27 Jan 2024 · AAML

BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models
Zhen Xiang, Fengqing Jiang, Zidi Xiong, Bhaskar Ramasubramanian, Radha Poovendran, Bo Li
20 Jan 2024 · LRM, SILM

Universal Jailbreak Backdoors from Poisoned Human Feedback
Javier Rando, Florian Tramèr
24 Nov 2023

Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations
Zeming Wei, Yifei Wang, Ang Li, Yichuan Mo, Yisen Wang
10 Oct 2023

Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection
Jun Yan, Vikas Yadav, Shiyang Li, Lichang Chen, Zheng Tang, Hai Wang, Vijay Srinivasan, Xiang Ren, Hongxia Jin
31 Jul 2023 · SILM

Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson
27 Jul 2023

BackdoorBench: A Comprehensive Benchmark of Backdoor Learning
Baoyuan Wu, Hongrui Chen, Ruotong Wang, Zihao Zhu, Shaokui Wei, Danni Yuan, Chaoxiao Shen
25 Jun 2022 · ELM, AAML

Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, ..., Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei
28 May 2020 · BDL

Weight Poisoning Attacks on Pre-trained Models
Keita Kurita, Paul Michel, Graham Neubig
14 Apr 2020 · AAML, SILM

BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain
Tianyu Gu, Brendan Dolan-Gavitt, S. Garg
22 Aug 2017 · SILM