Wolf Hidden in Sheep's Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models

23 May 2025
Jiawei Kong, Hao Fang, Xiaochen Yang, Kuofeng Gao, Bin Chen, Shu-Tao Xia, Yaowei Wang, Min Zhang
AAML

Papers citing "Wolf Hidden in Sheep's Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models"

16 of 16 papers shown

Injecting Universal Jailbreak Backdoors into LLMs in Minutes
Zhuowei Chen, Qiannan Zhang, Shichao Pei
48 · 2 · 0 · 09 Feb 2025

DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails
Yihe Deng, Yu Yang, Junkai Zhang, Wei Wang, B. Li
OffRL
85 · 8 · 0 · 07 Feb 2025

ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools
Team GLM: Aohan Zeng, Bin Xu, Bowen Wang, ..., Zhaoyu Wang, Zhen Yang, Zhengxiao Du, Zhenyu Hou, Zihan Wang
ALM
106 · 601 · 0 · 18 Jun 2024

Safety Alignment Should Be Made More Than Just a Few Tokens Deep
Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, Peter Henderson
85 · 122 · 0 · 10 Jun 2024

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion
AAML
146 · 206 · 0 · 02 Apr 2024

BadEdit: Backdooring large language models by model editing
Yanzhou Li, Tianlin Li, Kangjie Chen, Jian Zhang, Shangqing Liu, Wenhan Wang, Tianwei Zhang, Yang Liu
SyDa, AAML, KELM
85 · 62 · 0 · 20 Mar 2024

Multi-Trigger Backdoor Attacks: More Triggers, More Threats
Yige Li, Xingjun Ma, Jiabo He, Hanxun Huang, Yu-Gang Jiang
AAML
62 · 5 · 0 · 27 Jan 2024

BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models
Zhen Xiang, Fengqing Jiang, Zidi Xiong, Bhaskar Ramasubramanian, Radha Poovendran, Bo Li
LRM, SILM
66 · 46 · 0 · 20 Jan 2024

Universal Jailbreak Backdoors from Poisoned Human Feedback
Javier Rando, Florian Tramèr
68 · 70 · 0 · 24 Nov 2023

Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations
Zeming Wei, Yifei Wang, Ang Li, Yichuan Mo, Yisen Wang
84 · 271 · 0 · 10 Oct 2023

Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection
Jun Yan, Vikas Yadav, Shiyang Li, Lichang Chen, Zheng Tang, Hai Wang, Vijay Srinivasan, Xiang Ren, Hongxia Jin
SILM
66 · 98 · 0 · 31 Jul 2023

Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson
287 · 1,449 · 0 · 27 Jul 2023

BackdoorBench: A Comprehensive Benchmark of Backdoor Learning
Baoyuan Wu, Hongrui Chen, Ruotong Wang, Zihao Zhu, Shaokui Wei, Danni Yuan, Chaoxiao Shen
ELM, AAML
60 · 144 · 0 · 25 Jun 2022

Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, ..., Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei
BDL
731 · 41,894 · 0 · 28 May 2020

Weight Poisoning Attacks on Pre-trained Models
Keita Kurita, Paul Michel, Graham Neubig
AAML, SILM
134 · 450 · 0 · 14 Apr 2020

BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain
Tianyu Gu, Brendan Dolan-Gavitt, S. Garg
SILM
108 · 1,772 · 0 · 22 Aug 2017