Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections
Yuanpu Cao, Bochuan Cao, Jinghui Chen
arXiv:2312.00027 · 15 November 2023
Papers citing "Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections" (20 papers shown)
BadLingual: A Novel Lingual-Backdoor Attack against Large Language Models
Zhilin Wang, Hongwei Li, Rui Zhang, Wenbo Jiang, Kangjie Chen, Tianwei Zhang, Qingchuan Zhao, Jiawei Li
AAML · 0 citations · 06 May 2025

BadJudge: Backdoor Vulnerabilities of LLM-as-a-Judge
Terry Tong, Fei-Yue Wang, Zhe Zhao, M. Chen
AAML, ELM · 1 citation · 01 Mar 2025

ELBA-Bench: An Efficient Learning Backdoor Attacks Benchmark for Large Language Models
X. Liu, Siyuan Liang, M. Han, Yong Luo, Aishan Liu, Xiantao Cai, Zheng He, Dacheng Tao
AAML, SILM, ELM · 1 citation · 22 Feb 2025

Topic-FlipRAG: Topic-Orientated Adversarial Opinion Manipulation Attacks to Retrieval-Augmented Generation Models
Jiawei Liu, Zhuo Chen, Miaokun Chen, Fengchang Yu, Fan Zhang, Xiaofeng Wang, Wei Lu, Xiaozhong Liu
AAML, SILM · 0 citations · 03 Feb 2025

Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse Reinforcement Learning
Jared Joselowitz, Arjun Jagota, Satyapriya Krishna, Sonali Parbhoo, Nyal Patel
0 citations · 16 Oct 2024

Do Influence Functions Work on Large Language Models?
Zhe Li, Wei Zhao, Yige Li, Tianlong Chen
TDI · 1 citation · 30 Sep 2024

Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)
Apurv Verma, Satyapriya Krishna, Sebastian Gehrmann, Madhavan Seshadri, Anu Pradhan, Tom Ault, Leslie Barrett, David Rabinowitz, John Doucette, Nhathai Phan
10 citations · 20 Jul 2024

Securing Multi-turn Conversational Language Models Against Distributed Backdoor Triggers
Terry Tong, Lyne Tchapmi, Qin Liu, Muhao Chen
AAML, SILM · 1 citation · 04 Jul 2024

BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models
Yi Zeng, Weiyu Sun, Tran Ngoc Huynh, Dawn Song, Bo Li, Ruoxi Jia
AAML, LLMSV · 19 citations · 24 Jun 2024

Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization
Yuanpu Cao, Tianrong Zhang, Bochuan Cao, Ziyi Yin, Lu Lin, Fenglong Ma, Jinghui Chen
LLMSV · 19 citations · 28 May 2024

TrojanRAG: Retrieval-Augmented Generation Can Be Backdoor Driver in Large Language Models
Pengzhou Cheng, Yidong Ding, Tianjie Ju, Zongru Wu, Wei Du, Ping Yi, Zhuosheng Zhang, Gongshen Liu
SILM, AAML · 19 citations · 22 May 2024

Immunization against harmful fine-tuning attacks
Domenic Rosati, Jan Wehner, Kai Williams, Lukasz Bartoszcze, Jan Batzner, Hassan Sajjad, Frank Rudzicz
AAML · 16 citations · 26 Feb 2024

Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents
Wenkai Yang, Xiaohan Bi, Yankai Lin, Sishuo Chen, Jie Zhou, Xu Sun
LLMAG, AAML · 53 citations · 17 Feb 2024

Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey
Zhichen Dong, Zhanhui Zhou, Chao Yang, Jing Shao, Yu Qiao
ELM · 55 citations · 14 Feb 2024

Red-Teaming for Generative AI: Silver Bullet or Security Theater?
Michael Feffer, Anusha Sinha, Wesley Hanwen Deng, Zachary Chase Lipton, Hoda Heidari
AAML · 67 citations · 29 Jan 2024

Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases
Rishabh Bhardwaj, Soujanya Poria
ALM · 15 citations · 22 Oct 2023

Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM
Bochuan Cao, Yu Cao, Lu Lin, Jinghui Chen
AAML · 133 citations · 18 Sep 2023

Backdoor Attacks and Countermeasures in Natural Language Processing Models: A Comprehensive Security Review
Pengzhou Cheng, Zongru Wu, Wei Du, Haodong Zhao, Wei Lu, Gongshen Liu
SILM, AAML · 17 citations · 12 Sep 2023

Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
OSLM, ALM · 11,953 citations · 04 Mar 2022

Gradient-based Adversarial Attacks against Text Transformers
Chuan Guo, Alexandre Sablayrolles, Hervé Jégou, Douwe Kiela
SILM · 227 citations · 15 Apr 2021