Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models
19 June 2025
Biao Yi, Tiansheng Huang, Sishuo Chen, Tong Li, Zheli Liu, Zhixuan Chu, Yiming Li
AAML

Papers citing "Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models"

50 / 51 papers shown
REFINE: Inversion-Free Backdoor Defense via Model Reprogramming
Yuxiao Chen, Shuo Shao, Enhao Huang, Yiming Li, Pin-Yu Chen, Zhan Qin, Kui Ren
AAML
99 · 9 · 0
22 Feb 2025
Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey
Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu
AAML
112 · 46 · 0
26 Sep 2024
Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning
Tiansheng Huang, Gautam Bhattacharya, Pratik Joshi, Josh Kimball, Ling Liu
AAML, MoMe
109 · 30 · 0
18 Aug 2024
Tamper-Resistant Safeguards for Open-Weight LLMs
Rishub Tamirisa, Bhrugu Bharathi, Long Phan, Andy Zhou, Alice Gatti, ..., Andy Zou, Dawn Song, Bo Li, Dan Hendrycks, Mantas Mazeika
AAML, MU
127 · 63 · 0
01 Aug 2024
BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models
Yi Zeng, Weiyu Sun, Tran Ngoc Huynh, Dawn Song, Bo Li, Ruoxi Jia
AAML, LLMSV
68 · 25 · 0
24 Jun 2024
CleanGen: Mitigating Backdoor Attacks for Generation Tasks in Large Language Models
Yuetai Li, Zhangchen Xu, Fengqing Jiang, Luyao Niu, D. Sahabandu, Bhaskar Ramasubramanian, Radha Poovendran
SILM, AAML
108 · 10 · 0
18 Jun 2024
Safety Alignment Should Be Made More Than Just a Few Tokens Deep
Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, Peter Henderson
116 · 141 · 0
10 Jun 2024
Improved Techniques for Optimization-Based Jailbreaking on Large Language Models
Xiaojun Jia, Tianyu Pang, Chao Du, Yihao Huang, Jindong Gu, Yang Liu, Xiaochun Cao, Min Lin
AAML
90 · 41 · 0
31 May 2024
Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning
Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu
138 · 32 · 0
28 May 2024
Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models
Chia-Yi Hsu, Yu-Lin Tsai, Chih-Hsun Lin, Pin-Yu Chen, Chia-Mu Yu, Chun-ying Huang
130 · 56 · 0
27 May 2024
BadActs: A Universal Backdoor Defense in the Activation Space
Biao Yi, Sishuo Chen, Yiming Li, Tong Li, Baolei Zhang, Zheli Liu
AAML
71 · 7 · 0
18 May 2024
IBD-PSC: Input-level Backdoor Detection via Parameter-oriented Scaling Consistency
Linshan Hou, Ruili Feng, Zhongyun Hua, Wei Luo, Leo Yu Zhang, Yiming Li
AAML
81 · 23 · 0
16 May 2024
Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs
Javier Rando, Francesco Croce, Kryštof Mitka, Stepan Shabalin, Maksym Andriushchenko, Nicolas Flammarion, F. Tramèr
86 · 17 · 0
22 Apr 2024
Exploring Backdoor Vulnerabilities of Chat Models
Yunzhuo Hao, Wenkai Yang, Yankai Lin
SILM, KELM
64 · 11 · 0
03 Apr 2024
Immunization against harmful fine-tuning attacks
Domenic Rosati, Jan Wehner, Kai Williams, Lukasz Bartoszcze, Jan Batzner, Hassan Sajjad, Frank Rudzicz
AAML
102 · 22 · 0
26 Feb 2024
Vaccine: Perturbation-aware Alignment for Large Language Model
Tiansheng Huang, Sihao Hu, Ling Liu
115 · 49 · 0
02 Feb 2024
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Evan Hubinger, Carson E. Denison, Jesse Mu, Mike Lambert, Meg Tong, ..., Sören Mindermann, Ryan Greenblatt, Buck Shlegeris, Nicholas Schiefer, Ethan Perez
LLMAG
89 · 175 · 0
10 Jan 2024
Mixtral of Experts
Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, A. Mensch, Blanche Savary, ..., Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed
MoE, LLMAG
164 · 1,123 · 0
08 Jan 2024
Universal Jailbreak Backdoors from Poisoned Human Feedback
Javier Rando, Florian Tramèr
104 · 75 · 0
24 Nov 2023
Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections
Yuanpu Cao, Bochuan Cao, Jinghui Chen
84 · 28 · 0
15 Nov 2023
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, Danqi Chen
AAML
75 · 312 · 0
10 Oct 2023
Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations
Zeming Wei, Yifei Wang, Ang Li, Yichuan Mo, Yisen Wang
105 · 279 · 0
10 Oct 2023
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, Peter Henderson
SILM
133 · 633 · 0
05 Oct 2023
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
Alexander Robey, Eric Wong, Hamed Hassani, George J. Pappas
AAML
126 · 257 · 0
05 Oct 2023
Defending Pre-trained Language Models as Few-shot Learners against Backdoor Attacks
Zhaohan Xi, Tianyu Du, Changjiang Li, Ren Pang, S. Ji, Jinghui Chen, Fenglong Ma, Ting Wang
AAML
68 · 33 · 0
23 Sep 2023
ParaFuzz: An Interpretability-Driven Technique for Detecting Poisoned Samples in NLP
Lu Yan, Zhuo Zhang, Guanhong Tao, Kaiyuan Zhang, Xuan Chen, Guangyu Shen, Xiangyu Zhang
AAML, SILM
102 · 22 · 0
04 Aug 2023
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson
295 · 1,518 · 0
27 Jul 2023
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, ..., Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom
AI4MH, ALM
419 · 12,076 · 0
18 Jul 2023
Towards Stealthy Backdoor Attacks against Speech Recognition via Elements of Sound
Hanbo Cai, Pengcheng Zhang, Hai Dong, Yan Xiao, Stefanos Koffas, Yiming Li
AAML
100 · 31 · 0
17 Jul 2023
DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models
Wei Ping, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, ..., Zinan Lin, Yuk-Kit Cheng, Sanmi Koyejo, Basel Alomair, Yue Liu
124 · 431 · 0
20 Jun 2023
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, ..., Dacheng Li, Eric Xing, Haotong Zhang, Joseph E. Gonzalez, Ion Stoica
ALM, OSLM, ELM
458 · 4,444 · 0
09 Jun 2023
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafael Rafailov, Archit Sharma, E. Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn
ALM
389 · 4,169 · 0
29 May 2023
Backdoor Attack with Sparse and Invisible Trigger
Yinghua Gao, Yiming Li, Xueluan Gong, Zhifeng Li, Shutao Xia, Qianqian Wang
AAML
88 · 23 · 0
11 May 2023
UNICORN: A Unified Backdoor Trigger Inversion Framework
Zhenting Wang, Kai Mei, Juan Zhai, Shiqing Ma
LLMSV
76 · 47 · 0
05 Apr 2023
TrojText: Test-time Invisible Textual Trojan Insertion
Qiang Lou, Ye Liu, Bo Feng
125 · 26 · 0
03 Mar 2023
BadGPT: Exploring Security Vulnerabilities of ChatGPT via Backdoor Attacks to InstructGPT
Jiawen Shi, Yixin Liu, Pan Zhou, Lichao Sun
SILM
61 · 82 · 0
21 Feb 2023
SCALE-UP: An Efficient Black-box Input-level Backdoor Detection via Analyzing Scaled Prediction Consistency
Junfeng Guo, Yiming Li, Xun Chen, Hanqing Guo, Lichao Sun, Cong Liu
AAML, MLAU
75 · 107 · 0
07 Feb 2023
Expose Backdoors on the Way: A Feature-Based Efficient Defense against Textual Backdoor Attacks
Sishuo Chen, Wenkai Yang, Zhiyuan Zhang, Xiaohan Bi, Xu Sun
SILM, AAML
70 · 26 · 0
14 Oct 2022
Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
OSLM, ALM
894 · 13,228 · 0
04 Mar 2022
Constrained Optimization with Dynamic Bound-scaling for Effective NLP Backdoor Defense
Guangyu Shen, Yingqi Liu, Guanhong Tao, Qiuling Xu, Zhuo Zhang, Shengwei An, Shiqing Ma, Xinming Zhang
AAML
81 · 40 · 0
11 Feb 2022
RAP: Robustness-Aware Perturbations for Defending against Backdoor Attacks on NLP Models
Wenkai Yang, Yankai Lin, Peng Li, Jie Zhou, Xu Sun
SILM, AAML
136 · 112 · 0
15 Oct 2021
Mind the Style of Text! Adversarial and Backdoor Attacks Based on Text Style Transfer
Fanchao Qi, Yangyi Chen, Xurui Zhang, Mukai Li, Zhiyuan Liu, Maosong Sun
AAML, SILM
150 · 186 · 0
14 Oct 2021
Defending Against Backdoor Attacks in Natural Language Generation
Xiaofei Sun, Xiaoya Li, Yuxian Meng, Xiang Ao, Leilei Gan, Jiwei Li, Tianwei Zhang
AAML, SILM
90 · 52 · 0
03 Jun 2021
Hidden Killer: Invisible Textual Backdoor Attacks with Syntactic Trigger
Fanchao Qi, Mukai Li, Yangyi Chen, Zhengyan Zhang, Zhiyuan Liu, Yasheng Wang, Maosong Sun
SILM
78 · 234 · 0
26 May 2021
T-Miner: A Generative Approach to Defend Against Trojan Attacks on DNN-based Text Classification
A. Azizi, I. A. Tahmid, Asim Waheed, Neal Mangaokar, Jiameng Pu, M. Javed, Chandan K. Reddy, Bimal Viswanath
AAML
65 · 82 · 0
07 Mar 2021
ONION: A Simple and Effective Defense Against Textual Backdoor Attacks
Fanchao Qi, Yangyi Chen, Mukai Li, Yuan Yao, Zhiyuan Liu, Maosong Sun
AAML
105 · 283 · 0
20 Nov 2020
Backdoor Learning: A Survey
Yiming Li, Yong Jiang, Zhifeng Li, Shutao Xia
AAML
147 · 613 · 0
17 Jul 2020
Weight Poisoning Attacks on Pre-trained Models
Keita Kurita, Paul Michel, Graham Neubig
AAML, SILM
138 · 455 · 0
14 Apr 2020
Design and Evaluation of a Multi-Domain Trojan Detection Method on Deep Neural Networks
Yansong Gao, Yeonjae Kim, Bao Gia Doan, Zhi-Li Zhang, Gongxuan Zhang, Surya Nepal, Damith C. Ranasinghe, Hyoungshick Kim
AAML
71 · 91 · 0
23 Nov 2019
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, M. Lewis, Luke Zettlemoyer, Veselin Stoyanov
AIMat
700 · 24,572 · 0
26 Jul 2019