The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness
Neeraj Varshney, Pavel Dolin, Agastya Seth, Chitta Baral
arXiv:2401.00287 · 30 December 2023 · AAML, ELM
Papers citing "The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness" (31 papers)
Adversarial Suffix Filtering: a Defense Pipeline for LLMs
David Khachaturov, Robert D. Mullins · AAML · 14 May 2025
LLM Security: Vulnerabilities, Attacks, Defenses, and Countermeasures
Francisco Aguilera-Martínez, Fernando Berzal · PILM · 02 May 2025
Cancer-Myth: Evaluating AI Chatbot on Patient Questions with False Presuppositions
Wang Zhu, Tianqi Chen, Ching Ying Lin, Jade Law, Mazen Jizzini, Jorge J. Nieva, Ruishan Liu, Robin Jia · 15 Apr 2025
Exploring Backdoor Attack and Defense for LLM-empowered Recommendations
Liangbo Ning, Wenqi Fan, Qing Li · AAML, SILM · 15 Apr 2025
The Structural Safety Generalization Problem
Julius Broomfield, Tom Gibbs, Ethan Kosak-Hine, George Ingebretsen, Tia Nasir, Jason Zhang, Reihaneh Iranmanesh, Sara Pieri, Reihaneh Rabbany, Kellin Pelrine · AAML · 13 Apr 2025
Retrieval-Augmented Purifier for Robust LLM-Empowered Recommendation
Liangbo Ning, Wenqi Fan, Qing Li · AAML · 03 Apr 2025
Code Red! On the Harmfulness of Applying Off-the-shelf Large Language Models to Programming Tasks
Ali Al-Kaswan, Sebastian Deatc, Begüm Koç, A. van Deursen, M. Izadi · AAML · 02 Apr 2025
PiCo: Jailbreaking Multimodal Large Language Models via Pictorial Code Contextualization
Aofan Liu, Lulu Tang, Ting Pan, Yuguo Yin, Bin Wang, Ao Yang · MLLM, AAML · 02 Apr 2025
Epistemic Alignment: A Mediating Framework for User-LLM Knowledge Delivery
Nicholas Clark, Hua Shen, Bill Howe, Tanushree Mitra · 01 Apr 2025
Beyond Surface-Level Patterns: An Essence-Driven Defense Framework Against Jailbreak Attacks in LLMs
Shiyu Xiang, Ansen Zhang, Yanfei Cao, Yang Fan, Ronghao Chen · AAML · 26 Feb 2025
SafeInt: Shielding Large Language Models from Jailbreak Attacks via Safety-Aware Representation Intervention
Jiaqi Wu, Chen Chen, Chunyan Hou, Xiaojie Yuan · AAML · 24 Feb 2025
No Free Lunch for Defending Against Prefilling Attack by In-Context Learning
Zhiyu Xue, Guangliang Liu, Bocheng Chen, K. Johnson, Ramtin Pedarsani · AAML · 13 Dec 2024
Dynamic Guided and Domain Applicable Safeguards for Enhanced Security in Large Language Models
He Cao, Weidi Luo, Zijing Liu, Yu Wang, Bing Feng, Yuan Yao, Yu Li · AAML · 23 Oct 2024
SoK: Prompt Hacking of Large Language Models
Baha Rababah, Shang Wu, Matthew Kwiatkowski, Carson Leung, Cuneyt Gurcan Akcora · AAML · 16 Oct 2024
Bridging Today and the Future of Humanity: AI Safety in 2024 and Beyond
Shanshan Han · 09 Oct 2024
ASPIRER: Bypassing System Prompts With Permutation-based Backdoors in LLMs
Lu Yan, Siyuan Cheng, Xuan Chen, Kaiyuan Zhang, Guangyu Shen, Zhuo Zhang, Xiangyu Zhang · AAML, SILM · 05 Oct 2024
Alignment with Preference Optimization Is All You Need for LLM Safety
Réda Alami, Ali Khalifa Almansoori, Ahmed Alzubaidi, M. Seddik, Mugariya Farooq, Hakim Hacid · 12 Sep 2024
Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models
Bang An, Sicheng Zhu, Ruiyi Zhang, Michael-Andrei Panaitescu-Liess, Yuancheng Xu, Furong Huang · AAML · 01 Sep 2024
Know Your Limits: A Survey of Abstention in Large Language Models
Bingbing Wen, Jihan Yao, Shangbin Feng, Chenjun Xu, Yulia Tsvetkov, Bill Howe, Lucy Lu Wang · 25 Jul 2024
PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing
Blazej Manczak, Eliott Zemour, Eric Lin, Vaikkunth Mugunthan · 23 Jul 2024
LoRA-Guard: Parameter-Efficient Guardrail Adaptation for Content Moderation of Large Language Models
Hayder Elesedy, Pedro M. Esperança, Silviu Vlad Oprea, Mete Ozay · KELM · 03 Jul 2024
JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models
Haibo Jin, Leyang Hu, Xinuo Li, Peiyan Zhang, Chonghan Chen, Jun Zhuang, Haohan Wang · PILM · 26 Jun 2024
Chaos with Keywords: Exposing Large Language Models Sycophantic Hallucination to Misleading Keywords and Evaluating Defense Strategies
Aswin Rrv, Nemika Tyagi, Md Nayem Uddin, Neeraj Varshney, Chitta Baral · 06 Jun 2024
A Theoretical Understanding of Self-Correction through In-context Alignment
Yifei Wang, Yuyang Wu, Zeming Wei, Stefanie Jegelka, Yisen Wang · LRM · 28 May 2024
AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts
Shaona Ghosh, Prasoon Varshney, Erick Galinkin, Christopher Parisien · ELM · 09 Apr 2024
Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes
Xiaomeng Hu, Pin-Yu Chen, Tsung-Yi Ho · AAML · 01 Mar 2024
ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings
Hao Wang, Hao Li, Minlie Huang, Lei Sha · AAML · 25 Feb 2024
Defending Jailbreak Prompts via In-Context Adversarial Game
Yujun Zhou, Yufei Han, Haomin Zhuang, Kehan Guo, Zhenwen Liang, Hongyan Bao, Xiangliang Zhang · LLMAG, AAML · 20 Feb 2024
Dr. Jekyll and Mr. Hyde: Two Faces of LLMs
Matteo Gioele Collu, Tom Janssen-Groesbeek, Stefanos Koffas, Mauro Conti, S. Picek · 06 Dec 2023
Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations
Zeming Wei, Yifei Wang, Ang Li, Yichuan Mo, Yisen Wang · 10 Oct 2023
BBQ: A Hand-Built Bias Benchmark for Question Answering
Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, Sam Bowman · 15 Oct 2021