Cited By

Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation
arXiv:2505.18556 · 24 May 2025
Jun Zhuang, Haibo Jin, Ye Zhang, Zhengjian Kang, Wenbin Zhang, Gaby G. Dagher, Haohan Wang. [AAML]
Papers citing "Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation" (25 papers)
Detecting Conversational Mental Manipulation with Intent-Aware Prompting. Jiayuan Ma, Hongbin Na, Zehua Wang, Yining Hua, Yue Liu, Wei Wang, Ling-Hao Chen. 11 Dec 2024.
Auto-Intent: Automated Intent Discovery and Self-Exploration for Large Language Model Web Agents. Jaekyeom Kim, Dong-Ki Kim, Lajanugen Logeswaran, Sungryull Sohn, Honglak Lee. [LLMAG, LM&Ro, LRM]. 29 Oct 2024.
Intent Detection in the Age of LLMs. Gaurav Arora, Shreya Jain, Srujana Merugu. 02 Oct 2024.
RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent. Huiyu Xu, Wenhui Zhang, Peng Kuang, Feng Xiao, Rui Zheng, Yunhe Feng, Zhongjie Ba, Kui Ren. [AAML, LLMAG]. 23 Jul 2024.
Jailbreak Attacks and Defenses Against Large Language Models: A Survey. Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, Qi Li. [AAML]. 05 Jul 2024.
Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters. Haibo Jin, Andy Zhou, Joe D. Menke, Haohan Wang. 30 May 2024.
WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response. Tianrong Zhang, Bochuan Cao, Yuanpu Cao, Lu Lin, Prasenjit Mitra, Jinghui Chen. [AAML]. 22 May 2024.
Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent. Shang Shang, Xinqiang Zhao, Zhongjiang Yao, Yepeng Yao, Liya Su, Zijing Fan, Xiaodan Zhang, Zhengwei Jiang. 06 May 2024.
Beyond the Known: Investigating LLMs Performance on Out-of-Domain Intent Detection. Pei Wang, Keqing He, Yejie Wang, Xiaoshuai Song, Yutao Mou, Jingang Wang, Yunsen Xian, Xunliang Cai, Weiran Xu. 27 Feb 2024.
ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs. Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, Radha Poovendran. 19 Feb 2024.
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, ..., Nathaniel Li, Steven Basart, Bo Li, David A. Forsyth, Dan Hendrycks. [AAML]. 06 Feb 2024.
Intention Analysis Makes LLMs A Good Jailbreak Defender. Yuqi Zhang, Liang Ding, Lefei Zhang, Dacheng Tao. [LLMSV]. 12 Jan 2024.
Tree of Attacks: Jailbreaking Black-Box LLMs Automatically. Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, Amin Karbasi. 04 Dec 2023.
A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily. Peng Ding, Jun Kuang, Dan Ma, Xuezhi Cao, Yunsen Xian, Jiajun Chen, Shujian Huang. [AAML]. 14 Nov 2023.
DeepInception: Hypnotize Large Language Model to Be Jailbreaker. Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, Bo Han. 06 Nov 2023.
AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models. Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, A. Nenkova, Tong Sun. [SILM, AAML]. 23 Oct 2023.
Large Language Models Meet Open-World Intent Discovery and Recognition: An Evaluation of ChatGPT. Xiaoshuai Song, Keqing He, Pei Wang, Guanting Dong, Yutao Mou, Jingang Wang, Yunsen Xian, Xunliang Cai, Weiran Xu. [LRM]. 16 Oct 2023.
Jailbreaking Black Box Large Language Models in Twenty Queries. Patrick Chao, Alexander Robey, Yan Sun, Hamed Hassani, George J. Pappas, Eric Wong. [AAML]. 12 Oct 2023.
Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations. Zeming Wei, Yifei Wang, Ang Li, Yichuan Mo, Yisen Wang. 10 Oct 2023.
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts. Jiahao Yu, Xingwei Lin, Zheng Yu, Xinyu Xing. [SILM]. 19 Sep 2023.
GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher. Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Pinjia He, Shuming Shi, Zhaopeng Tu. [SILM]. 12 Aug 2023.
Universal and Transferable Adversarial Attacks on Aligned Language Models. Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson. 27 Jul 2023.
Z-BERT-A: A Zero-Shot Pipeline for Unknown Intent Detection. Daniele Comi, Dimitrios Christofidellis, Pier Francesco Piazza, Matteo Manica. 15 Aug 2022.
Out-of-Scope Intent Detection with Self-Supervision and Discriminative Training. Li-Ming Zhan, Haowen Liang, Bo Liu, Lu Fan, Xiao-Ming Wu, Albert Y. S. Lam. [OODD]. 16 Jun 2021.
Efficient Intent Detection with Dual Sentence Encoders. I. Casanueva, Tadas Temčinas, D. Gerz, Matthew Henderson, Ivan Vulić. [VLM]. 10 Mar 2020.