Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context
Nilanjana Das, Edward Raff, Manas Gaur
arXiv:2407.14644 · AAML · 19 Jul 2024

Papers citing "Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context" (9 of 9 papers shown)

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion
AAML · 186 citations · 02 Apr 2024

ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings
Hao Wang, Hao Li, Minlie Huang, Lei Sha
AAML · 13 citations · 25 Feb 2024

How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, Weiyan Shi
284 citations · 12 Jan 2024

GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
Jiahao Yu, Xingwei Lin, Zheng Yu, Xinyu Xing
SILM · 330 citations · 19 Sep 2023

On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective
Jindong Wang, Xixu Hu, Wenxin Hou, Hao Chen, Runkai Zheng, ..., Weirong Ye, Xiubo Geng, Binxing Jiao, Yue Zhang, Xingxu Xie
AI4MH · 227 citations · 22 Feb 2023

RARR: Researching and Revising What Language Models Say, Using Language Models
Luyu Gao, Zhuyun Dai, Panupong Pasupat, Anthony Chen, Arun Tejasvi Chaganty, ..., Vincent Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, Kelvin Guu
HILM, KELM · 257 citations · 17 Oct 2022

Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment
Di Jin, Zhijing Jin, Qiufeng Wang, Peter Szolovits
SILM, AAML · 1,064 citations · 27 Jul 2019

TextBugger: Generating Adversarial Text Against Real-world Applications
Jinfeng Li, S. Ji, Tianyu Du, Bo Li, Ting Wang
SILM, AAML · 731 citations · 13 Dec 2018

Black-box Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers
Ji Gao, Jack Lanchantin, M. Soffa, Yanjun Qi
AAML · 716 citations · 13 Jan 2018