Universal and Transferable Adversarial Attacks on Aligned Language Models

27 July 2023
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson
arXiv 2307.15043 (abs / PDF / HTML) · GitHub (3,937★)

Papers citing "Universal and Transferable Adversarial Attacks on Aligned Language Models"

50 / 1,101 papers shown
Large Language Model Alignment: A Survey
Tianhao Shen, Renren Jin, Yufei Huang, Chuang Liu, Weilong Dong, Zishan Guo, Xinwei Wu, Yan Liu, Deyi Xiong
LM&MA · 112 · 205 · 0 · 26 Sep 2023

Can LLM-Generated Misinformation Be Detected?
Canyu Chen, Kai Shu
DeLMO · 195 · 182 · 0 · 25 Sep 2023

ALLURE: Auditing and Improving LLM-based Evaluation of Text using Iterative In-Context-Learning
Hosein Hasanbeig, Hiteshi Sharma, Leo Betthauser, Felipe Vieira Frujeri, Ida Momennejad
110 · 16 · 0 · 24 Sep 2023

LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, ..., Zi Lin, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica, Haotong Zhang
126 · 221 · 0 · 21 Sep 2023

Knowledge Sanitization of Large Language Models
Yoichi Ishibashi, Hidetoshi Shimodaira
KELM · 129 · 25 · 0 · 21 Sep 2023

How Robust is Google's Bard to Adversarial Image Attacks?
Yinpeng Dong, Huanran Chen, Jiawei Chen, Zhengwei Fang, Xiaohu Yang, Yichi Zhang, Yu Tian, Hang Su, Jun Zhu
AAML · 118 · 116 · 0 · 21 Sep 2023

Model Leeching: An Extraction Attack Targeting LLMs
Lewis Birch, William Hackett, Stefan Trawicki, N. Suri, Peter Garraghan
83 · 13 · 0 · 19 Sep 2023

LLM Platform Security: Applying a Systematic Evaluation Framework to OpenAI's ChatGPT Plugins
Umar Iqbal, Tadayoshi Kohno, Franziska Roesner
ELM, SILM · 126 · 49 · 0 · 19 Sep 2023

GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
Jiahao Yu, Xingwei Lin, Zheng Yu, Xinyu Xing
SILM · 230 · 353 · 0 · 19 Sep 2023

Understanding Catastrophic Forgetting in Language Models via Implicit Inference
Suhas Kotha, Jacob Mitchell Springer, Aditi Raghunathan
CLL · 128 · 71 · 0 · 18 Sep 2023

Are You Worthy of My Trust?: A Socioethical Perspective on the Impacts of Trustworthy AI Systems on the Environment and Human Society
Jamell Dacon
SILM · 100 · 1 · 0 · 18 Sep 2023

Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM
Bochuan Cao, Yu Cao, Lu Lin, Jinghui Chen
AAML · 96 · 152 · 0 · 18 Sep 2023

Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions
Federico Bianchi, Mirac Suzgun, Giuseppe Attanasio, Paul Röttger, Dan Jurafsky, Tatsunori Hashimoto, James Zou
ALM, LM&MA, LRM · 96 · 219 · 0 · 14 Sep 2023

RAIN: Your Language Models Can Align Themselves without Finetuning
Yuhui Li, Fangyun Wei, Jinjing Zhao, Chao Zhang, Hongyang R. Zhang
SILM · 104 · 118 · 0 · 13 Sep 2023

Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics
Haoqin Tu, Bingchen Zhao, Chen Wei, Cihang Xie
MLLM · 69 · 15 · 0 · 13 Sep 2023

Using Reed-Muller Codes for Classification with Rejection and Recovery
Daniel Fentham, David Parker, Mark Ryan
45 · 0 · 0 · 12 Sep 2023

Down the Toxicity Rabbit Hole: A Novel Framework to Bias Audit Large Language Models
Arka Dutta, Adel Khorramrouz, Sujan Dutta, Ashiqur R. KhudaBukhsh
24 · 0 · 0 · 08 Sep 2023

Adapting Static Fairness to Sequential Decision-Making: Bias Mitigation Strategies towards Equal Long-term Benefit Rate
Yuancheng Xu, Chenghao Deng, Yanchao Sun, Ruijie Zheng, Xiyao Wang, Jieyu Zhao, Furong Huang
81 · 4 · 0 · 07 Sep 2023

Framework-Based Qualitative Analysis of Free Responses of Large Language Models: Algorithmic Fidelity
A. Amirova, T. Fteropoulli, Nafiso Ahmed, Martin R. Cowie, Joel Z Leibo
89 · 11 · 0 · 06 Sep 2023

Demystifying RCE Vulnerabilities in LLM-Integrated Apps
Tong Liu, Zizhuang Deng, Guozhu Meng, Yuekang Li, Kai Chen
SILM · 163 · 24 · 0 · 06 Sep 2023

Certifying LLM Safety against Adversarial Prompting
Aounon Kumar, Chirag Agarwal, Suraj Srinivas, Aaron Jiaxun Li, Soheil Feizi, Himabindu Lakkaraju
AAML · 153 · 197 · 0 · 06 Sep 2023

Open Sesame! Universal Black Box Jailbreaking of Large Language Models
Raz Lapid, Ron Langberg, Moshe Sipper
AAML · 135 · 112 · 0 · 04 Sep 2023

Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models
Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, ..., Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, Shuming Shi
RALM, LRM, HILM · 156 · 582 · 0 · 03 Sep 2023

Robust and Efficient Interference Neural Networks for Defending Against Adversarial Attacks in ImageNet
Yunuo Xiong, Shujuan Liu, H. Xiong
AAML · 39 · 0 · 0 · 03 Sep 2023

Baseline Defenses for Adversarial Attacks Against Aligned Language Models
Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping Yeh-Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, Tom Goldstein
AAML · 176 · 410 · 0 · 01 Sep 2023

Why do universal adversarial attacks work on large language models?: Geometry might be the answer
Varshini Subhash, Anna Bialas, Weiwei Pan, Finale Doshi-Velez
AAML · 91 · 11 · 0 · 01 Sep 2023

Image Hijacks: Adversarial Images can Control Generative Models at Runtime
Luke Bailey, Euan Ong, Stuart J. Russell, Scott Emmons
VLM, MLLM · 93 · 88 · 0 · 01 Sep 2023

Large language models in medicine: the potentials and pitfalls
J. Omiye, Haiwen Gui, Shawheen J. Rezaei, James Zou, Roxana Daneshjou
LM&MA · 97 · 81 · 0 · 31 Aug 2023

A Classification-Guided Approach for Adversarial Attacks against Neural Machine Translation
Sahar Sadrizadeh, Ljiljana Dolamic, P. Frossard
AAML, SILM · 83 · 2 · 0 · 29 Aug 2023

Identifying and Mitigating the Security Risks of Generative AI
Clark W. Barrett, Bradley L Boyd, Ellie Burzstein, Nicholas Carlini, Brad Chen, ..., Zulfikar Ramzan, Khawaja Shams, Basel Alomair, Ankur Taly, Diyi Yang
SILM · 125 · 101 · 0 · 28 Aug 2023

AI Deception: A Survey of Examples, Risks, and Potential Solutions
Peter S. Park, Simon Goldstein, Aidan O'Gara, Michael Chen, Dan Hendrycks
79 · 162 · 0 · 28 Aug 2023

Detecting Language Model Attacks with Perplexity
Gabriel Alon, Michael Kamfonas
AAML · 136 · 229 · 0 · 27 Aug 2023

Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities
Maximilian Mozes, Xuanli He, Bennett Kleinberg, Lewis D. Griffin
87 · 87 · 0 · 24 Aug 2023

Adversarial Illusions in Multi-Modal Embeddings
Tingwei Zhang, Rishi Jha, Eugene Bagdasaryan, Vitaly Shmatikov
AAML · 138 · 11 · 0 · 22 Aug 2023

Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions
Pouya Pezeshkpour, Estevam R. Hruschka
LRM · 78 · 146 · 0 · 22 Aug 2023

Enhancing Adversarial Attacks: The Similar Target Method
Shuo Zhang, Ziruo Wang, Zikai Zhou, Huanran Chen
AAML · 98 · 1 · 0 · 21 Aug 2023

On the Adversarial Robustness of Multi-Modal Foundation Models
Christian Schlarmann, Matthias Hein
AAML · 180 · 107 · 0 · 21 Aug 2023

DUAW: Data-free Universal Adversarial Watermark against Stable Diffusion Customization
Xiaoyu Ye, Hao Huang, Jiaqi An, Yongtao Wang
WIGM · 85 · 23 · 0 · 19 Aug 2023

Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment
Rishabh Bhardwaj, Soujanya Poria
ELM · 108 · 159 · 0 · 18 Aug 2023

Position: Key Claims in LLM Research Have a Long Tail of Footnotes
Anna Rogers, A. Luccioni
160 · 21 · 0 · 14 Aug 2023

GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher
Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Pinjia He, Shuming Shi, Zhaopeng Tu
SILM · 121 · 285 · 0 · 12 Aug 2023

CLEVA: Chinese Language Models EVAluation Platform
Yanyang Li, Jianqiao Zhao, Duo Zheng, Zi-Yuan Hu, Zhi Chen, ..., Yongfeng Huang, Shijia Huang, Dahua Lin, Michael R. Lyu, Liwei Wang
ALM, ELM · 100 · 11 · 0 · 09 Aug 2023

"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
Xinyue Shen, Zhenpeng Chen, Michael Backes, Yun Shen, Yang Zhang
SILM · 165 · 302 · 0 · 07 Aug 2023

Why We Don't Have AGI Yet
Peter Voss, M. Jovanovic
VLM · 30 · 2 · 0 · 07 Aug 2023

XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, Dirk Hovy
ALM, ELM, AILaw · 120 · 154 · 0 · 02 Aug 2023

Getting from Generative AI to Trustworthy AI: What LLMs might learn from Cyc
D. Lenat, G. Marcus
67 · 30 · 0 · 31 Jul 2023

Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models
Erfan Shayegani, Yue Dong, Nael B. Abu-Ghazaleh
119 · 152 · 0 · 26 Jul 2023

Effective Prompt Extraction from Language Models
Yiming Zhang, Nicholas Carlini, Daphne Ippolito
MIACV, SILM · 105 · 43 · 0 · 13 Jul 2023

Visual Adversarial Examples Jailbreak Aligned Large Language Models
Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi Wang, Prateek Mittal
AAML · 120 · 172 · 0 · 22 Jun 2023

Explore, Establish, Exploit: Red Teaming Language Models from Scratch
Stephen Casper, Jason Lin, Joe Kwon, Gatlen Culp, Dylan Hadfield-Menell
AAML · 60 · 99 · 0 · 15 Jun 2023