Is poisoning a real threat to LLM alignment? Maybe more so than you think

17 June 2024
Pankayaraj Pathmanathan, Souradip Chakraborty, Xiangyu Liu, Yongyuan Liang, Furong Huang
AAML

Papers citing "Is poisoning a real threat to LLM alignment? Maybe more so than you think"

19 papers shown

AdvBDGen: Adversarially Fortified Prompt-Specific Fuzzy Backdoor Generator Against LLM Alignment
Pankayaraj Pathmanathan, Udari Madhushani Sehwag, Michael-Andrei Panaitescu-Liess, Furong Huang
SILM, AAML · 15 Oct 2024

LESS: Selecting Influential Data for Targeted Instruction Tuning
Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, Danqi Chen
06 Feb 2024

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, Peter Henderson
SILM · 05 Oct 2023

Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafael Rafailov, Archit Sharma, E. Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn
ALM · 29 May 2023

TRAK: Attributing Model Behavior at Scale
Sung Min Park, Kristian Georgiev, Andrew Ilyas, Guillaume Leclerc, Aleksander Madry
TDI · 24 Mar 2023

BadGPT: Exploring Security Vulnerabilities of ChatGPT via Backdoor Attacks to InstructGPT
Jiawen Shi, Yixin Liu, Pan Zhou, Lichao Sun
SILM · 21 Feb 2023

Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons
Banghua Zhu, Jiantao Jiao, Michael I. Jordan
OffRL · 26 Jan 2023

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, ..., Jack Clark, Sam McCandlish, C. Olah, Benjamin Mann, Jared Kaplan
12 Apr 2022

Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
OSLM, ALM · 04 Mar 2022

LoRA: Low-Rank Adaptation of Large Language Models
J. E. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen
OffRL, AI4TS, AI4CE, ALM, AIMat · 17 Jun 2021

Antipodes of Label Differential Privacy: PATE and ALIBI
Mani Malek, Ilya Mironov, Karthik Prasad, I. Shilov, Florian Tramèr
07 Jun 2021

Solving Heterogeneous General Equilibrium Economic Models with Deep Reinforcement Learning
Edward W. Hill, M. Bardoscia, A. Turrell
31 Mar 2021

Be Careful about Poisoned Word Embeddings: Exploring the Vulnerability of the Embedding Layers in NLP Models
Wenkai Yang, Lei Li, Zhiyuan Zhang, Xuancheng Ren, Xu Sun, Bin He
SILM · 29 Mar 2021

Deep Anomaly Detection with Outlier Exposure
Dan Hendrycks, Mantas Mazeika, Thomas G. Dietterich
OODD · 11 Dec 2018

Spectral Signatures in Backdoor Attacks
Brandon Tran, Jerry Li, Aleksander Madry
AAML · 01 Nov 2018

Label Sanitization against Label Flipping Poisoning Attacks
Andrea Paudice, Luis Muñoz-González, Emil C. Lupu
AAML · 02 Mar 2018

Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning
Xinyun Chen, Chang-rui Liu, Yue Liu, Kimberly Lu, Basel Alomair
AAML, SILM · 15 Dec 2017

Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov
OffRL · 20 Jul 2017

Watch and Learn: Optimizing from Revealed Preferences Feedback
Aaron Roth, Jonathan R. Ullman, Zhiwei Steven Wu
04 Apr 2015