arXiv: 2406.12091
Is poisoning a real threat to LLM alignment? Maybe more so than you think
17 June 2024
Pankayaraj Pathmanathan, Souradip Chakraborty, Xiangyu Liu, Yongyuan Liang, Furong Huang
Tags: AAML
Papers citing "Is poisoning a real threat to LLM alignment? Maybe more so than you think" (11 of 11 papers shown)

| Title | Authors | Tags | Metrics | Date |
| --- | --- | --- | --- | --- |
| Understanding the Effects of RLHF on the Quality and Detectability of LLM-Generated Texts | Beining Xu, Arkaitz Zubiaga | DeLMO | 73 / 0 / 0 | 23 Mar 2025 |
| Towards Autonomous Reinforcement Learning for Real-World Robotic Manipulation with Large Language Models | Niccolò Turcato, Matteo Iovino, Aris Synodinos, Alberto Dalla Libera, R. Carli, Pietro Falco | LM&Ro | 43 / 0 / 0 | 06 Mar 2025 |
| Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs | Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, Owain Evans | AAML | 80 / 9 / 0 | 24 Feb 2025 |
| UNIDOOR: A Universal Framework for Action-Level Backdoor Attacks in Deep Reinforcement Learning | Oubo Ma, L. Du, Yang Dai, Chunyi Zhou, Qingming Li, Yuwen Pu, Shouling Ji | | 46 / 0 / 0 | 28 Jan 2025 |
| Gen-AI for User Safety: A Survey | Akshar Prabhu Desai, Tejasvi Ravi, Mohammad Luqman, Mohit Sharma, Nithya Kota, Pranjul Yadav | | 35 / 1 / 0 | 10 Nov 2024 |
| AdvBDGen: Adversarially Fortified Prompt-Specific Fuzzy Backdoor Generator Against LLM Alignment | Pankayaraj Pathmanathan, Udari Madhushani Sehwag, Michael-Andrei Panaitescu-Liess, Furong Huang | SILM, AAML | 43 / 0 / 0 | 15 Oct 2024 |
| PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning | Tingchen Fu, Mrinank Sharma, Philip Torr, Shay B. Cohen, David M. Krueger, Fazl Barez | AAML | 50 / 7 / 0 | 11 Oct 2024 |
| Recent Advances in Attack and Defense Approaches of Large Language Models | Jing Cui, Yishi Xu, Zhewei Huang, Shuchang Zhou, Jianbin Jiao, Junge Zhang | PILM, AAML | 57 / 1 / 0 | 05 Sep 2024 |
| LESS: Selecting Influential Data for Targeted Instruction Tuning | Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, Danqi Chen | | 82 / 186 / 0 | 06 Feb 2024 |
| Poisoning Language Models During Instruction Tuning | Alexander Wan, Eric Wallace, Sheng Shen, Dan Klein | SILM | 97 / 124 / 0 | 01 May 2023 |
| Training language models to follow instructions with human feedback | Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe | OSLM, ALM | 328 / 11,953 / 0 | 04 Mar 2022 |