arXiv: 2406.12091
Is poisoning a real threat to LLM alignment? Maybe more so than you think
17 June 2024
Pankayaraj Pathmanathan, Souradip Chakraborty, Xiangyu Liu, Yongyuan Liang, Furong Huang
Tags: AAML
Papers citing "Is poisoning a real threat to LLM alignment? Maybe more so than you think" (11 of 11 papers shown)

| Title | Authors | Tags | Metrics | Date |
| --- | --- | --- | --- | --- |
| Understanding the Effects of RLHF on the Quality and Detectability of LLM-Generated Texts | Beining Xu, Arkaitz Zubiaga | DeLMO | 73 / 0 / 0 | 23 Mar 2025 |
| Towards Autonomous Reinforcement Learning for Real-World Robotic Manipulation with Large Language Models | Niccolò Turcato, Matteo Iovino, Aris Synodinos, Alberto Dalla Libera, R. Carli, Pietro Falco | LM&Ro | 43 / 0 / 0 | 06 Mar 2025 |
| Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs | Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, Owain Evans | AAML | 80 / 9 / 0 | 24 Feb 2025 |
| UNIDOOR: A Universal Framework for Action-Level Backdoor Attacks in Deep Reinforcement Learning | Oubo Ma, L. Du, Yang Dai, Chunyi Zhou, Qingming Li, Yuwen Pu, Shouling Ji | | 46 / 0 / 0 | 28 Jan 2025 |
| Gen-AI for User Safety: A Survey | Akshar Prabhu Desai, Tejasvi Ravi, Mohammad Luqman, Mohit Sharma, Nithya Kota, Pranjul Yadav | | 35 / 1 / 0 | 10 Nov 2024 |
| AdvBDGen: Adversarially Fortified Prompt-Specific Fuzzy Backdoor Generator Against LLM Alignment | Pankayaraj Pathmanathan, Udari Madhushani Sehwag, Michael-Andrei Panaitescu-Liess, Furong Huang | SILM, AAML | 43 / 0 / 0 | 15 Oct 2024 |
| PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning | Tingchen Fu, Mrinank Sharma, Philip Torr, Shay B. Cohen, David M. Krueger, Fazl Barez | AAML | 50 / 7 / 0 | 11 Oct 2024 |
| Recent Advances in Attack and Defense Approaches of Large Language Models | Jing Cui, Yishi Xu, Zhewei Huang, Shuchang Zhou, Jianbin Jiao, Junge Zhang | PILM, AAML | 57 / 1 / 0 | 05 Sep 2024 |
| LESS: Selecting Influential Data for Targeted Instruction Tuning | Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, Danqi Chen | | 82 / 186 / 0 | 06 Feb 2024 |
| Poisoning Language Models During Instruction Tuning | Alexander Wan, Eric Wallace, Sheng Shen, Dan Klein | SILM | 97 / 124 / 0 | 01 May 2023 |
| Training language models to follow instructions with human feedback | Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe | OSLM, ALM | 328 / 11,953 / 0 | 04 Mar 2022 |