Is poisoning a real threat to LLM alignment? Maybe more so than you think

17 June 2024
Pankayaraj Pathmanathan, Souradip Chakraborty, Xiangyu Liu, Yongyuan Liang, Furong Huang
AAML

Papers citing "Is poisoning a real threat to LLM alignment? Maybe more so than you think"

19 papers shown

AdvBDGen: Adversarially Fortified Prompt-Specific Fuzzy Backdoor Generator Against LLM Alignment
Pankayaraj Pathmanathan, Udari Madhushani Sehwag, Michael-Andrei Panaitescu-Liess, Furong Huang
SILM, AAML · 15 Oct 2024

LESS: Selecting Influential Data for Targeted Instruction Tuning
Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, Danqi Chen
06 Feb 2024

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, Peter Henderson
SILM · 05 Oct 2023

Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafael Rafailov, Archit Sharma, E. Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn
ALM · 29 May 2023

TRAK: Attributing Model Behavior at Scale
Sung Min Park, Kristian Georgiev, Andrew Ilyas, Guillaume Leclerc, Aleksander Madry
TDI · 24 Mar 2023

BadGPT: Exploring Security Vulnerabilities of ChatGPT via Backdoor Attacks to InstructGPT
Jiawen Shi, Yixin Liu, Pan Zhou, Lichao Sun
SILM · 21 Feb 2023

Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons
Banghua Zhu, Jiantao Jiao, Michael I. Jordan
OffRL · 26 Jan 2023

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, ..., Jack Clark, Sam McCandlish, C. Olah, Benjamin Mann, Jared Kaplan
12 Apr 2022

Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
OSLM, ALM · 04 Mar 2022

LoRA: Low-Rank Adaptation of Large Language Models
J. E. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen
OffRL, AI4TS, AI4CE, ALM, AIMat · 17 Jun 2021

Antipodes of Label Differential Privacy: PATE and ALIBI
Mani Malek, Ilya Mironov, Karthik Prasad, I. Shilov, Florian Tramèr
07 Jun 2021

Solving Heterogeneous General Equilibrium Economic Models with Deep Reinforcement Learning
Edward W. Hill, M. Bardoscia, A. Turrell
31 Mar 2021

Be Careful about Poisoned Word Embeddings: Exploring the Vulnerability of the Embedding Layers in NLP Models
Wenkai Yang, Lei Li, Zhiyuan Zhang, Xuancheng Ren, Xu Sun, Bin He
SILM · 29 Mar 2021

Deep Anomaly Detection with Outlier Exposure
Dan Hendrycks, Mantas Mazeika, Thomas G. Dietterich
OODD · 11 Dec 2018

Spectral Signatures in Backdoor Attacks
Brandon Tran, Jerry Li, Aleksander Madry
AAML · 01 Nov 2018

Label Sanitization against Label Flipping Poisoning Attacks
Andrea Paudice, Luis Muñoz-González, Emil C. Lupu
AAML · 02 Mar 2018

Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning
Xinyun Chen, Chang-rui Liu, Yue Liu, Kimberly Lu, Basel Alomair
AAML, SILM · 15 Dec 2017

Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov
OffRL · 20 Jul 2017

Watch and Learn: Optimizing from Revealed Preferences Feedback
Aaron Roth, Jonathan R. Ullman, Zhiwei Steven Wu
04 Apr 2015