Cited By
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
arXiv:2310.03693 · 5 October 2023
Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, Peter Henderson
Tags: SILM
Links: arXiv · PDF · HTML
Papers citing "Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!" (showing 46 of 396)

Safety of Multimodal Large Language Models on Images and Texts
Xin Liu, Yichen Zhu, Yunshi Lan, Chao Yang, Yu Qiao
01 Feb 2024

Weak-to-Strong Jailbreaking on Large Language Models
Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, William Yang Wang
30 Jan 2024

Black-Box Access is Insufficient for Rigorous AI Audits
Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, ..., Michael Gerovitch, David Bau, Max Tegmark, David M. Krueger, Dylan Hadfield-Menell
Tags: AAML
25 Jan 2024

InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance
Pengyu Wang, Dong Zhang, Linyang Li, Chenkun Tan, Xinghao Wang, Ke Ren, Botian Jiang, Xipeng Qiu
Tags: LLMSV
20 Jan 2024

R-Judge: Benchmarking Safety Risk Awareness for LLM Agents
Tongxin Yuan, Zhiwei He, Lingzhong Dong, Yiming Wang, Ruijie Zhao, ..., Binglin Zhou, Fangqi Li, Zhuosheng Zhang, Rui Wang, Gongshen Liu
Tags: ELM
18 Jan 2024

AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models
Dong Shu, Mingyu Jin, Suiyuan Zhu, Beichen Wang, Zihao Zhou, Chong Zhang, Yongfeng Zhang
Tags: ELM
17 Jan 2024

How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, Weiyan Shi
12 Jan 2024

Examining Forgetting in Continual Pre-training of Aligned Large Language Models
Chen An Li, Hung-Yi Lee
Tags: CLL, KELM
06 Jan 2024

Large Language Models for Social Networks: Applications, Challenges, and Solutions
Jingying Zeng, Richard Huang, Waleed Malik, Langxuan Yin, Bojan Babic, Danny Shacham, Xiao Yan, Jaewon Yang, Qi He
04 Jan 2024

A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, Rada Mihalcea
03 Jan 2024

Alleviating Hallucinations of Large Language Models through Induced Hallucinations
Yue Zhang, Leyang Cui, Wei Bi, Shuming Shi
Tags: HILM
25 Dec 2023

Hazards from Increasingly Accessible Fine-Tuning of Downloadable Foundation Models
Alan Chan, Ben Bucknall, Herbie Bradley, David M. Krueger
22 Dec 2023

MetaAID 2.5: A Secure Framework for Developing Metaverse Applications via Large Language Models
Hongyin Zhu
22 Dec 2023

Exploiting Novel GPT-4 APIs
Kellin Pelrine, Mohammad Taufeeque, Michał Zając, Euan McLean, Adam Gleave
Tags: SILM
21 Dec 2023

Learning and Forgetting Unsafe Examples in Large Language Models
Jiachen Zhao, Zhun Deng, David Madras, James Zou, Mengye Ren
Tags: MU, KELM, CLL
20 Dec 2023

Bypassing the Safety Training of Open-Source LLMs with Priming Attacks
Jason Vega, Isha Chaudhary, Changming Xu, Gagandeep Singh
Tags: AAML
19 Dec 2023

Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack
Yu Fu, Yufei Li, Wen Xiao, Cong Liu, Yue Dong
Tags: AAML
12 Dec 2023

Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs
Zhuo Zhang, Guangyu Shen, Guanhong Tao, Shuyang Cheng, Xiangyu Zhang
08 Dec 2023

LLM as OS, Agents as Apps: Envisioning AIOS, Agents and the AIOS-Agent Ecosystem
Yingqiang Ge, Yujie Ren, Wenyue Hua, Shuyuan Xu, Juntao Tan, Yongfeng Zhang
Tags: LLMAG
06 Dec 2023

The Philosopher's Stone: Trojaning Plugins of Large Language Models
Tian Dong, Minhui Xue, Guoxing Chen, Rayne Holland, Shaofeng Li, Yan Meng, Zhen Liu, Haojin Zhu
Tags: AAML
01 Dec 2023

Bergeron: Combating Adversarial Attacks through a Conscience-Based Alignment Framework
Matthew Pisano, Peter Ly, Abraham Sanders, Bingsheng Yao, Dakuo Wang, T. Strzalkowski, Mei Si
Tags: AAML
16 Nov 2023

Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections
Yuanpu Cao, Bochuan Cao, Jinghui Chen
15 Nov 2023

SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models
Bertie Vidgen, Nino Scherrer, Hannah Rose Kirk, Rebecca Qian, Anand Kannappan, Scott A. Hale, Paul Röttger
Tags: ALM, ELM
14 Nov 2023

Removing RLHF Protections in GPT-4 via Fine-Tuning
Qiusi Zhan, Richard Fang, R. Bindu, Akul Gupta, Tatsunori Hashimoto, Daniel Kang
Tags: MU, AAML
09 Nov 2023

Making Harmful Behaviors Unlearnable for Large Language Models
Xin Zhou, Yi Lu, Ruotian Ma, Tao Gui, Qi Zhang, Xuanjing Huang
Tags: MU
02 Nov 2023

Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game
Sam Toyer, Olivia Watkins, Ethan Mendes, Justin Svegliato, Luke Bailey, ..., Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, Stuart J. Russell
02 Nov 2023

BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B
Pranav M. Gade, Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish
Tags: ALM, AI4MH
31 Oct 2023

LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B
Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish
Tags: ALM
31 Oct 2023

The Janus Interface: How Fine-Tuning in Large Language Models Amplifies the Privacy Risks
Xiaoyi Chen, Siyuan Tang, Rui Zhu, Shijun Yan, Lei Jin, Zihao Wang, Liya Su, Zhikun Zhang, Xiaofeng Wang, Haixu Tang
Tags: AAML, PILM
24 Oct 2023

Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, Danqi Chen
Tags: AAML
10 Oct 2023

Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations
Zeming Wei, Yifei Wang, Ang Li, Yichuan Mo, Yisen Wang
10 Oct 2023

SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
Alexander Robey, Eric Wong, Hamed Hassani, George J. Pappas
Tags: AAML
05 Oct 2023

All Languages Matter: On the Multilingual Safety of Large Language Models
Wenxuan Wang, Zhaopeng Tu, Chang Chen, Youliang Yuan, Jen-tse Huang, Wenxiang Jiao, Michael R. Lyu
Tags: ALM, LRM
02 Oct 2023

Backdoor Attacks and Countermeasures in Natural Language Processing Models: A Comprehensive Security Review
Pengzhou Cheng, Zongru Wu, Wei Du, Haodong Zhao, Wei Lu, Gongshen Liu
Tags: SILM, AAML
12 Sep 2023

Decolonial AI Alignment: Openness, Viśeṣa-Dharma, and Including Excluded Knowledges
Kush R. Varshney
10 Sep 2023

Image Hijacks: Adversarial Images can Control Generative Models at Runtime
Luke Bailey, Euan Ong, Stuart J. Russell, Scott Emmons
Tags: VLM, MLLM
01 Sep 2023

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Stephen Casper, Xander Davies, Claudia Shi, T. Gilbert, Jérémy Scheurer, ..., Erdem Biyik, Anca Dragan, David M. Krueger, Dorsa Sadigh, Dylan Hadfield-Menell
Tags: ALM, OffRL
27 Jul 2023

AutoML in the Age of Large Language Models: Current Challenges, Future Opportunities and Risks
Alexander Tornede, Difan Deng, Theresa Eimer, Joseph Giovanelli, Aditya Mohan, ..., Sarah Segel, Daphne Theodorakopoulos, Tanja Tornede, Henning Wachsmuth, Marius Lindauer
13 Jun 2023

ChatCAD+: Towards a Universal and Reliable Interactive CAD using LLMs
Zihao Zhao, Sheng Wang, Jinchen Gu, Yitao Zhu, Lanzhuju Mei, Zixu Zhuang, Zhiming Cui, Qian Wang, Dinggang Shen
Tags: LM&MA
25 May 2023

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Deep Ganguli, Liane Lovitt, John Kernion, Amanda Askell, Yuntao Bai, ..., Nicholas Joseph, Sam McCandlish, C. Olah, Jared Kaplan, Jack Clark
23 Aug 2022

BppAttack: Stealthy and Efficient Trojan Attacks against Deep Neural Networks via Image Quantization and Contrastive Adversarial Learning
Zhenting Wang, Juan Zhai, Shiqing Ma
Tags: AAML
26 May 2022

Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
Tags: OSLM, ALM
04 Mar 2022

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason W. Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, F. Xia, Ed H. Chi, Quoc Le, Denny Zhou
Tags: LM&Ro, LRM, AI4CE, ReLM
28 Jan 2022

Towards Continual Knowledge Learning of Language Models
Joel Jang, Seonghyeon Ye, Sohee Yang, Joongbo Shin, Janghoon Han, Gyeonghun Kim, Stanley Jungkyu Choi, Minjoon Seo
Tags: CLL, KELM
07 Oct 2021

Entity-Based Knowledge Conflicts in Question Answering
Shayne Longpre, Kartik Perisetla, Anthony Chen, Nikhil Ramesh, Chris DuBois, Sameer Singh
Tags: HILM
10 Sep 2021

The Power of Scale for Parameter-Efficient Prompt Tuning
Brian Lester, Rami Al-Rfou, Noah Constant
Tags: VPVLM
18 Apr 2021