arXiv:2406.17092
BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models
24 June 2024
Yi Zeng
Weiyu Sun
Tran Ngoc Huynh
Dawn Song
Bo Li
Ruoxi Jia
AAML
LLMSV
Papers citing "BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models"
11 / 11 papers shown
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
Zora Che
Stephen Casper
Robert Kirk
Anirudh Satheesh
Stewart Slocum
...
Zikui Cai
Bilal Chughtai
Y. Gal
Furong Huang
Dylan Hadfield-Menell
MU
AAML
ELM
109
3
0
03 Feb 2025
Mitigating the Backdoor Effect for Multi-Task Model Merging via Safety-Aware Subspace
Jinluan Yang
Anke Tang
Didi Zhu
Zhengyu Chen
Li Shen
Leilei Gan
MoMe
AAML
110
4
0
17 Oct 2024
Emulated Disalignment: Safety Alignment for Large Language Models May Backfire!
Zhanhui Zhou
Jie Liu
Zhichen Dong
Jiaheng Liu
Chao Yang
Wanli Ouyang
Yu Qiao
30
20
0
19 Feb 2024
How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
Yi Zeng
Hongpeng Lin
Jingwen Zhang
Diyi Yang
Ruoxi Jia
Weiyan Shi
60
284
0
12 Jan 2024
Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections
Yuanpu Cao
Bochuan Cao
Jinghui Chen
48
25
0
15 Nov 2023
BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B
Pranav M. Gade
Simon Lermen
Charlie Rogers-Smith
Jeffrey Ladish
ALM
AI4MH
43
23
0
31 Oct 2023
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou
Zifan Wang
Nicholas Carlini
Milad Nasr
J. Zico Kolter
Matt Fredrikson
160
1,376
0
27 Jul 2023
DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models
Boxin Wang
Weixin Chen
Hengzhi Pei
Chulin Xie
Mintong Kang
...
Zinan Lin
Yu Cheng
Sanmi Koyejo
D. Song
Bo Li
58
405
0
20 Jun 2023
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng
Wei-Lin Chiang
Ying Sheng
Siyuan Zhuang
Zhanghao Wu
...
Dacheng Li
Eric Xing
Haotong Zhang
Joseph E. Gonzalez
Ion Stoica
ALM
OSLM
ELM
233
4,186
0
09 Jun 2023
Adversarial Unlearning of Backdoors via Implicit Hypergradient
Yi Zeng
Si-An Chen
Won Park
Z. Morley Mao
Ming Jin
R. Jia
AAML
98
174
0
07 Oct 2021
AdaSpeech: Adaptive Text to Speech for Custom Voice
Mingjian Chen
Xu Tan
Bohan Li
Yanqing Liu
Tao Qin
Sheng Zhao
Tie-Yan Liu
VLM
DiffM
71
189
0
01 Mar 2021