PrisonBreak: Jailbreaking Large Language Models with Fewer Than Twenty-Five Targeted Bit-flips
arXiv:2412.07192 · v2 (latest) · 10 December 2024
Zachary Coalson, Jeonghyun Woo, Shiyang Chen, Yu Sun, Lishan Yang, Prashant J. Nair, Bo Fang, Sanghyun Hong
Tags: AAML
Papers citing "PrisonBreak: Jailbreaking Large Language Models with Fewer Than Twenty-Five Targeted Bit-flips" (50 of 56 papers shown)
Refusal in Language Models Is Mediated by a Single Direction (17 Jun 2024). Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda. Citations: 217.
Gemma: Open Models Based on Gemini Research and Technology (13 Mar 2024). Gemma Team: Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, ..., Armand Joulin, Noah Fiedel, Evan Senter, Alek Andreev, Kathleen Kenealy. Tags: VLM, LLMAG. Citations: 506.
Stealing Part of a Production Language Model (11 Mar 2024). Nicholas Carlini, Daniel Paleka, Krishnamurthy Dvijotham, Thomas Steinke, Jonathan Hayase, ..., Arthur Conmy, Itay Yona, Eric Wallace, David Rolnick, Florian Tramèr. Tags: MLAU, AAML. Citations: 85.
AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks (02 Mar 2024). Yifan Zeng, Yiran Wu, Xiao Zhang, Huazheng Wang, Qingyun Wu. Tags: LLMAG, AAML. Citations: 77.
Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes (01 Mar 2024). Xiaomeng Hu, Pin-Yu Chen, Tsung-Yi Ho. Tags: AAML. Citations: 32.
Massive Activations in Large Language Models (27 Feb 2024). Mingjie Sun, Xinlei Chen, J. Zico Kolter, Zhuang Liu. Citations: 81.
Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement (23 Feb 2024). Heegyu Kim, Sehyun Yuk, Hyunsouk Cho. Tags: AAML. Citations: 21.
tinyBenchmarks: evaluating LLMs with fewer examples (22 Feb 2024). Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, Mikhail Yurochkin. Tags: ELM. Citations: 99.
GradSafe: Detecting Jailbreak Prompts for LLMs via Safety-Critical Gradient Analysis (21 Feb 2024). Yueqi Xie, Minghong Fang, Renjie Pi, Neil Zhenqiang Gong. Citations: 34.
SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding (14 Feb 2024). Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, Radha Poovendran. Tags: AAML. Citations: 111.
Attacking Large Language Models with Projected Gradient Descent (14 Feb 2024). Simon Geisler, Tom Wollschlager, M. H. I. Abdalla, Johannes Gasteiger, Stephan Günnemann. Tags: AAML, SILM. Citations: 61.
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal (06 Feb 2024). Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, ..., Nathaniel Li, Steven Basart, Bo Li, David A. Forsyth, Dan Hendrycks. Tags: AAML. Citations: 418.
On Prompt-Driven Safeguarding for Large Language Models (31 Jan 2024). Chujie Zheng, Fan Yin, Hao Zhou, Fandong Meng, Jie Zhou, Kai-Wei Chang, Minlie Huang, Nanyun Peng. Tags: AAML. Citations: 63.
Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks (30 Jan 2024). Andy Zhou, Bo Li, Haohan Wang. Tags: AAML. Citations: 87.
Intention Analysis Makes LLMs A Good Jailbreak Defender (12 Jan 2024). Yuqi Zhang, Liang Ding, Lefei Zhang, Dacheng Tao. Tags: LLMSV. Citations: 29.
EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism (08 Dec 2023). Yanxi Chen, Xuchen Pan, Yaliang Li, Bolin Ding, Jingren Zhou. Tags: LRM. Citations: 33.
Read Disturbance in High Bandwidth Memory: A Detailed Experimental Study on HBM2 DRAM Chips (23 Oct 2023). Ataberk Olgun, Majd Osseiran, A. G. Yaglikçi, Yahya Can Tugrul, Haocong Luo, Steve Rhyner, Behzad Salami, Juan Gómez Luna, Onur Mutlu. Citations: 10.
Safe RLHF: Safe Reinforcement Learning from Human Feedback (19 Oct 2023). Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, Yaodong Yang. Citations: 364.
Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding (09 Oct 2023). Sangmin Bae, Jongwoo Ko, Hwanjun Song, Se-Young Yun. Citations: 60.
Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models (04 Oct 2023). Xianjun Yang, Xiao Wang, Qi Zhang, Linda R. Petzold, William Y. Wang, Xun Zhao, Dahua Lin. Citations: 190.
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models (03 Oct 2023). Xiaogeng Liu, Nan Xu, Muhao Chen, Chaowei Xiao. Tags: SILM. Citations: 332.
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts (19 Sep 2023). Jiahao Yu, Xingwei Lin, Zheng Yu, Xinyu Xing. Tags: SILM. Citations: 352.
Detecting Language Model Attacks with Perplexity (27 Aug 2023). Gabriel Alon, Michael Kamfonas. Tags: AAML. Citations: 227.
One-bit Flip is All You Need: When Bit-flip Attack Meets Model Training (12 Aug 2023). Jianshuo Dong, Han Qiu, Yiming Li, Tianwei Zhang, Yuan-Fang Li, Zeqi Lai, Chao Zhang, Shutao Xia. Tags: AAML. Citations: 14.
Universal and Transferable Adversarial Attacks on Aligned Language Models (27 Jul 2023). Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson. Citations: 1,518.
Llama 2: Open Foundation and Fine-Tuned Chat Models (18 Jul 2023). Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, ..., Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom. Tags: AI4MH, ALM. Citations: 12,076.
Are aligned neural networks adversarially aligned? (26 Jun 2023). Nicholas Carlini, Milad Nasr, Christopher A. Choquette-Choo, Matthew Jagielski, Irena Gao, ..., Pang Wei Koh, Daphne Ippolito, Katherine Lee, Florian Tramèr, Ludwig Schmidt. Tags: AAML. Citations: 253.
Direct Preference Optimization: Your Language Model is Secretly a Reward Model (29 May 2023). Rafael Rafailov, Archit Sharma, E. Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn. Tags: ALM. Citations: 4,169.
LLaMA: Open and Efficient Foundation Language Models (27 Feb 2023). Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, ..., Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample. Tags: ALM, PILM. Citations: 13,472.
Aegis: Mitigating Targeted Bit-flip Attacks against Deep Neural Networks (27 Feb 2023). Jialai Wang, Ziyuan Zhang, Meiqi Wang, Han Qiu, Tianwei Zhang, Qi Li, Zongpeng Li, Tao Wei, Chao Zhang. Tags: AAML. Citations: 22.
TrojViT: Trojan Insertion in Vision Transformers (27 Aug 2022). Mengxin Zheng, Qian Lou, Lei Jiang. Citations: 56.
Versatile Weight Attack via Flipping Limited Bits (25 Jul 2022). Jiawang Bai, Baoyuan Wu, Zhifeng Li, Shutao Xia. Tags: AAML. Citations: 18.
The Privacy Onion Effect: Memorization is Relative (21 Jun 2022). Nicholas Carlini, Matthew Jagielski, Chiyuan Zhang, Nicolas Papernot, Andreas Terzis, Florian Tramèr. Tags: PILM, MIACV. Citations: 110.
Training language models to follow instructions with human feedback (04 Mar 2022). Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe. Tags: OSLM, ALM. Citations: 13,228.
Ethical and social risks of harm from Language Models (08 Dec 2021). Laura Weidinger, John F. J. Mellor, Maribeth Rauh, Conor Griffin, J. Uesato, ..., Lisa Anne Hendricks, William S. Isaac, Sean Legassick, G. Irving, Iason Gabriel. Tags: PILM. Citations: 1,044.
DeepSteal: Advanced Model Extractions Leveraging Efficient Weight Stealing in Memories (08 Nov 2021). Adnan Siraj Rakin, Md Hafizul Islam Chowdhuryy, Fan Yao, Deliang Fan. Tags: AAML, MIACV. Citations: 117.
HASHTAG: Hash Signatures for Online Detection of Fault-Injection Attacks on Deep Neural Networks (02 Nov 2021). Mojan Javaheripi, F. Koushanfar. Citations: 25.
Uncovering In-DRAM RowHammer Protection Mechanisms: A New Methodology, Custom RowHammer Patterns, and Implications (20 Oct 2021). Hasan Hassan, Yahya Can Tugrul, Jeremie S. Kim, V. V. D. Veen, Kaveh Razavi, O. Mutlu. Citations: 103.
Don't Knock! Rowhammer at the Backdoor of DNN Models (14 Oct 2021). M. Tol, Saad Islam, Andrew J. Adiletta, B. Sunar, Ziming Zhang. Tags: AAML. Citations: 17.
RA-BNN: Constructing Robust & Accurate Binary Neural Network to Simultaneously Defend Adversarial Bit-Flip Attack and Improve Accuracy (22 Mar 2021). Adnan Siraj Rakin, Li Yang, Jingtao Li, Fan Yao, C. Chakrabarti, Yu Cao, Jae-sun Seo, Deliang Fan. Tags: AAML, MQ. Citations: 27.
BlockHammer: Preventing RowHammer at Low Cost by Blacklisting Rapidly-Accessed DRAM Rows (11 Feb 2021). A. G. Yaglikçi, Minesh Patel, Jeremie S. Kim, Roknoddin Azizi, Ataberk Olgun, ..., Jisung Park, Konstantinos Kanellopoulos, Taha Shahroodi, Saugata Ghose, O. Mutlu. Citations: 145.
RADAR: Run-time Adversarial Weight Attack Detection and Accuracy Recovery (20 Jan 2021). Jingtao Li, Adnan Siraj Rakin, Zhezhi He, Deliang Fan, C. Chakrabarti. Tags: AAML. Citations: 42.
Language Models are Few-Shot Learners (28 May 2020). Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, ..., Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei. Tags: BDL. Citations: 42,463.
TRRespass: Exploiting the Many Sides of Target Row Refresh (03 Apr 2020). Pietro Frigo, Emanuele Vannacci, Hasan Hassan, V. V. D. Veen, O. Mutlu, Cristiano Giuffrida, H. Bos, Kaveh Razavi. Citations: 232.
DeepHammer: Depleting the Intelligence of Deep Neural Networks through Targeted Chain of Bit Flips (30 Mar 2020). Fan Yao, Adnan Siraj Rakin, Deliang Fan. Tags: AAML. Citations: 161.
A Primer in BERTology: What we know about how BERT works (27 Feb 2020). Anna Rogers, Olga Kovaleva, Anna Rumshisky. Tags: OffRL. Citations: 1,503.
PyTorch: An Imperative Style, High-Performance Deep Learning Library (03 Dec 2019). Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, ..., Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, Soumith Chintala. Tags: ODL. Citations: 42,677.
TBT: Targeted Neural Network Attack with Bit Trojan (10 Sep 2019). Adnan Siraj Rakin, Zhezhi He, Deliang Fan. Tags: AAML. Citations: 215.
High Accuracy and High Fidelity Extraction of Neural Networks (03 Sep 2019). Matthew Jagielski, Nicholas Carlini, David Berthelot, Alexey Kurakin, Nicolas Papernot. Tags: MLAU, MIACV. Citations: 381.
Terminal Brain Damage: Exposing the Graceless Degradation in Deep Neural Networks Under Hardware Fault Attacks (03 Jun 2019). Sanghyun Hong, Pietro Frigo, Yigitcan Kaya, Cristiano Giuffrida, Tudor Dumitras. Tags: AAML. Citations: 213.