PrisonBreak: Jailbreaking Large Language Models with Fewer Than Twenty-Five Targeted Bit-flips
arXiv:2412.07192 · 10 December 2024 · AAML
Zachary Coalson, Jeonghyun Woo, Shiyang Chen, Yu Sun, Lishan Yang, Prashant J. Nair, Bo Fang, Sanghyun Hong
Papers citing "PrisonBreak: Jailbreaking Large Language Models with Fewer Than Twenty-Five Targeted Bit-flips" (50 of 56 shown)

Refusal in Language Models Is Mediated by a Single Direction
Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda
17 Jun 2024 · 155 · 217 · 0

Gemma: Open Models Based on Gemini Research and Technology
Gemma Team: Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, ..., Armand Joulin, Noah Fiedel, Evan Senter, Alek Andreev, Kathleen Kenealy
VLM, LLMAG · 13 Mar 2024 · 229 · 506 · 0

Stealing Part of a Production Language Model
Nicholas Carlini, Daniel Paleka, Krishnamurthy Dvijotham, Thomas Steinke, Jonathan Hayase, ..., Arthur Conmy, Itay Yona, Eric Wallace, David Rolnick, Florian Tramèr
MLAU, AAML · 11 Mar 2024 · 58 · 85 · 0

AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks
Yifan Zeng, Yiran Wu, Xiao Zhang, Huazheng Wang, Qingyun Wu
LLMAG, AAML · 02 Mar 2024 · 64 · 77 · 0

Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes
Xiaomeng Hu, Pin-Yu Chen, Tsung-Yi Ho
AAML · 01 Mar 2024 · 61 · 32 · 0

Massive Activations in Large Language Models
Mingjie Sun, Xinlei Chen, J. Zico Kolter, Zhuang Liu
27 Feb 2024 · 116 · 81 · 0

Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement
Heegyu Kim, Sehyun Yuk, Hyunsouk Cho
AAML · 23 Feb 2024 · 53 · 21 · 0

tinyBenchmarks: evaluating LLMs with fewer examples
Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, Mikhail Yurochkin
ELM · 22 Feb 2024 · 100 · 99 · 0

GradSafe: Detecting Jailbreak Prompts for LLMs via Safety-Critical Gradient Analysis
Yueqi Xie, Minghong Fang, Renjie Pi, Neil Zhenqiang Gong
21 Feb 2024 · 113 · 34 · 0

SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, Radha Poovendran
AAML · 14 Feb 2024 · 184 · 111 · 0

Attacking Large Language Models with Projected Gradient Descent
Simon Geisler, Tom Wollschlager, M. H. I. Abdalla, Johannes Gasteiger, Stephan Günnemann
AAML, SILM · 14 Feb 2024 · 130 · 61 · 0

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, ..., Nathaniel Li, Steven Basart, Bo Li, David A. Forsyth, Dan Hendrycks
AAML · 06 Feb 2024 · 108 · 418 · 0

On Prompt-Driven Safeguarding for Large Language Models
Chujie Zheng, Fan Yin, Hao Zhou, Fandong Meng, Jie Zhou, Kai-Wei Chang, Minlie Huang, Nanyun Peng
AAML · 31 Jan 2024 · 122 · 63 · 0

Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
Andy Zhou, Bo Li, Haohan Wang
AAML · 30 Jan 2024 · 119 · 87 · 0

Intention Analysis Makes LLMs A Good Jailbreak Defender
Yuqi Zhang, Liang Ding, Lefei Zhang, Dacheng Tao
LLMSV · 12 Jan 2024 · 63 · 29 · 0

EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism
Yanxi Chen, Xuchen Pan, Yaliang Li, Bolin Ding, Jingren Zhou
LRM · 08 Dec 2023 · 95 · 33 · 0

Read Disturbance in High Bandwidth Memory: A Detailed Experimental Study on HBM2 DRAM Chips
Ataberk Olgun, Majd Osseiran, A. G. Yaglikçi, Yahya Can Tugrul, Haocong Luo, Steve Rhyner, Behzad Salami, Juan Gómez Luna, Onur Mutlu
23 Oct 2023 · 42 · 10 · 0

Safe RLHF: Safe Reinforcement Learning from Human Feedback
Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, Yaodong Yang
19 Oct 2023 · 128 · 364 · 0

Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding
Sangmin Bae, Jongwoo Ko, Hwanjun Song, SeYoung Yun
09 Oct 2023 · 75 · 60 · 0

Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models
Xianjun Yang, Xiao Wang, Qi Zhang, Linda R. Petzold, William Y. Wang, Xun Zhao, Dahua Lin
04 Oct 2023 · 81 · 190 · 0

AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
Xiaogeng Liu, Nan Xu, Muhao Chen, Chaowei Xiao
SILM · 03 Oct 2023 · 90 · 332 · 0

GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
Jiahao Yu, Xingwei Lin, Zheng Yu, Xinyu Xing
SILM · 19 Sep 2023 · 213 · 352 · 0

Detecting Language Model Attacks with Perplexity
Gabriel Alon, Michael Kamfonas
AAML · 27 Aug 2023 · 118 · 227 · 0

One-bit Flip is All You Need: When Bit-flip Attack Meets Model Training
Jianshuo Dong, Han Qiu, Yiming Li, Tianwei Zhang, Yuan-Fang Li, Zeqi Lai, Chao Zhang, Shutao Xia
AAML · 12 Aug 2023 · 63 · 14 · 0

Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson
27 Jul 2023 · 295 · 1,518 · 0

Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, ..., Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom
AI4MH, ALM · 18 Jul 2023 · 413 · 12,076 · 0

Are aligned neural networks adversarially aligned?
Nicholas Carlini, Milad Nasr, Christopher A. Choquette-Choo, Matthew Jagielski, Irena Gao, ..., Pang Wei Koh, Daphne Ippolito, Katherine Lee, Florian Tramèr, Ludwig Schmidt
AAML · 26 Jun 2023 · 79 · 253 · 0

Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafael Rafailov, Archit Sharma, E. Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn
ALM · 29 May 2023 · 389 · 4,169 · 0

LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, ..., Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample
ALM, PILM · 27 Feb 2023 · 1.5K · 13,472 · 0

Aegis: Mitigating Targeted Bit-flip Attacks against Deep Neural Networks
Jialai Wang, Ziyuan Zhang, Meiqi Wang, Han Qiu, Tianwei Zhang, Qi Li, Zongpeng Li, Tao Wei, Chao Zhang
AAML · 27 Feb 2023 · 74 · 22 · 0

TrojViT: Trojan Insertion in Vision Transformers
Mengxin Zheng, Qian Lou, Lei Jiang
27 Aug 2022 · 152 · 56 · 0

Versatile Weight Attack via Flipping Limited Bits
Jiawang Bai, Baoyuan Wu, Zhifeng Li, Shutao Xia
AAML · 25 Jul 2022 · 53 · 18 · 0

The Privacy Onion Effect: Memorization is Relative
Nicholas Carlini, Matthew Jagielski, Chiyuan Zhang, Nicolas Papernot, Andreas Terzis, Florian Tramèr
PILM, MIACV · 21 Jun 2022 · 123 · 110 · 0

Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
OSLM, ALM · 04 Mar 2022 · 891 · 13,228 · 0

Ethical and social risks of harm from Language Models
Laura Weidinger, John F. J. Mellor, Maribeth Rauh, Conor Griffin, J. Uesato, ..., Lisa Anne Hendricks, William S. Isaac, Sean Legassick, G. Irving, Iason Gabriel
PILM · 08 Dec 2021 · 128 · 1,044 · 0

DeepSteal: Advanced Model Extractions Leveraging Efficient Weight Stealing in Memories
Adnan Siraj Rakin, Md Hafizul Islam Chowdhuryy, Fan Yao, Deliang Fan
AAML, MIACV · 08 Nov 2021 · 79 · 117 · 0

HASHTAG: Hash Signatures for Online Detection of Fault-Injection Attacks on Deep Neural Networks
Mojan Javaheripi, F. Koushanfar
02 Nov 2021 · 71 · 25 · 0

Uncovering In-DRAM RowHammer Protection Mechanisms: A New Methodology, Custom RowHammer Patterns, and Implications
Hasan Hassan, Yahya Can Tugrul, Jeremie S. Kim, V. V. D. Veen, Kaveh Razavi, O. Mutlu
20 Oct 2021 · 89 · 103 · 0

Don't Knock! Rowhammer at the Backdoor of DNN Models
M. Tol, Saad Islam, Andrew J. Adiletta, B. Sunar, Ziming Zhang
AAML · 14 Oct 2021 · 77 · 17 · 0

RA-BNN: Constructing Robust & Accurate Binary Neural Network to Simultaneously Defend Adversarial Bit-Flip Attack and Improve Accuracy
Adnan Siraj Rakin, Li Yang, Jingtao Li, Fan Yao, C. Chakrabarti, Yu Cao, Jae-sun Seo, Deliang Fan
AAML, MQ · 22 Mar 2021 · 68 · 27 · 0

BlockHammer: Preventing RowHammer at Low Cost by Blacklisting Rapidly-Accessed DRAM Rows
A. G. Yaglikçi, Minesh Patel, Jeremie S. Kim, Roknoddin Azizi, Ataberk Olgun, ..., Jisung Park, Konstantinos Kanellopoulos, Taha Shahroodi, Saugata Ghose, O. Mutlu
11 Feb 2021 · 104 · 145 · 0

RADAR: Run-time Adversarial Weight Attack Detection and Accuracy Recovery
Jingtao Li, Adnan Siraj Rakin, Zhezhi He, Deliang Fan, C. Chakrabarti
AAML · 20 Jan 2021 · 59 · 42 · 0

Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, ..., Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei
BDL · 28 May 2020 · 904 · 42,463 · 0

TRRespass: Exploiting the Many Sides of Target Row Refresh
Pietro Frigo, Emanuele Vannacci, Hasan Hassan, V. V. D. Veen, O. Mutlu, Cristiano Giuffrida, H. Bos, Kaveh Razavi
03 Apr 2020 · 53 · 232 · 0

DeepHammer: Depleting the Intelligence of Deep Neural Networks through Targeted Chain of Bit Flips
Fan Yao, Adnan Siraj Rakin, Deliang Fan
AAML · 30 Mar 2020 · 93 · 161 · 0

A Primer in BERTology: What we know about how BERT works
Anna Rogers, Olga Kovaleva, Anna Rumshisky
OffRL · 27 Feb 2020 · 106 · 1,503 · 0

PyTorch: An Imperative Style, High-Performance Deep Learning Library
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, ..., Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, Soumith Chintala
ODL · 03 Dec 2019 · 568 · 42,677 · 0

TBT: Targeted Neural Network Attack with Bit Trojan
Adnan Siraj Rakin, Zhezhi He, Deliang Fan
AAML · 10 Sep 2019 · 59 · 215 · 0

High Accuracy and High Fidelity Extraction of Neural Networks
Matthew Jagielski, Nicholas Carlini, David Berthelot, Alexey Kurakin, Nicolas Papernot
MLAU, MIACV · 03 Sep 2019 · 81 · 381 · 0

Terminal Brain Damage: Exposing the Graceless Degradation in Deep Neural Networks Under Hardware Fault Attacks
Sanghyun Hong, Pietro Frigo, Yigitcan Kaya, Cristiano Giuffrida, Tudor Dumitras
AAML · 03 Jun 2019 · 56 · 213 · 0