Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space
14 February 2024
Leo Schwinn, David Dobre, Sophie Xhonneux, Gauthier Gidel, Stephan Günnemann
AAML

Papers citing "Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space"

36 / 86 papers shown

Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks
Vaidehi Patil, Peter Hase, Joey Tianyi Zhou
KELM, AAML
29 Sep 2023

RAIN: Your Language Models Can Align Themselves without Finetuning
Yuhui Li, Fangyun Wei, Jinjing Zhao, Chao Zhang, Hongyang R. Zhang
SILM
13 Sep 2023

Open Sesame! Universal Black Box Jailbreaking of Large Language Models
Raz Lapid, Ron Langberg, Moshe Sipper
AAML
04 Sep 2023

Baseline Defenses for Adversarial Attacks Against Aligned Language Models
Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping Yeh-Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, Tom Goldstein
AAML
01 Sep 2023

Identifying and Mitigating the Security Risks of Generative AI
Clark W. Barrett, Bradley L Boyd, Ellie Burzstein, Nicholas Carlini, Brad Chen, ..., Zulfikar Ramzan, Khawaja Shams, D. Song, Ankur Taly, Diyi Yang
SILM
28 Aug 2023

Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson
27 Jul 2023

Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, ..., Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom
AI4MH, ALM
18 Jul 2023

Are aligned neural networks adversarially aligned?
Nicholas Carlini, Milad Nasr, Christopher A. Choquette-Choo, Matthew Jagielski, Irena Gao, ..., Pang Wei Koh, Daphne Ippolito, Katherine Lee, Florian Tramèr, Ludwig Schmidt
AAML
26 Jun 2023

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, ..., Dacheng Li, Eric Xing, Haotong Zhang, Joseph E. Gonzalez, Ion Stoica
ALM, OSLM, ELM
09 Jun 2023

QLoRA: Efficient Finetuning of Quantized LLMs
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer
ALM
23 May 2023

Raising the Bar for Certified Adversarial Robustness with Diffusion Models
Thomas Altstidl, David Dobre, Björn Eskofier, Gauthier Gidel, Leo Schwinn
DiffM
17 May 2023

Ewald-based Long-Range Message Passing for Molecular Graphs
Arthur Kosmala, Johannes Gasteiger, Nicholas Gao, Stephan Günnemann
08 Mar 2023

Hard Prompts Made Easy: Gradient-Based Discrete Optimization for Prompt Tuning and Discovery
Yuxin Wen, Neel Jain, John Kirchenbauer, Micah Goldblum, Jonas Geiping, Tom Goldstein
VLM, DiffM
07 Feb 2023

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Deep Ganguli, Liane Lovitt, John Kernion, Amanda Askell, Yuntao Bai, ..., Nicholas Joseph, Sam McCandlish, C. Olah, Jared Kaplan, Jack Clark
23 Aug 2022

Improving Robustness against Real-World and Worst-Case Distribution Shifts through Decision Region Quantification
Leo Schwinn, Leon Bungert, A. Nguyen, René Raab, Falk Pulsmeyer, Doina Precup, Björn Eskofier, Dario Zanca
OOD
19 May 2022

Ab-Initio Potential Energy Surfaces by Pairing GNNs with Neural Wave Functions
Nicholas Gao, Stephan Günnemann
11 Oct 2021

Improved Text Classification via Contrastive Adversarial Training
Lin Pan, Chung-Wei Hang, Avirup Sil, Saloni Potdar
AAML
21 Jul 2021

Exploring Misclassifications of Robust Neural Networks to Enhance Adversarial Attacks
Leo Schwinn, René Raab, A. Nguyen, Dario Zanca, Bjoern M. Eskofier
AAML
21 May 2021

Gradient-based Adversarial Attacks against Text Transformers
Chuan Guo, Alexandre Sablayrolles, Hervé Jégou, Douwe Kiela
SILM
15 Apr 2021

CLIP: Cheap Lipschitz Training of Neural Networks
Leon Bungert, René Raab, Tim Roith, Leo Schwinn, Daniel Tenbrinck
23 Mar 2021

Identifying Untrustworthy Predictions in Neural Networks by Geometric Gradient Analysis
Leo Schwinn, A. Nguyen, René Raab, Leon Bungert, Daniel Tenbrinck, Dario Zanca, Martin Burger, Bjoern M. Eskofier
AAML
24 Feb 2021

System Design for a Data-driven and Explainable Customer Sentiment Monitor
A. Nguyen, Stefan Foerstel, Thomas Kittler, Andrey Kurzyukov, Leo Schwinn, ..., Tobias Hipp, Sun Da Jun, M. Schrapp, E. Rothgang, Bjoern M. Eskofier
11 Jan 2021

Extracting Training Data from Large Language Models
Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, ..., Tom B. Brown, D. Song, Ulfar Erlingsson, Alina Oprea, Colin Raffel
MLAU, SILM
14 Dec 2020

Dynamically Sampled Nonlocal Gradients for Stronger Adversarial Attacks
Leo Schwinn, An Nguyen, René Raab, Dario Zanca, Bjoern M. Eskofier, Daniel Tenbrinck, Martin Burger
AAML
05 Nov 2020

Time Matters: Time-Aware LSTMs for Predictive Business Process Monitoring
A. Nguyen, Srijeet Chatterjee, Sven Weinzierl, Leo Schwinn, Martin Matzner, Bjoern M. Eskofier
AI4TS
02 Oct 2020

DeBERTa: Decoding-enhanced BERT with Disentangled Attention
Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen
AAML
05 Jun 2020

Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, ..., Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei
BDL
28 May 2020

Adversarial Training for Large Neural Language Models
Xiaodong Liu, Hao Cheng, Pengcheng He, Weizhu Chen, Yu Wang, Hoifung Poon, Jianfeng Gao
AAML
20 Apr 2020

SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization
Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, T. Zhao
08 Nov 2019

FreeLB: Enhanced Adversarial Training for Natural Language Understanding
Chen Zhu, Yu Cheng, Zhe Gan, S. Sun, Tom Goldstein, Jingjing Liu
AAML
25 Sep 2019

Designing and Interpreting Probes with Control Tasks
John Hewitt, Percy Liang
08 Sep 2019

Deep learning for time series classification: a review
Hassan Ismail Fawaz, Germain Forestier, J. Weber, L. Idoumghar, Pierre-Alain Muller
AI4TS, AI4CE
12 Sep 2018

Adversarial Attacks on Neural Networks for Graph Data
Daniel Zügner, Amir Akbarnejad, Stephan Günnemann
GNN, AAML, OOD
21 May 2018

Towards Deep Learning Models Resistant to Adversarial Attacks
Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, Adrian Vladu
SILM, OOD
19 Jun 2017

Universal adversarial perturbations
Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, P. Frossard
AAML
26 Oct 2016

Explaining and Harnessing Adversarial Examples
Ian Goodfellow, Jonathon Shlens, Christian Szegedy
AAML, GAN
20 Dec 2014