arXiv:2401.17256
Weak-to-Strong Jailbreaking on Large Language Models
30 January 2024
Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, William Yang Wang
Papers citing "Weak-to-Strong Jailbreaking on Large Language Models" (44 papers)
A Domain-Based Taxonomy of Jailbreak Vulnerabilities in Large Language Models
Carlos Peláez-González, Andrés Herrera-Poyatos, Cristina Zuheros, David Herrera-Poyatos, Virilo Tejedor, F. Herrera
AAML · 21 · 0 · 0 · 07 Apr 2025

Robust Multi-Objective Controlled Decoding of Large Language Models
Seongho Son, William Bankes, Sangwoong Yoon, Shyam Sundhar Ramesh, Xiaohang Tang, Ilija Bogunovic
39 · 0 · 0 · 11 Mar 2025

Improving LLM Safety Alignment with Dual-Objective Optimization
Xuandong Zhao, Will Cai, Tianneng Shi, David Huang, Licong Lin, Song Mei, Dawn Song
AAML, MU · 67 · 1 · 0 · 05 Mar 2025

Bridging the Safety Gap: A Guardrail Pipeline for Trustworthy LLM Inferences
Shanshan Han, Salman Avestimehr, Chaoyang He
71 · 0 · 0 · 12 Feb 2025

Confidence Elicitation: A New Attack Vector for Large Language Models
Brian Formento, Chuan-Sheng Foo, See-Kiong Ng
AAML · 99 · 0 · 0 · 07 Feb 2025

Peering Behind the Shield: Guardrail Identification in Large Language Models
Ziqing Yang, Yixin Wu, Rui Wen, Michael Backes, Yang Zhang
60 · 1 · 0 · 03 Feb 2025

When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search
Xuan Chen, Yuzhou Nie, Wenbo Guo, Xiangyu Zhang
110 · 9 · 0 · 28 Jan 2025
FlippedRAG: Black-Box Opinion Manipulation Adversarial Attacks to Retrieval-Augmented Generation Models
Zhuo Chen, Y. Gong, Miaokun Chen, Haotan Liu, Qikai Cheng, Fan Zhang, Wei-Tsung Lu, Xiaozhong Liu, J. Liu, XiaoFeng Wang
AAML · 44 · 1 · 0 · 06 Jan 2025

Diversity Helps Jailbreak Large Language Models
Weiliang Zhao, Daniel Ben-Levi, Wei Hao, Junfeng Yang, Chengzhi Mao
AAML · 135 · 0 · 0 · 06 Nov 2024

BlackDAN: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models
Xinyuan Wang, Victor Shea-Jay Huang, Renmiao Chen, Hao Wang, C. Pan, Lei Sha, Minlie Huang
AAML · 23 · 2 · 0 · 13 Oct 2024

MACPO: Weak-to-Strong Alignment via Multi-Agent Contrastive Preference Optimization
Yougang Lyu, Lingyong Yan, Zihan Wang, Dawei Yin, Pengjie Ren, Maarten de Rijke, Z. Z. Ren
57 · 6 · 0 · 10 Oct 2024

Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level
Xinyi Zeng, Yuying Shang, Yutao Zhu, Jingyuan Zhang, Yu Tian
AAML · 121 · 2 · 0 · 09 Oct 2024

Attention Prompting on Image for Large Vision-Language Models
Runpeng Yu, Weihao Yu, Xinchao Wang
VLM · 37 · 6 · 0 · 25 Sep 2024
Recent Advances in Attack and Defense Approaches of Large Language Models
Jing Cui, Yishi Xu, Zhewei Huang, Shuchang Zhou, Jianbin Jiao, Junge Zhang
PILM, AAML · 54 · 1 · 0 · 05 Sep 2024

Learning Fine-Grained Grounded Citations for Attributed Large Language Models
Lei Huang, Xiaocheng Feng, Weitao Ma, Yuxuan Gu, Weihong Zhong, ..., Weijiang Yu, Weihua Peng, Duyu Tang, Dandan Tu, Bing Qin
HILM · 24 · 4 · 0 · 08 Aug 2024

Mission Impossible: A Statistical Perspective on Jailbreaking LLMs
Jingtong Su, Mingyu Lee, SangKeun Lee
38 · 8 · 0 · 02 Aug 2024

LLMmap: Fingerprinting For Large Language Models
Dario Pasquini, Evgenios M. Kornaropoulos, G. Ateniese
50 · 6 · 0 · 22 Jul 2024

Black-Box Opinion Manipulation Attacks to Retrieval-Augmented Generation of Large Language Models
Zhuo Chen, Jiawei Liu, Haotan Liu, Qikai Cheng, Fan Zhang, Wei Lu, Xiaozhong Liu
AAML · 34 · 6 · 0 · 18 Jul 2024

Jailbreak Attacks and Defenses Against Large Language Models: A Survey
Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, Qi Li
AAML · 34 · 80 · 0 · 05 Jul 2024

Decoding-Time Language Model Alignment with Multiple Objectives
Ruizhe Shi, Yifang Chen, Yushi Hu, Alisa Liu, Hannaneh Hajishirzi, Noah A. Smith, Simon Du
44 · 30 · 0 · 27 Jun 2024
Cascade Reward Sampling for Efficient Decoding-Time Alignment
Bolian Li, Yifan Wang, A. Grama, Ruqi Zhang
AI4TS · 47 · 9 · 0 · 24 Jun 2024

TorchOpera: A Compound AI System for LLM Safety
Shanshan Han, Yuhang Yao, Zijian Hu, Dimitris Stripelis, Zhaozhuo Xu, Chaoyang He
LLMAG · 36 · 0 · 0 · 16 Jun 2024

JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models
Delong Ran, Jinyuan Liu, Yichen Gong, Jingyi Zheng, Xinlei He, Tianshuo Cong, Anyu Wang
ELM · 42 · 10 · 0 · 13 Jun 2024

Safety Alignment Should Be Made More Than Just a Few Tokens Deep
Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, Peter Henderson
39 · 72 · 0 · 10 Jun 2024

AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens
Lin Lu, Hai Yan, Zenghui Yuan, Jiawen Shi, Wenqi Wei, Pin-Yu Chen, Pan Zhou
AAML · 44 · 8 · 0 · 06 Jun 2024

Bileve: Securing Text Provenance in Large Language Models Against Spoofing with Bi-level Signature
Tong Zhou, Xuandong Zhao, Xiaolin Xu, Shaolei Ren
30 · 6 · 0 · 04 Jun 2024

Improved Techniques for Optimization-Based Jailbreaking on Large Language Models
Xiaojun Jia, Tianyu Pang, Chao Du, Yihao Huang, Jindong Gu, Yang Liu, Xiaochun Cao, Min-Bin Lin
AAML · 44 · 22 · 0 · 31 May 2024
MoGU: A Framework for Enhancing Safety of Open-Sourced LLMs While Preserving Their Usability
Yanrui Du, Sendong Zhao, Danyang Zhao, Ming Ma, Yuhan Chen, Liangyu Huo, Qing Yang, Dongliang Xu, Bing Qin
24 · 6 · 0 · 23 May 2024

When LLMs Meet Cybersecurity: A Systematic Literature Review
Jie Zhang, Haoyu Bu, Hui Wen, Yu Chen, Lun Li, Hongsong Zhu
30 · 36 · 0 · 06 May 2024

Don't Say No: Jailbreaking LLM by Suppressing Refusal
Yukai Zhou, Wenjie Wang
AAML · 39 · 15 · 0 · 25 Apr 2024

Protecting Your LLMs with Information Bottleneck
Zichuan Liu, Zefan Wang, Linjie Xu, Jinyu Wang, Lei Song, Tianchun Wang, Chunlin Chen, Wei Cheng, Jiang Bian
KELM, AAML · 56 · 15 · 0 · 22 Apr 2024

AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs
Anselm Paulus, Arman Zharmagambetov, Chuan Guo, Brandon Amos, Yuandong Tian
AAML · 55 · 55 · 0 · 21 Apr 2024

AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs
Zeyi Liao, Huan Sun
AAML · 39 · 73 · 0 · 11 Apr 2024

How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries
Somnath Banerjee, Sayan Layek, Rima Hazra, Animesh Mukherjee
24 · 11 · 0 · 23 Feb 2024

How Susceptible are Large Language Models to Ideological Manipulation?
Kai Chen, Zihao He, Jun Yan, Taiwei Shi, Kristina Lerman
32 · 10 · 0 · 18 Feb 2024

Permute-and-Flip: An optimally stable and watermarkable decoder for LLMs
Xuandong Zhao, Lei Li, Yu-Xiang Wang
55 · 12 · 0 · 08 Feb 2024
Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models
Rima Hazra, Sayan Layek, Somnath Banerjee, Soujanya Poria
KELM · 26 · 17 · 0 · 19 Jan 2024

A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, Rada Mihalcea
68 · 95 · 0 · 03 Jan 2024

Jatmo: Prompt Injection Defense by Task-Specific Finetuning
Julien Piet, Maha Alrashed, Chawin Sitawarin, Sizhe Chen, Zeming Wei, Elizabeth Sun, Basel Alomair, David A. Wagner
AAML, SyDa · 75 · 52 · 0 · 29 Dec 2023

On the Safety of Open-Sourced Large Language Models: Does Alignment Really Prevent Them From Being Misused?
Hangfan Zhang, Zhimeng Guo, Huaisheng Zhu, Bochuan Cao, Lu Lin, Jinyuan Jia, Jinghui Chen, Di Wu
73 · 23 · 0 · 02 Oct 2023

Robustness Over Time: Understanding Adversarial Examples' Effectiveness on Longitudinal Versions of Large Language Models
Yugeng Liu, Tianshuo Cong, Zhengyu Zhao, Michael Backes, Yun Shen, Yang Zhang
AAML · 33 · 6 · 0 · 15 Aug 2023

Poisoning Language Models During Instruction Tuning
Alexander Wan, Eric Wallace, Sheng Shen, Dan Klein
SILM · 92 · 124 · 0 · 01 May 2023

Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
OSLM, ALM · 313 · 11,915 · 0 · 04 Mar 2022

Generating Natural Language Adversarial Examples
M. Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani B. Srivastava, Kai-Wei Chang
AAML · 245 · 914 · 0 · 21 Apr 2018