2502.07557
Cited By
JBShield: Defending Large Language Models from Jailbreak Attacks through Activated Concept Analysis and Manipulation
11 February 2025
Shenyi Zhang
Yuchen Zhai
Keyan Guo
Hongxin Hu
Shengnan Guo
Zheng Fang
Lingchen Zhao
Chao Shen
Cong Wang
Qian Wang
AAML
Papers citing "JBShield: Defending Large Language Models from Jailbreak Attacks through Activated Concept Analysis and Manipulation"
12 / 12 papers shown
JailbreaksOverTime: Detecting Jailbreak Attacks Under Distribution Shift
Julien Piet
Xiao Huang
Dennis Jacob
Annabella Chow
Maha Alrashed
Geng Zhao
Zhanhao Hu
Chawin Sitawarin
Basel Alomair
David Wagner
AAML
85
0
0
28 Apr 2025
AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender
Weixiang Zhao
Jiahe Guo
Yulin Hu
Yang Deng
An Zhang
...
Xinyang Han
Yanyan Zhao
Bing Qin
Tat-Seng Chua
Ting Liu
AAML
LLMSV
78
2
0
13 Apr 2025
Output Constraints as Attack Surface: Exploiting Structured Generation to Bypass LLM Safety Mechanisms
Shuoming Zhang
Jiacheng Zhao
Ruiyuan Xu
Xiaobing Feng
Huimin Cui
AAML
58
3
0
31 Mar 2025
OR-Bench: An Over-Refusal Benchmark for Large Language Models
Justin Cui
Wei-Lin Chiang
Ion Stoica
Cho-Jui Hsieh
ALM
75
45
0
31 May 2024
Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
Maksym Andriushchenko
Francesco Croce
Nicolas Flammarion
AAML
121
186
0
02 Apr 2024
Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
Hakan Inan
Kartikeya Upasani
Jianfeng Chi
Rashi Rungta
Krithika Iyer
...
Michael Tontchev
Qing Hu
Brian Fuller
Davide Testuggine
Madian Khabsa
AI4MH
68
423
0
07 Dec 2023
Prompt Packer: Deceiving LLMs through Compositional Instruction with Hidden Attacks
Shuyu Jiang
Xingshu Chen
Rui Tang
56
24
0
16 Oct 2023
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
Alexander Robey
Eric Wong
Hamed Hassani
George J. Pappas
AAML
78
243
0
05 Oct 2023
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou
Zifan Wang
Nicholas Carlini
Milad Nasr
J. Zico Kolter
Matt Fredrikson
165
1,376
0
27 Jul 2023
GPT-4 Technical Report
OpenAI
Josh Achiam
Steven Adler
Sandhini Agarwal
Lama Ahmad
...
Shengjia Zhao
Tianhao Zheng
Juntang Zhuang
William Zhuk
Barret Zoph
LLMAG
MLLM
758
13,788
0
15 Mar 2023
Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai
Saurav Kadavath
Sandipan Kundu
Amanda Askell
John Kernion
...
Dario Amodei
Nicholas Joseph
Sam McCandlish
Tom B. Brown
Jared Kaplan
SyDa
MoMe
156
1,583
0
15 Dec 2022
Toy Models of Superposition
Nelson Elhage
Tristan Hume
Catherine Olsson
Nicholas Schiefer
T. Henighan
...
Sam McCandlish
Jared Kaplan
Dario Amodei
Martin Wattenberg
C. Olah
AAML
MILM
150
351
0
21 Sep 2022