Jailbreaking Black Box Large Language Models in Twenty Queries

12 October 2023
Patrick Chao
Alexander Robey
Yan Sun
Hamed Hassani
George J. Pappas
Eric Wong
    AAML
ArXiv (abs) | PDF | HTML

Papers citing "Jailbreaking Black Box Large Language Models in Twenty Queries"

50 / 196 papers shown
MIST: Jailbreaking Black-box Large Language Models via Iterative Semantic Tuning
Muyang Zheng
Yuanzhi Yao
C. D. Lin
Rui Wang
Meng Han
AAML, VLM
18
0
0
20 Jun 2025
Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models
Lei Jiang
Zixun Zhang
Zizhou Wang
Xiaobing Sun
Zhen Li
Liangli Zhen
Xiaohua Xu
AAML
17
0
0
20 Jun 2025
Probing the Robustness of Large Language Models Safety to Latent Perturbations
Tianle Gu
Kexin Huang
Zongqi Wang
Yixu Wang
Jie Li
Yuanqi Yao
Yang Yao
Yujiu Yang
Yan Teng
Yingchun Wang
AAML, LLMSV
31
0
0
19 Jun 2025
Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts
Kartik Sharma
Yiqiao Jin
Vineeth Rakesh
Yingtong Dou
Menghai Pan
Mahashweta Das
Srijan Kumar
AAML
18
0
0
18 Jun 2025
FORTRESS: Frontier Risk Evaluation for National Security and Public Safety
Christina Q. Knight
Kaustubh Deshpande
Ved Sirdeshmukh
Meher Mankikar
Scale Red Team
SEAL Research Team
Julian Michael
AAML, ELM
39
0
0
17 Jun 2025
Building Trustworthy AI by Addressing its 16+2 Desiderata with Goal-Directed Commonsense Reasoning
Alexis R. Tudor
Yankai Zeng
Huaduo Wang
Joaquín Arias
Gopal Gupta
LRM
17
0
0
15 Jun 2025
Improving Large Language Model Safety with Contrastive Representation Learning
Samuel Simko
Mrinmaya Sachan
Bernhard Schölkopf
Zhijing Jin
AAML
15
0
0
13 Jun 2025
Step-by-step Instructions and a Simple Tabular Output Format Improve the Dependency Parsing Accuracy of LLMs
Hiroshi Matsuda
Chunpeng Ma
Masayuki Asahara
94
0
0
11 Jun 2025
From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring
Yang Li
Qiang Sheng
Yehan Yang
Xueyao Zhang
Juan Cao
83
0
0
11 Jun 2025
Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models
Mickel Liu
L. Jiang
Yancheng Liang
S. Du
Yejin Choi
Tim Althoff
Natasha Jaques
AAML, LRM
24
0
0
09 Jun 2025
TwinBreak: Jailbreaking LLM Security Alignments based on Twin Prompts
T. Krauß
Hamid Dashtbani
Alexandra Dmitrienko
19
0
0
09 Jun 2025
Beyond Jailbreaks: Revealing Stealthier and Broader LLM Security Risks Stemming from Alignment Failures
Yukai Zhou
Sibei Yang
Wenjie Wang
AAML
17
0
0
09 Jun 2025
Enhancing the Safety of Medical Vision-Language Models by Synthetic Demonstrations
Zhiyu Xue
Reza Abbasi-Asl
Ramtin Pedarsani
AAML
25
0
0
08 Jun 2025
AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint
Leheng Sheng
Changshuo Shen
Weixiang Zhao
Junfeng Fang
Xiaohao Liu
Zhenkai Liang
Xiang Wang
An Zhang
Tat-Seng Chua
LLMSV
32
0
0
08 Jun 2025
Quality-Diversity Red-Teaming: Automated Generation of High-Quality and Diverse Attackers for Large Language Models
Ren-Jian Wang
Ke Xue
Zeyu Qin
Ziniu Li
Sheng Tang
Hao-Tian Li
Shengcai Liu
Chao Qian
AAML
20
0
0
08 Jun 2025
Automatic Robustness Stress Testing of LLMs as Mathematical Problem Solvers
Yutao Hou
Zeguan Xiao
Fei Yu
Yihan Jiang
Xuetao Wei
Hailiang Huang
Yun-Nung Chen
Guanhua Chen
LRM
111
0
0
05 Jun 2025
A Trustworthiness-based Metaphysics of Artificial Intelligence Systems
Andrea Ferrario
36
0
0
03 Jun 2025
Comprehensive Vulnerability Analysis is Necessary for Trustworthy LLM-MAS
Pengfei He
Yue Xing
Shen Dong
Juanhui Li
Zhenwei Dai
...
Hui Liu
Han Xu
Zhen Xiang
Charu C. Aggarwal
Hui Liu
LLMAG
86
0
0
02 Jun 2025
Align is not Enough: Multimodal Universal Jailbreak Attack against Multimodal Large Language Models
Youze Wang
Wenbo Hu
Yinpeng Dong
Jing Liu
Hanwang Zhang
Richang Hong
67
2
0
02 Jun 2025
The Security Threat of Compressed Projectors in Large Vision-Language Models
Yudong Zhang
Ruobing Xie
Xingwu Sun
Jiansheng Chen
Zhanhui Kang
Di Wang
Yu Wang
21
0
0
31 May 2025
Existing Large Language Model Unlearning Evaluations Are Inconclusive
Zhili Feng
Yixuan Even Xu
Alexander Robey
Robert Kirk
Xander Davies
Yarin Gal
Avi Schwarzschild
J. Zico Kolter
MU, ELM
35
0
0
31 May 2025
LLM Agents Should Employ Security Principles
Kaiyuan Zhang
Zian Su
Pin-Yu Chen
E. Bertino
Xiangyu Zhang
Ninghui Li
LLMAG
Presented at ResearchTrend Connect | LLMAG on 02 Jul 2025
102
1
0
29 May 2025
MEF: A Capability-Aware Multi-Encryption Framework for Evaluating Vulnerabilities in Black-Box Large Language Models
Mingyu Yu
Wei Wang
Y. X. Wei
Sujuan Qin
Fei Gao
Wenmin Li
AAML
42
0
0
29 May 2025
Position: Federated Foundation Language Model Post-Training Should Focus on Open-Source Models
Nikita Agrawal
Simon Mertel
R. Mayer
78
0
0
29 May 2025
SC-LoRA: Balancing Efficient Fine-tuning and Knowledge Preservation via Subspace-Constrained LoRA
Minrui Luo
Fuhang Kuang
Yu Wang
Zirui Liu
Tianxing He
CLL
62
0
0
29 May 2025
Breaking the Ceiling: Exploring the Potential of Jailbreak Attacks through Expanding Strategy Space
Yao Huang
Yitong Sun
Shouwei Ruan
Yichi Zhang
Yinpeng Dong
Xingxing Wei
AAML
60
0
0
27 May 2025
Improved Representation Steering for Language Models
Zhengxuan Wu
Qinan Yu
Aryaman Arora
Christopher D. Manning
Christopher Potts
LLMSV
76
0
0
27 May 2025
Benign-to-Toxic Jailbreaking: Inducing Harmful Responses from Harmless Prompts
H. Kim
Minbeom Kim
Wonjun Lee
Kihyun Kim
Changick Kim
36
0
0
26 May 2025
SGM: A Framework for Building Specification-Guided Moderation Filters
M. Fatehkia
Enes Altinisik
Husrev Taha Sencar
51
1
0
26 May 2025
Lifelong Safety Alignment for Language Models
Haoyu Wang
Zeyu Qin
Yifei Zhao
C. Du
Min Lin
Xueqian Wang
Tianyu Pang
KELM, CLL
70
1
0
26 May 2025
What Really Matters in Many-Shot Attacks? An Empirical Study of Long-Context Vulnerabilities in LLMs
Sangyeop Kim
Yohan Lee
Yongwoo Song
Kimin Lee
AAML
34
0
0
26 May 2025
Mitigating Deceptive Alignment via Self-Monitoring
Jiaming Ji
Wenqi Chen
Kaile Wang
Donghai Hong
Sitong Fang
...
Jiayi Zhou
Juntao Dai
Sirui Han
Yike Guo
Yaodong Yang
LRM
57
2
0
24 May 2025
Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation
Jun Zhuang
Haibo Jin
Ye Zhang
Zhengjian Kang
Wenbin Zhang
Gaby G. Dagher
Haohan Wang
AAML
84
0
0
24 May 2025
Reality Check: A New Evaluation Ecosystem Is Necessary to Understand AI's Real World Effects
Reva Schwartz
Rumman Chowdhury
Akash Kundu
Heather Frase
Marzieh Fadaee
...
Andrew Thompson
Maya Carlyle
Qinghua Lu
Matthew Holmes
Theodora Skeadas
71
0
0
24 May 2025
Safety Alignment via Constrained Knowledge Unlearning
Zesheng Shi
Yucheng Zhou
Jing Li
MU, KELM, AAML
82
2
0
24 May 2025
One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs
Linbao Li
Y. Liu
Daojing He
Yu Li
AAML
119
0
0
23 May 2025
Does Chain-of-Thought Reasoning Really Reduce Harmfulness from Jailbreaking?
Chengda Lu
Xiaoyu Fan
Yu Huang
Rongwu Xu
Jijie Li
Wei Xu
LRM
68
0
0
23 May 2025
Chain-of-Lure: A Synthetic Narrative-Driven Approach to Compromise Large Language Models
Wenhan Chang
Tianqing Zhu
Yu Zhao
Shuangyong Song
Ping Xiong
Wanlei Zhou
Yongxiang Li
85
0
0
23 May 2025
Understanding Pre-training and Fine-tuning from Loss Landscape Perspectives
Huanran Chen
Yinpeng Dong
Zeming Wei
Yao Huang
Yichi Zhang
Hang Su
Jun Zhu
MoMe
94
1
0
23 May 2025
Finetuning-Activated Backdoors in LLMs
Thibaud Gloaguen
Mark Vero
Robin Staab
Martin Vechev
AAML
204
0
0
22 May 2025
How Should We Enhance the Safety of Large Reasoning Models: An Empirical Study
Zhexin Zhang
Xian Qi Loye
Victor Shea-Jay Huang
Junxiao Yang
Qi Zhu
...
Fei Mi
Lifeng Shang
Yingkang Wang
Hongning Wang
Minlie Huang
LRM
84
0
0
21 May 2025
Exploring Jailbreak Attacks on LLMs through Intent Concealment and Diversion
Tiehan Cui
Yanxu Mao
Peipei Liu
Congying Liu
Datao You
AAML
61
1
0
20 May 2025
SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment
Wonje Jeung
Sangyeon Yoon
Minsuk Kahng
Albert No
LRM, LLMSV
200
1
0
20 May 2025
AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models
Guangke Chen
Fu Song
Zhe Zhao
Xiaojun Jia
Yang Liu
Yanchen Qiao
Weizhe Zhang
AuLLM, AAML
115
1
0
20 May 2025
Improving LLM Outputs Against Jailbreak Attacks with Expert Model Integration
Tatia Tsmindashvili
Ana Kolkhidashvili
Dachi Kurtskhalia
Nino Maghlakelidze
Elene Mekvabishvili
Guram Dentoshvili
Orkhan Shamilov
Zaal Gachechiladze
Steven Saporta
David Dachi Choladze
185
0
0
18 May 2025
JULI: Jailbreak Large Language Models by Self-Introspection
Jesson Wang
Zhanhao Hu
David Wagner
114
0
0
17 May 2025
Safe Delta: Consistently Preserving Safety when Fine-Tuning LLMs on Diverse Datasets
Ning Lu
Shengcai Liu
Jiahao Wu
Weiyu Chen
Zhirui Zhang
Yew-Soon Ong
Qi Wang
Ke Tang
106
3
0
17 May 2025
LARGO: Latent Adversarial Reflection through Gradient Optimization for Jailbreaking LLMs
Ran Li
Hao Wang
Chengzhi Mao
AAML
95
1
0
16 May 2025
PIG: Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context Optimization
Yidan Wang
Yanan Cao
Yubing Ren
Fang Fang
Zheng Lin
Binxing Fang
PILM
126
0
0
15 May 2025
One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models
Haoran Gu
Handing Wang
Yi Mei
Mengjie Zhang
Yaochu Jin
73
0
0
12 May 2025