MART: Improving LLM Safety with Multi-round Automatic Red-Teaming
Suyu Ge, Chunting Zhou, Rui Hou, Madian Khabsa, Yi-Chia Wang, Qifan Wang, Jiawei Han, Yuning Mao
arXiv:2311.07689 · 13 November 2023 · Tags: AAML, LRM

Papers citing "MART: Improving LLM Safety with Multi-round Automatic Red-Teaming" (20 of 70 shown)

Uncovering Safety Risks of Large Language Models through Concept Activation Vector
Zhihao Xu, Ruixuan Huang, Changyu Chen, Shuai Wang, Xiting Wang. Tags: LLMSV. 18 Apr 2024.

Self-Supervised Visual Preference Alignment
Ke Zhu, Liang Zhao, Zheng Ge, Xiangyu Zhang. 16 Apr 2024.

Learn to Disguise: Avoid Refusal Responses in LLM's Defense via a Multi-agent Attacker-Disguiser Game
Qianqiao Xu, Zhiliang Tian, Hongyan Wu, Zhen Huang, Yiping Song, Feng Liu, Dongsheng Li. Tags: LLMAG, AAML. 03 Apr 2024.

Machine Unlearning for Traditional Models and Large Language Models: A Short Survey
Yi Xu. Tags: AILaw, MU. 01 Apr 2024.

Securing Large Language Models: Threats, Vulnerabilities and Responsible Practices
Sara Abdali, Richard Anarfi, C. Barberan, Jia He. Tags: PILM. 19 Mar 2024.

Ensuring Safe and High-Quality Outputs: A Guideline Library Approach for Language Models
Yi Luo, Zheng-Wen Lin, Yuhao Zhang, Jiashuo Sun, Chen Lin, Chengjin Xu, Xiangdong Su, Yelong Shen, Jian Guo, Yeyun Gong. Tags: LM&MA, ELM, ALM, AI4TS. 18 Mar 2024.

A Safe Harbor for AI Evaluation and Red Teaming
Shayne Longpre, Sayash Kapoor, Kevin Klyman, Ashwin Ramaswami, Rishi Bommasani, ..., Daniel Kang, Sandy Pentland, Arvind Narayanan, Percy Liang, Peter Henderson. 07 Mar 2024.

Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
Mikayel Samvelyan, Sharath Chandra Raparthy, Andrei Lupu, Eric Hambro, Aram H. Markosyan, ..., Minqi Jiang, Jack Parker-Holder, Jakob Foerster, Tim Rocktaschel, Roberta Raileanu. Tags: SyDa. 26 Feb 2024.

Fast Adversarial Attacks on Language Models In One GPU Minute
Vinu Sankar Sadasivan, Shoumik Saha, Gaurang Sriramanan, Priyatham Kattakinda, Atoosa Malemir Chegini, S. Feizi. Tags: MIALM. 23 Feb 2024.

Defending Jailbreak Prompts via In-Context Adversarial Game
Yujun Zhou, Yufei Han, Haomin Zhuang, Kehan Guo, Zhenwen Liang, Hongyan Bao, Xiangliang Zhang. Tags: LLMAG, AAML. 20 Feb 2024.

ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, Radha Poovendran. 19 Feb 2024.

Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey
Zhichen Dong, Zhanhui Zhou, Chao Yang, Jing Shao, Yu Qiao. Tags: ELM. 14 Feb 2024.

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, ..., Nathaniel Li, Steven Basart, Bo Li, David A. Forsyth, Dan Hendrycks. Tags: AAML. 06 Feb 2024.

Red-Teaming for Generative AI: Silver Bullet or Security Theater?
Michael Feffer, Anusha Sinha, Wesley Hanwen Deng, Zachary Chase Lipton, Hoda Heidari. Tags: AAML. 29 Jan 2024.

Towards Conversational Diagnostic AI
Tao Tu, Anil Palepu, M. Schaekermann, Khaled Saab, Jan Freyberg, ..., Katherine Chou, Greg S. Corrado, Yossi Matias, Alan Karthikesalingam, Vivek Natarajan. Tags: AI4MH, LM&MA. 11 Jan 2024.

MetaAID 2.5: A Secure Framework for Developing Metaverse Applications via Large Language Models
Hongyin Zhu. 22 Dec 2023.

Bypassing the Safety Training of Open-Source LLMs with Priming Attacks
Jason Vega, Isha Chaudhary, Changming Xu, Gagandeep Singh. Tags: AAML. 19 Dec 2023.

A Red Teaming Framework for Securing AI in Maritime Autonomous Systems
Mathew J. Walter, Aaron Barrett, Kimberly Tam. 08 Dec 2023.

Privacy in Large Language Models: Attacks, Defenses and Future Directions
Haoran Li, Yulin Chen, Jinglong Luo, Yan Kang, Xiaojin Zhang, Qi Hu, Chunkit Chan, Yangqiu Song. Tags: PILM. 16 Oct 2023.

On the Trustworthiness Landscape of State-of-the-art Generative Models: A Survey and Outlook
Mingyuan Fan, Chengyu Wang, Cen Chen, Yang Liu, Jun Huang. Tags: HILM. 31 Jul 2023.