Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

7 December 2023
Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, Madian Khabsa
AI4MH

Papers citing "Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations"

36 / 336 papers shown
ImgTrojan: Jailbreaking Vision-Language Models with ONE Image
Xijia Tao, Shuai Zhong, Lei Li, Qi Liu, Lingpeng Kong
128 · 30 · 0
05 Mar 2024
Breaking Down the Defenses: A Comparative Survey of Attacks on Large Language Models
Arijit Ghosh Chowdhury, Md. Mofijul Islam, Vaibhav Kumar, F. H. Shezan, Vaibhav Kumar, Vinija Jain, Aman Chadha
AAML, PILM
84 · 34 · 0
03 Mar 2024
AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks
Yifan Zeng, Yiran Wu, Xiao Zhang, Huazheng Wang, Qingyun Wu
LLMAG, AAML
79 · 77 · 0
02 Mar 2024
Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-turn Dialogue
Zhenhong Zhou, Jiuyang Xiang, Haopeng Chen, Quan Liu, Zherui Li, Sen Su
102 · 25 · 0
27 Feb 2024
Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
Mikayel Samvelyan, Sharath Chandra Raparthy, Andrei Lupu, Eric Hambro, Aram H. Markosyan, ..., Minqi Jiang, Jack Parker-Holder, Jakob Foerster, Tim Rocktaschel, Roberta Raileanu
SyDa
117 · 88 · 0
26 Feb 2024
ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors
Zhexin Zhang, Yida Lu, Jingyuan Ma, Di Zhang, Rui Li, ..., Hao Sun, Lei Sha, Zhifang Sui, Hongning Wang, Minlie Huang
49 · 31 · 0
26 Feb 2024
Immunization against harmful fine-tuning attacks
Domenic Rosati, Jan Wehner, Kai Williams, Lukasz Bartoszcze, Jan Batzner, Hassan Sajjad, Frank Rudzicz
AAML
107 · 22 · 0
26 Feb 2024
Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing
Jiabao Ji, Bairu Hou, Alexander Robey, George J. Pappas, Hamed Hassani, Yang Zhang, Eric Wong, Shiyu Chang
AAML
102 · 51 · 0
25 Feb 2024
PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails
Neal Mangaokar, Ashish Hooda, Jihye Choi, Shreyas Chandrashekaran, Kassem Fawaz, Somesh Jha, Atul Prakash
AAML
84 · 37 · 0
24 Feb 2024
LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper
Daoyuan Wu, Shuaibao Wang, Yang Liu, Ning Liu
AAML
91 · 8 · 0
24 Feb 2024
Fine-Grained Detoxification via Instance-Level Prefixes for Large Language Models
Xin Yi, Linlin Wang, Xiaoling Wang, Liang He
MoMe
69 · 1 · 0
23 Feb 2024
GradSafe: Detecting Jailbreak Prompts for LLMs via Safety-Critical Gradient Analysis
Yueqi Xie, Minghong Fang, Renjie Pi, Neil Zhenqiang Gong
113 · 36 · 0
21 Feb 2024
Defending Jailbreak Prompts via In-Context Adversarial Game
Yujun Zhou, Yufei Han, Haomin Zhuang, Kehan Guo, Zhenwen Liang, Hongyan Bao, Xiangliang Zhang
LLMAG, AAML
112 · 14 · 0
20 Feb 2024
Emulated Disalignment: Safety Alignment for Large Language Models May Backfire!
Zhanhui Zhou, Jie Liu, Zhichen Dong, Jiaheng Liu, Chao Yang, Wanli Ouyang, Yu Qiao
96 · 22 · 0
19 Feb 2024
Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space
Leo Schwinn, David Dobre, Sophie Xhonneux, Gauthier Gidel, Stephan Gunnemann
AAML
141 · 49 · 0
14 Feb 2024
COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
Xing-ming Guo, Fangxu Yu, Huan Zhang, Lianhui Qin, Bin Hu
AAML
178 · 92 · 0
13 Feb 2024
SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models
Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, Wangmeng Zuo, Dahua Lin, Yu Qiao, Jing Shao
ELM
126 · 104 · 0
07 Feb 2024
The World of Generative AI: Deepfakes and Large Language Models
Alakananda Mitra, Saraju Mohanty, Elias Kougianos
50 · 8 · 0
06 Feb 2024
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, ..., Nathaniel Li, Steven Basart, Bo Li, David A. Forsyth, Dan Hendrycks
AAML
112 · 418 · 0
06 Feb 2024
Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science
Xiangru Tang, Qiao Jin, Kunlun Zhu, Tongxin Yuan, Yichi Zhang, ..., Jian Tang, Zhuosheng Zhang, Arman Cohan, Zhiyong Lu, Mark B. Gerstein
LLMAG, ELM
97 · 47 · 0
06 Feb 2024
Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
Yongshuo Zong, Ondrej Bohdal, Tingyang Yu, Yongxin Yang, Timothy M. Hospedales
VLM, MLLM
124 · 73 · 0
03 Feb 2024
Building Guardrails for Large Language Models
Yizhen Dong, Ronghui Mu, Gao Jin, Yi Qi, Jinwei Hu, Xingyu Zhao, Jie Meng, Wenjie Ruan, Xiaowei Huang
OffRL
119 · 32 · 0
02 Feb 2024
Security and Privacy Challenges of Large Language Models: A Survey
B. Das, M. H. Amini, Yanzhao Wu
PILM, ELM
128 · 144 · 0
30 Jan 2024
Weak-to-Strong Jailbreaking on Large Language Models
Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, William Y. Wang
136 · 62 · 0
30 Jan 2024
PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety
Zaibin Zhang, Yongting Zhang, Lijun Li, Hongzhi Gao, Lijun Wang, Huchuan Lu, Feng Zhao, Yu Qiao, Jing Shao
LLMAG
86 · 41 · 0
22 Jan 2024
R-Judge: Benchmarking Safety Risk Awareness for LLM Agents
Tongxin Yuan, Zhiwei He, Lingzhong Dong, Yiming Wang, Ruijie Zhao, ..., Binglin Zhou, Fangqi Li, Zhuosheng Zhang, Rui Wang, Gongshen Liu
ELM
116 · 86 · 0
18 Jan 2024
Crowdsourced Adaptive Surveys
Yamil Velez
20 · 1 · 0
16 Jan 2024
Malla: Demystifying Real-world Large Language Model Integrated Malicious Services
Zilong Lin, Jian Cui, Xiaojing Liao, Xiaofeng Wang
57 · 23 · 0
06 Jan 2024
Bypassing the Safety Training of Open-Source LLMs with Priming Attacks
Jason Vega, Isha Chaudhary, Changming Xu, Gagandeep Singh
AAML
83 · 24 · 0
19 Dec 2023
Tree of Attacks: Jailbreaking Black-Box LLMs Automatically
Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, Amin Karbasi
91 · 271 · 0
04 Dec 2023
MART: Improving LLM Safety with Multi-round Automatic Red-Teaming
Suyu Ge, Chunting Zhou, Rui Hou, Madian Khabsa, Yi-Chia Wang, Qifan Wang, Jiawei Han, Yuning Mao
AAML, LRM
83 · 104 · 0
13 Nov 2023
Jailbreaking Black Box Large Language Models in Twenty Queries
Patrick Chao, Alexander Robey, Yan Sun, Hamed Hassani, George J. Pappas, Eric Wong
AAML
155 · 709 · 0
12 Oct 2023
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
Alexander Robey, Eric Wong, Hamed Hassani, George J. Pappas
AAML
155 · 259 · 0
05 Oct 2023
Unlocking Bias Detection: Leveraging Transformer-Based Models for Content Analysis
Shaina Raza, Oluwanifemi Bamgbose, Veronica Chatrath, Shardul Ghuge, Yan Sidyakin, Abdullah Y. Muaad
61 · 13 · 0
30 Sep 2023
SafetyBench: Evaluating the Safety of Large Language Models
Zhexin Zhang, Leqi Lei, Lindong Wu, Rui Sun, Yongkang Huang, Chong Long, Xiao Liu, Xuanyu Lei, Jie Tang, Minlie Huang
LRM, LM&MA, ELM
121 · 112 · 0
13 Sep 2023
Helping the Helper: Supporting Peer Counselors via AI-Empowered Practice and Feedback
Shang-ling Hsu, Raj Sanjay Shah, Prathik Senthil, Zahra Ashktorab, Casey Dugan, Werner Geyer, Diyi Yang
116 · 24 · 0
15 May 2023