Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

7 December 2023
Hakan Inan
Kartikeya Upasani
Jianfeng Chi
Rashi Rungta
Krithika Iyer
Yuning Mao
Michael Tontchev
Qing Hu
Brian Fuller
Davide Testuggine
Madian Khabsa
    AI4MH

Papers citing "Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations"

50 / 289 papers shown
Foot-In-The-Door: A Multi-turn Jailbreak for LLMs
Zixuan Weng
Xiaolong Jin
Jinyuan Jia
Xiaotian Zhang
AAML
149
0
0
27 Feb 2025
Beyond Surface-Level Patterns: An Essence-Driven Defense Framework Against Jailbreak Attacks in LLMs
Shiyu Xiang
Ansen Zhang
Yanfei Cao
Yang Fan
Ronghao Chen
AAML
62
0
0
26 Feb 2025
Shh, don't say that! Domain Certification in LLMs
Cornelius Emde
Alasdair Paren
Preetham Arvind
Maxime Kayser
Tom Rainforth
Thomas Lukasiewicz
Guohao Li
Philip H. S. Torr
Adel Bibi
53
1
0
26 Feb 2025
Faster, Cheaper, Better: Multi-Objective Hyperparameter Optimization for LLM and RAG Systems
Matthew Barker
Andrew Bell
Evan Thomas
James Carr
Thomas Andrews
Umang Bhatt
87
1
0
25 Feb 2025
Policy-as-Prompt: Rethinking Content Moderation in the Age of Large Language Models
Konstantina Palla
José Luis Redondo García
C. Hauff
Francesco Fabbri
Henrik Lindström
Daniel R. Taber
Andreas Damianou
M. Lalmas
AILaw
67
0
0
25 Feb 2025
Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs
Giulio Zizzo
Giandomenico Cornacchia
Kieran Fraser
Muhammad Zaid Hameed
Ambrish Rawat
Beat Buesser
Mark Purcell
Pin-Yu Chen
P. Sattigeri
Kush R. Varshney
AAML
43
2
0
24 Feb 2025
AISafetyLab: A Comprehensive Framework for AI Safety Evaluation and Improvement
Zhexin Zhang
Leqi Lei
Junxiao Yang
Xijie Huang
Yida Lu
...
Xianqi Lei
C. Pan
Lei Sha
Hairu Wang
Minlie Huang
AAML
48
0
0
24 Feb 2025
Maybe I Should Not Answer That, but... Do LLMs Understand The Safety of Their Inputs?
Maciej Chrabąszcz
Filip Szatkowski
Bartosz Wójcik
Jan Dubiński
Tomasz Trzciński
54
0
0
22 Feb 2025
T2ISafety: Benchmark for Assessing Fairness, Toxicity, and Privacy in Image Generation
Lijun Li
Zhelun Shi
Xuhao Hu
Bowen Dong
Yiran Qin
Xihui Liu
Lu Sheng
Jing Shao
114
1
0
21 Feb 2025
Drift: Decoding-time Personalized Alignments with Implicit User Preferences
Minbeom Kim
Kang-il Lee
Seongho Joo
Hwaran Lee
Thibaut Thonet
Kyomin Jung
AI4TS
121
1
0
20 Feb 2025
RGAR: Recurrence Generation-augmented Retrieval for Factual-aware Medical Question Answering
Sichu Liang
Linhai Zhang
Hongyu Zhu
Wenwen Wang
Yulan He
Deyu Zhou
RALM
48
0
0
19 Feb 2025
UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models
Huawei Lin
Yingjie Lao
Tong Geng
Tan Yu
Weijie Zhao
AAML
SILM
79
2
0
18 Feb 2025
Computational Safety for Generative AI: A Signal Processing Perspective
Pin-Yu Chen
76
1
0
18 Feb 2025
DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing
Yi Wang
Fenghua Weng
Songlin Yang
Zhan Qin
Minlie Huang
Wenjie Wang
KELM
AAML
53
0
0
17 Feb 2025
SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities
Fengqing Jiang
Zhangchen Xu
Yuetai Li
Luyao Niu
Zhen Xiang
Bo-wen Li
Bill Yuchen Lin
Radha Poovendran
KELM
ELM
LRM
83
14
0
17 Feb 2025
Prompt Inject Detection with Generative Explanation as an Investigative Tool
Jonathan Pan
Swee Liang Wong
Yidi Yuan
Xin Wei Chia
SILM
51
0
0
16 Feb 2025
Commercial LLM Agents Are Already Vulnerable to Simple Yet Dangerous Attacks
Ang Li
Yin Zhou
Vethavikashini Chithrra Raghuram
Tom Goldstein
Micah Goldblum
AAML
83
7
0
12 Feb 2025
Adversarial ML Problems Are Getting Harder to Solve and to Evaluate
Javier Rando
Jie Zhang
Nicholas Carlini
F. Tramèr
AAML
ELM
61
3
0
04 Feb 2025
Peering Behind the Shield: Guardrail Identification in Large Language Models
Ziqing Yang
Yixin Wu
Rui Wen
Michael Backes
Yang Zhang
63
1
0
03 Feb 2025
GuardReasoner: Towards Reasoning-based LLM Safeguards
Yue Liu
Hongcheng Gao
Shengfang Zhai
Jun-Xiong Xia
Tianyi Wu
Zhiwei Xue
Y. Chen
Kenji Kawaguchi
Jiaheng Zhang
Bryan Hooi
AI4TS
LRM
131
14
0
30 Jan 2025
Smoothed Embeddings for Robust Language Models
Ryo Hase
Md. Rafi Ur Rashid
Ashley Lewis
Jing Liu
T. Koike-Akino
K. Parsons
Yunhong Wang
AAML
46
0
0
27 Jan 2025
Refining Input Guardrails: Enhancing LLM-as-a-Judge Efficiency Through Chain-of-Thought Fine-Tuning and Alignment
Melissa Kazemi Rad
Huy Nghiem
Andy Luo
Sahil Wadhwa
Mohammad Sorower
Stephen Rawls
AAML
93
2
0
22 Jan 2025
Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates
Kaifeng Lyu
Haoyu Zhao
Xinran Gu
Dingli Yu
Anirudh Goyal
Sanjeev Arora
ALM
82
44
0
20 Jan 2025
From Scarcity to Capability: Empowering Fake News Detection in Low-Resource Languages with LLMs
Hrithik Majumdar Shibu
Shrestha Datta
Md. Sumon Miah
Nasrullah Sami
Mahruba Sharmin Chowdhury
Md. Saiful Islam
64
0
0
17 Jan 2025
Integrating LLMs with ITS: Recent Advances, Potentials, Challenges, and Future Directions
Doaa Mahmud
Hadeel Hajmohamed
Shamma Almentheri
Shamma Alqaydi
Lameya Aldhaheri
R. A. Khalil
Nasir Saeed
AI4TS
38
5
0
08 Jan 2025
Behind Closed Words: Creating and Investigating the forePLay Annotated Dataset for Polish Erotic Discourse
Anna Kołos
Katarzyna Lorenc
Emilia Wisnios
Agnieszka Karlinska
36
0
0
08 Jan 2025
MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue
Fengxiang Wang
Ranjie Duan
Peng Xiao
Xiaojun Jia
Shiji Zhao
...
Hang Su
Jialing Tao
Hui Xue
Jun Zhu
Hui Xue
LLMAG
64
7
0
08 Jan 2025
Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning
Alex Beutel
Kai Y. Xiao
Johannes Heidecke
Lilian Weng
AAML
43
3
0
24 Dec 2024
The Evolution of LLM Adoption in Industry Data Curation Practices
Crystal Qian
Michael Xieyang Liu
Emily Reif
Grady Simon
Nada Hussein
Nathan Clement
James Wexler
Carrie J. Cai
Michael Terry
Minsuk Kahng
AILaw
ELM
77
4
0
20 Dec 2024
JailPO: A Novel Black-box Jailbreak Framework via Preference Optimization against Aligned LLMs
Hao Li
Jiawei Ye
Jie Wu
Tianjie Yan
Chu Wang
Zhixin Li
AAML
72
0
0
20 Dec 2024
Lightweight Safety Classification Using Pruned Language Models
Mason Sawtell
Tula Masterman
Sandi Besen
Jim Brown
94
2
0
18 Dec 2024
SpearBot: Leveraging Large Language Models in a Generative-Critique Framework for Spear-Phishing Email Generation
Qinglin Qi
Yun Luo
Yijia Xu
Wenbo Guo
Yong Fang
AAML
86
2
0
15 Dec 2024
SafetyDPO: Scalable Safety Alignment for Text-to-Image Generation
Runtao Liu
Chen I Chieh
Jindong Gu
Jipeng Zhang
Renjie Pi
Qifeng Chen
Philip H. S. Torr
Ashkan Khakzar
Fabio Pizzati
EGVM
109
0
0
13 Dec 2024
Machine Unlearning Doesn't Do What You Think: Lessons for Generative AI Policy, Research, and Practice
A. Feder Cooper
Christopher A. Choquette-Choo
Miranda Bogen
Matthew Jagielski
Katja Filippova
...
Abigail Z. Jacobs
Andreas Terzis
Hanna M. Wallach
Nicolas Papernot
Katherine Lee
AILaw
MU
93
10
0
09 Dec 2024
Towards Data Governance of Frontier AI Models
Jason Hausenloy
Duncan McClements
Madhavendra Thakur
74
1
0
05 Dec 2024
Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach
T. T. Wang
John Hughes
Henry Sleight
Rylan Schaeffer
Rajashree Agrawal
Fazl Barez
Mrinank Sharma
Jesse Mu
Nir Shavit
Ethan Perez
AAML
92
4
0
03 Dec 2024
Time-Reversal Provides Unsupervised Feedback to LLMs
Yerram Varun
Rahul Madhavan
Sravanti Addepalli
A. Suggala
Karthikeyan Shanmugam
Prateek Jain
LRM
SyDa
64
0
0
03 Dec 2024
Sensitive Content Classification in Social Media: A Holistic Resource and Evaluation
Dimosthenis Antypas
Indira Sen
Carla Pérez-Almendros
Jose Camacho-Collados
Francesco Barbieri
69
1
0
29 Nov 2024
Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models
Shuyang Hao
Bryan Hooi
Jiaheng Liu
Kai-Wei Chang
Zi Huang
Yujun Cai
AAML
92
1
0
27 Nov 2024
Safe to Serve: Aligning Instruction-Tuned Models for Safety and Helpfulness
Avinash Amballa
Durga Sandeep Saluru
Gayathri Akkinapalli
Abhishek Sureddy
Akshay Kumar Sureddy
ALM
90
0
0
26 Nov 2024
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
Dawei Li
Bohan Jiang
Liangjie Huang
Alimohammad Beigi
Chengshuai Zhao
...
Canyu Chen
Tianhao Wu
Kai Shu
Lu Cheng
Huan Liu
ELM
AILaw
120
67
0
25 Nov 2024
ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain
Haochen Zhao
Xiangru Tang
Ziran Yang
Xiao Han
Xuanzhi Feng
...
Senhao Cheng
Di Jin
Yilun Zhao
Arman Cohan
Mark B. Gerstein
ELM
83
1
0
23 Nov 2024
Lightweight Safety Guardrails Using Fine-tuned BERT Embeddings
Aaron Zheng
Mansi Rana
Andreas Stolcke
75
1
0
21 Nov 2024
Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering
Xinyan Guan
Yanjiang Liu
Xinyu Lu
Boxi Cao
Xianpei Han
...
Le Sun
Jie Lou
Bowen Yu
Yaojie Lu
Hongyu Lin
ALM
83
2
0
18 Nov 2024
The Dark Side of Trust: Authority Citation-Driven Jailbreak Attacks on Large Language Models
Xikang Yang
Xuehai Tang
Jizhong Han
Songlin Hu
68
0
0
18 Nov 2024
Bias in Large Language Models: Origin, Evaluation, and Mitigation
Yufei Guo
Muzhe Guo
Juntao Su
Zhou Yang
Mengqiu Zhu
Hongfei Li
Mengyang Qiu
Shuo Shuo Liu
AILaw
30
9
0
16 Nov 2024
Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations
Jianfeng Chi
Ujjwal Karn
Hongyuan Zhan
Eric Michael Smith
Javier Rando
Yiming Zhang
Kate Plawiak
Zacharie Delpierre Coudert
Kartikeya Upasani
Mahesh Pasupuleti
MLLM
3DH
49
20
0
15 Nov 2024
Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey
Xuannan Liu
Xing Cui
Peipei Li
Zekun Li
Huaibo Huang
Shuhan Xia
Miaoxuan Zhang
Yueying Zou
Ran He
AAML
67
8
0
14 Nov 2024
PyGen: A Collaborative Human-AI Approach to Python Package Creation
Saikat Barua
Mostafizur Rahman
Md Jafor Sadek
Rafiul Islam
Shehnaz Khaled
Md. Shohrab Hossain
44
1
0
13 Nov 2024
New Emerged Security and Privacy of Pre-trained Model: a Survey and Outlook
Meng Yang
Tianqing Zhu
Chi Liu
Wanlei Zhou
Shui Yu
Philip S. Yu
AAML
ELM
PILM
61
1
0
12 Nov 2024