Robust Conversational Agents against Imperceptible Toxicity Triggers
Ninareh Mehrabi, Ahmad Beirami, Fred Morstatter, Aram Galstyan
arXiv:2205.02392 · 5 May 2022

Papers citing "Robust Conversational Agents against Imperceptible Toxicity Triggers" (19 papers)

EigenShield: Causal Subspace Filtering via Random Matrix Theory for Adversarially Robust Vision-Language Models
Nastaran Darabi, Devashri Naik, Sina Tayebati, Dinithi Jayasuriya, Ranganath Krishnan, A. R. Trivedi (24 Feb 2025)

Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning
Alex Beutel, Kai Y. Xiao, Johannes Heidecke, Lilian Weng (24 Dec 2024)

Are Language Models Agnostic to Linguistically Grounded Perturbations? A Case Study of Indic Languages
Poulami Ghosh, Raj Dabre, Pushpak Bhattacharyya (14 Dec 2024)

Decoding Hate: Exploring Language Models' Reactions to Hate Speech
Paloma Piot, Javier Parapar (01 Oct 2024)

Kov: Transferable and Naturalistic Black-Box LLM Attacks using Markov Decision Processes and Tree Search
Robert J. Moss (11 Aug 2024)

ESCoT: Towards Interpretable Emotional Support Dialogue Systems
Tenggan Zhang, Xinjie Zhang, Jinming Zhao, Li Zhou, Qin Jin (16 Jun 2024)

White-box Multimodal Jailbreaks Against Large Vision-Language Models
Ruofan Wang, Xingjun Ma, Hanxu Zhou, Chuanjun Ji, Guangnan Ye, Yu-Gang Jiang (28 May 2024)

Gradient-Based Language Model Red Teaming
Nevan Wichers, Carson E. Denison, Ahmad Beirami (30 Jan 2024)

JAB: Joint Adversarial Prompting and Belief Augmentation
Ninareh Mehrabi, Palash Goyal, Anil Ramakrishna, Jwala Dhamala, Shalini Ghosh, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta (16 Nov 2023)

Prompts have evil twins
Rimon Melamed, Lucas H. McCabe, T. Wakhare, Yejin Kim, H. H. Huang, Enric Boix-Adsera (13 Nov 2023)

Break it, Imitate it, Fix it: Robustness by Generating Human-Like Attacks
Aradhana Sinha, Ananth Balashankar, Ahmad Beirami, Thi Avrahami, Jilin Chen, Alex Beutel (25 Oct 2023)

Privacy Preserving Large Language Models: ChatGPT Case Study Based Vision and Framework
Imdad Ullah, Najm Hassan, S. Gill, Basem Suleiman, T. Ahanger, Zawar Shah, Junaid Qadir, S. Kanhere (19 Oct 2023)

FLIRT: Feedback Loop In-context Red Teaming
Ninareh Mehrabi, Palash Goyal, Christophe Dupuy, Qian Hu, Shalini Ghosh, R. Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta (08 Aug 2023)

Visual Adversarial Examples Jailbreak Aligned Large Language Models
Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi Wang, Prateek Mittal (22 Jun 2023)

Run Like a Girl! Sports-Related Gender Bias in Language and Vision
S. Harrison, Eleonora Gualdoni, Gemma Boleda (23 May 2023)

Learn What NOT to Learn: Towards Generative Safety in Chatbots
Leila Khalatbari, Yejin Bang, Dan Su, Willy Chung, Saeedeh Ghadimi, Hossein Sameti, Pascale Fung (21 Apr 2023)

Language Model Behavior: A Comprehensive Survey
Tyler A. Chang, Benjamin Bergen (20 Mar 2023)

Towards Safer Generative Language Models: A Survey on Safety Risks, Evaluations, and Improvements
Jiawen Deng, Jiale Cheng, Hao Sun, Zhexin Zhang, Minlie Huang (18 Feb 2023)

Why So Toxic? Measuring and Triggering Toxic Behavior in Open-Domain Chatbots
Waiman Si, Michael Backes, Jeremy Blackburn, Emiliano De Cristofaro, Gianluca Stringhini, Savvas Zannettou, Yang Zhang (07 Sep 2022)