ResearchTrend.AI
© 2025 ResearchTrend.AI, All rights reserved.

arXiv:2205.02392 · Cited By
Robust Conversational Agents against Imperceptible Toxicity Triggers
5 May 2022 · Ninareh Mehrabi, Ahmad Beirami, Fred Morstatter, Aram Galstyan · [AAML]

Papers citing "Robust Conversational Agents against Imperceptible Toxicity Triggers" (19 papers shown)

  1. EigenShield: Causal Subspace Filtering via Random Matrix Theory for Adversarially Robust Vision-Language Models · Nastaran Darabi, Devashri Naik, Sina Tayebati, Dinithi Jayasuriya, Ranganath Krishnan, A. R. Trivedi · [AAML] · 24 Feb 2025
  2. Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning · Alex Beutel, Kai Y. Xiao, Johannes Heidecke, Lilian Weng · [AAML] · 24 Dec 2024
  3. Are Language Models Agnostic to Linguistically Grounded Perturbations? A Case Study of Indic Languages · Poulami Ghosh, Raj Dabre, Pushpak Bhattacharyya · [AAML] · 14 Dec 2024
  4. Decoding Hate: Exploring Language Models' Reactions to Hate Speech · Paloma Piot, Javier Parapar · 01 Oct 2024
  5. Kov: Transferable and Naturalistic Black-Box LLM Attacks using Markov Decision Processes and Tree Search · Robert J. Moss · [AAML] · 11 Aug 2024
  6. ESCoT: Towards Interpretable Emotional Support Dialogue Systems · Tenggan Zhang, Xinjie Zhang, Jinming Zhao, Li Zhou, Qin Jin · 16 Jun 2024
  7. White-box Multimodal Jailbreaks Against Large Vision-Language Models · Ruofan Wang, Xingjun Ma, Hanxu Zhou, Chuanjun Ji, Guangnan Ye, Yu-Gang Jiang · [AAML, VLM] · 28 May 2024
  8. Gradient-Based Language Model Red Teaming · Nevan Wichers, Carson E. Denison, Ahmad Beirami · 30 Jan 2024
  9. JAB: Joint Adversarial Prompting and Belief Augmentation · Ninareh Mehrabi, Palash Goyal, Anil Ramakrishna, Jwala Dhamala, Shalini Ghosh, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta · [AAML] · 16 Nov 2023
  10. Prompts have evil twins · Rimon Melamed, Lucas H. McCabe, T. Wakhare, Yejin Kim, H. H. Huang, Enric Boix-Adsera · 13 Nov 2023
  11. Break it, Imitate it, Fix it: Robustness by Generating Human-Like Attacks · Aradhana Sinha, Ananth Balashankar, Ahmad Beirami, Thi Avrahami, Jilin Chen, Alex Beutel · [AAML] · 25 Oct 2023
  12. Privacy Preserving Large Language Models: ChatGPT Case Study Based Vision and Framework · Imdad Ullah, Najm Hassan, S. Gill, Basem Suleiman, T. Ahanger, Zawar Shah, Junaid Qadir, S. Kanhere · 19 Oct 2023
  13. FLIRT: Feedback Loop In-context Red Teaming · Ninareh Mehrabi, Palash Goyal, Christophe Dupuy, Qian Hu, Shalini Ghosh, R. Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta · [DiffM] · 08 Aug 2023
  14. Visual Adversarial Examples Jailbreak Aligned Large Language Models · Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi Wang, Prateek Mittal · [AAML] · 22 Jun 2023
  15. Run Like a Girl! Sports-Related Gender Bias in Language and Vision · S. Harrison, Eleonora Gualdoni, Gemma Boleda · 23 May 2023
  16. Learn What NOT to Learn: Towards Generative Safety in Chatbots · Leila Khalatbari, Yejin Bang, Dan Su, Willy Chung, Saeedeh Ghadimi, Hossein Sameti, Pascale Fung · 21 Apr 2023
  17. Language Model Behavior: A Comprehensive Survey · Tyler A. Chang, Benjamin Bergen · [VLM, LRM, LM&MA] · 20 Mar 2023
  18. Towards Safer Generative Language Models: A Survey on Safety Risks, Evaluations, and Improvements · Jiawen Deng, Jiale Cheng, Hao Sun, Zhexin Zhang, Minlie Huang · [LM&MA, ELM] · 18 Feb 2023
  19. Why So Toxic? Measuring and Triggering Toxic Behavior in Open-Domain Chatbots · Waiman Si, Michael Backes, Jeremy Blackburn, Emiliano De Cristofaro, Gianluca Stringhini, Savvas Zannettou, Yang Zhang · 07 Sep 2022