Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2010.07079
Cited By
Recipes for Safety in Open-domain Chatbots
14 October 2020
Jing Xu
Da Ju
Margaret Li
Y-Lan Boureau
Jason Weston
Emily Dinan
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Recipes for Safety in Open-domain Chatbots"
50 / 50 papers shown
Title
SAGE
\texttt{SAGE}
SAGE
: A Generic Framework for LLM Safety Evaluation
Madhur Jindal
Hari Shrawgi
Parag Agrawal
Sandipan Dandapat
ELM
47
0
0
28 Apr 2025
Bridging Today and the Future of Humanity: AI Safety in 2024 and Beyond
Shanshan Han
84
1
0
09 Oct 2024
From Pixels to Personas: Investigating and Modeling Self-Anthropomorphism in Human-Robot Dialogues
Yu Li
Devamanyu Hazarika
Di Jin
Julia Hirschberg
Yang Liu
28
0
0
04 Oct 2024
Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation
Xinpeng Wang
Chengzhi Hu
Paul Röttger
Barbara Plank
46
6
0
04 Oct 2024
Understanding Generative AI Content with Embedding Models
Max Vargas
Reilly Cannon
A. Engel
Anand D. Sarwate
Tony Chiang
52
3
0
19 Aug 2024
Compositional Capabilities of Autoregressive Transformers: A Study on Synthetic, Interpretable Tasks
Rahul Ramesh
Ekdeep Singh Lubana
Mikail Khona
Robert P. Dick
Hidenori Tanaka
CoGe
36
6
0
21 Nov 2023
MART: Improving LLM Safety with Multi-round Automatic Red-Teaming
Suyu Ge
Chunting Zhou
Rui Hou
Madian Khabsa
Yi-Chia Wang
Qifan Wang
Jiawei Han
Yuning Mao
AAML
LRM
22
93
0
13 Nov 2023
GRASP: A Disagreement Analysis Framework to Assess Group Associations in Perspectives
Vinodkumar Prabhakaran
Christopher Homan
Lora Aroyo
Aida Mostafazadeh Davani
Alicia Parrish
Alex S. Taylor
Mark Díaz
Ding Wang
Greg Serapio-García
37
9
0
09 Nov 2023
Evaluating Chatbots to Promote Users' Trust -- Practices and Open Problems
Biplav Srivastava
Kausik Lakkaraju
T. Koppel
Vignesh Narayanan
Ashish Kundu
Sachindra Joshi
31
2
0
09 Sep 2023
Certifying LLM Safety against Adversarial Prompting
Aounon Kumar
Chirag Agarwal
Suraj Srinivas
Aaron Jiaxun Li
S. Feizi
Himabindu Lakkaraju
AAML
27
164
0
06 Sep 2023
AutoConv: Automatically Generating Information-seeking Conversations with Large Language Models
Siheng Li
Cheng Yang
Yichun Yin
Xinyu Zhu
Ze-Long Cheng
Lifeng Shang
Xin Jiang
Qun Liu
Yujiu Yang
SyDa
32
3
0
12 Aug 2023
GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher
Youliang Yuan
Wenxiang Jiao
Wenxuan Wang
Jen-tse Huang
Pinjia He
Shuming Shi
Zhaopeng Tu
SILM
76
232
0
12 Aug 2023
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron
Louis Martin
Kevin R. Stone
Peter Albert
Amjad Almahairi
...
Sharan Narang
Aurelien Rodriguez
Robert Stojnic
Sergey Edunov
Thomas Scialom
AI4MH
ALM
105
11,007
0
18 Jul 2023
Jailbroken: How Does LLM Safety Training Fail?
Alexander Wei
Nika Haghtalab
Jacob Steinhardt
86
839
0
05 Jul 2023
Intersectionality in Conversational AI Safety: How Bayesian Multilevel Models Help Understand Diverse Perceptions of Safety
Christopher Homan
Greg Serapio-García
Lora Aroyo
Mark Díaz
Alicia Parrish
Vinodkumar Prabhakaran
Alex S. Taylor
Ding Wang
22
9
0
20 Jun 2023
Model-Based Simulation for Optimising Smart Reply
Benjamin Towle
Ke Zhou
32
1
0
26 May 2023
Reducing Sensitivity on Speaker Names for Text Generation from Dialogues
Qi Jia
Haifeng Tang
Kenny Q. Zhu
24
2
0
23 May 2023
ReSeTOX: Re-learning attention weights for toxicity mitigation in machine translation
Javier García Gilabert
Carlos Escolano
Marta R. Costa-jussá
CLL
MU
26
2
0
19 May 2023
Prompted LLMs as Chatbot Modules for Long Open-domain Conversation
Gibbeum Lee
Volker Hartmann
Jongho Park
Dimitris Papailiopoulos
Kangwook Lee
24
62
0
08 May 2023
Learn What Is Possible, Then Choose What Is Best: Disentangling One-To-Many Relations in Language Through Text-based Games
Benjamin Towle
Ke Zhou
SyDa
25
4
0
14 Apr 2023
Evaluating Human-Language Model Interaction
Mina Lee
Megha Srivastava
Amelia Hardy
John Thickstun
Esin Durmus
...
Hancheng Cao
Tony Lee
Rishi Bommasani
Michael S. Bernstein
Percy Liang
LM&MA
ALM
58
99
0
19 Dec 2022
On Safe and Usable Chatbots for Promoting Voter Participation
Bharath Muppasani
Vishal Pallagani
Kausik Lakkaraju
Shuge Lei
Biplav Srivastava
Brett W. Robertson
Andrea A. Hickerson
Vignesh Narayanan
21
2
0
16 Dec 2022
Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai
Saurav Kadavath
Sandipan Kundu
Amanda Askell
John Kernion
...
Dario Amodei
Nicholas Joseph
Sam McCandlish
Tom B. Brown
Jared Kaplan
SyDa
MoMe
82
1,477
0
15 Dec 2022
RHO (
ρ
ρ
ρ
): Reducing Hallucination in Open-domain Dialogues with Knowledge Grounding
Ziwei Ji
Zihan Liu
Nayeon Lee
Tiezheng Yu
Bryan Wilie
Mini Zeng
Pascale Fung
HILM
23
53
0
03 Dec 2022
Risk-graded Safety for Handling Medical Queries in Conversational AI
Gavin Abercrombie
Verena Rieser
AI4MH
38
11
0
02 Oct 2022
Why So Toxic? Measuring and Triggering Toxic Behavior in Open-Domain Chatbots
Waiman Si
Michael Backes
Jeremy Blackburn
Emiliano De Cristofaro
Gianluca Stringhini
Savvas Zannettou
Yang Zhang
36
58
0
07 Sep 2022
Neural Generation Meets Real People: Building a Social, Informative Open-Domain Dialogue Agent
Ethan A. Chi
Ashwin Paranjape
A. See
Caleb Chiam
Trenton Chang
...
Dilara Soylu
Jillian Tang
A. Narayan
Giovanni Campagna
Christopher D. Manning
39
7
0
25 Jul 2022
CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning
Hung Le
Yue Wang
Akhilesh Deepak Gotmare
Silvio Savarese
S. Hoi
SyDa
ALM
129
240
0
05 Jul 2022
DIRECTOR: Generator-Classifiers For Supervised Language Modeling
Kushal Arora
Kurt Shuster
Sainbayar Sukhbaatar
Jason Weston
VLM
30
40
0
15 Jun 2022
Resolving the Human Subjects Status of Machine Learning's Crowdworkers
Divyansh Kaushik
Zachary Chase Lipton
A. London
25
2
0
08 Jun 2022
Target-Guided Dialogue Response Generation Using Commonsense and Data Augmentation
Prakhar Gupta
Harsh Jhamtani
Jeffrey P. Bigham
49
12
0
19 May 2022
OPT: Open Pre-trained Transformer Language Models
Susan Zhang
Stephen Roller
Naman Goyal
Mikel Artetxe
Moya Chen
...
Daniel Simig
Punit Singh Koura
Anjali Sridhar
Tianlu Wang
Luke Zettlemoyer
VLM
OSLM
AI4CE
59
3,488
0
02 May 2022
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai
Andy Jones
Kamal Ndousse
Amanda Askell
Anna Chen
...
Jack Clark
Sam McCandlish
C. Olah
Benjamin Mann
Jared Kaplan
72
2,330
0
12 Apr 2022
Korean Online Hate Speech Dataset for Multilabel Classification: How Can Social Science Improve Dataset on Hate Speech?
Taeyoung Kang
Eunrang Kwon
Junbum Lee
Youngeun Nam
Junmo Song
JeongKyu Suh
11
8
0
07 Apr 2022
Using Pre-Trained Language Models for Producing Counter Narratives Against Hate Speech: a Comparative Study
Serra Sinem Tekiroğlu
Helena Bonaldi
Margherita Fanton
Marco Guerini
24
43
0
04 Apr 2022
PanGu-Bot: Efficient Generative Dialogue Pre-training from Pre-trained Language Model
Fei Mi
Yitong Li
Yulong Zeng
Jingyan Zhou
Yasheng Wang
Chuanfei Xu
Lifeng Shang
Xin Jiang
Shiqi Zhao
Qun Liu
ALM
42
18
0
31 Mar 2022
EVA2.0: Investigating Open-Domain Chinese Dialogue Systems with Large-Scale Pre-Training
Yuxian Gu
Jiaxin Wen
Hao Sun
Yi Song
Pei Ke
...
Zheng Zhang
Jianzhu Yao
Lei Liu
Xiaoyan Zhu
Minlie Huang
21
55
0
17 Mar 2022
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
319
11,953
0
04 Mar 2022
A Literature Survey of Recent Advances in Chatbots
Guendalina Caldarini
Sardar F. Jaf
K. McGarry
AI4CE
35
274
0
17 Jan 2022
Automatic Evaluation and Moderation of Open-domain Dialogue Systems
Chen Zhang
João Sedoc
L. F. D’Haro
Rafael E. Banchs
Alexander I. Rudnicky
22
36
0
03 Nov 2021
Investigating Robustness of Dialog Models to Popular Figurative Language Constructs
Harsh Jhamtani
Varun Gangal
Eduard H. Hovy
Taylor Berg-Kirkpatrick
28
21
0
01 Oct 2021
Automatically Exposing Problems with Neural Dialog Models
Dian Yu
Kenji Sagae
31
9
0
14 Sep 2021
Proto: A Neural Cocktail for Generating Appealing Conversations
Sougata Saha
Souvik Das
Elizabeth Soper
Erin Pacquetet
R. Srihari
26
12
0
06 Sep 2021
Just Say No: Analyzing the Stance of Neural Dialogue Generation in Offensive Contexts
Ashutosh Baheti
Maarten Sap
Alan Ritter
Mark O. Riedl
21
84
0
26 Aug 2021
Anticipating Safety Issues in E2E Conversational AI: Framework and Tooling
Emily Dinan
Gavin Abercrombie
A. S. Bergman
Shannon L. Spruit
Dirk Hovy
Y-Lan Boureau
Verena Rieser
43
105
0
07 Jul 2021
Improving Coherence and Consistency in Neural Sequence Models with Dual-System, Neuro-Symbolic Reasoning
Maxwell Nye
Michael Henry Tessler
J. Tenenbaum
Brenden Lake
33
117
0
06 Jul 2021
Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets
Irene Solaiman
Christy Dennison
30
222
0
18 Jun 2021
Detoxifying Language Models Risks Marginalizing Minority Voices
Albert Xu
Eshaan Pathak
Eric Wallace
Suchin Gururangan
Maarten Sap
Dan Klein
13
121
0
13 Apr 2021
Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection
Bertie Vidgen
Tristan Thrush
Zeerak Talat
Douwe Kiela
23
242
0
31 Dec 2020
GeDi: Generative Discriminator Guided Sequence Generation
Ben Krause
Akhilesh Deepak Gotmare
Bryan McCann
N. Keskar
Shafiq R. Joty
R. Socher
Nazneen Rajani
56
389
0
14 Sep 2020
1