Concrete Problems in AI Safety

21 June 2016

Papers citing "Concrete Problems in AI Safety"

50 / 479 papers shown

Guiding Pretraining in Reinforcement Learning with Large Language Models Yuqing Du Olivia Watkins Zihan Wang Cédric Colas Trevor Darrell Pieter Abbeel Abhishek Gupta Jacob Andreas LM&Ro 25 175 0 13 Feb 2023
Probabilistic Circuits That Know What They Don't Know Fabrizio G. Ventola Steven Braun Zhongjie Yu Martin Mundt Kristian Kersting UQCV TPM 32 7 0 13 Feb 2023
Unlabeled Imperfect Demonstrations in Adversarial Imitation Learning Yunke Wang Bo Du Chang Xu 38 8 0 13 Feb 2023
Modeling Moral Choices in Social Dilemmas with Multi-Agent Reinforcement Learning Elizaveta Tennant Stephen Hailes Mirco Musolesi 19 17 0 20 Jan 2023
Iterated Decomposition: Improving Science Q&A by Supervising Reasoning Processes Justin Reppert Ben Rachbach Charlie George Luke Stebbing Ju-Seung Byun Maggie Appleton Andreas Stuhlmuller ReLM LRM 43 17 0 04 Jan 2023
Don't do it: Safer Reinforcement Learning With Rule-based Guidance Ekaterina Nikonova Cheng Xue Jochen Renz 32 0 0 28 Dec 2022
Methodological reflections for AI alignment research using human feedback Thilo Hagendorff Sarah Fabi 24 6 0 22 Dec 2022
Circumventing interpretability: How to defeat mind-readers Lee D. Sharkey 35 3 0 21 Dec 2022
Target Conditioned Representation Independence (TCRI); From Domain-Invariant to Domain-General Representations Olawale Salaudeen Oluwasanmi Koyejo 32 2 0 21 Dec 2022
Discovering Language Model Behaviors with Model-Written Evaluations Ethan Perez Sam Ringer Kamilė Lukošiūtė Karina Nguyen Edwin Chen ... Danny Hernandez Deep Ganguli Evan Hubinger Nicholas Schiefer Jared Kaplan ALM 22 367 0 19 Dec 2022
MoDem: Accelerating Visual Model-Based Reinforcement Learning with Demonstrations Nicklas Hansen Yixin Lin H. Su Xiaolong Wang Vikash Kumar Aravind Rajeswaran OffRL 32 49 0 12 Dec 2022
Targeted Adversarial Attacks on Deep Reinforcement Learning Policies via Model Checking Dennis Gross T. D. Simão N. Jansen G. Pérez AAML 46 2 0 10 Dec 2022
Online Shielding for Reinforcement Learning Bettina Könighofer Julian Rudolf Alexander Palmisano Martin Tappler Roderick Bloem OffRL 14 21 0 04 Dec 2022
Melting Pot 2.0 J. Agapiou A. Vezhnevets Edgar A. Duéñez-Guzmán Jayd Matyas Yiran Mao ... Sukhdeep Singh Julia Haas Igor Mordatch D. Mobbs Joel Z Leibo 45 32 0 24 Nov 2022
A Brief Overview of AI Governance for Responsible Machine Learning Systems Navdeep Gill Abhishek Mathur Marcos V. Conde 29 5 0 21 Nov 2022
Reward Gaming in Conditional Text Generation Richard Yuanzhe Pang Vishakh Padmakumar Thibault Sellam Ankur P. Parikh He He 35 24 0 16 Nov 2022
Rewards Encoding Environment Dynamics Improves Preference-based Reinforcement Learning Katherine Metcalf Miguel Sarabia B. Theobald OffRL 38 4 0 12 Nov 2022
Calibrated Perception Uncertainty Across Objects and Regions in Bird's-Eye-View Markus Kängsepp Meelis Kull UQCV 15 4 0 08 Nov 2022
Progress and summary of reinforcement learning on energy management of MPS-EV Jincheng Hu Yang Lin Liang Chu Zhuoran Hou Jihan Li Jingjing Jiang Yuanjian Zhang 23 12 0 08 Nov 2022
Interpreting deep learning output for out-of-distribution detection Damian J. Matuszewski I. Sintorn OODD 32 1 0 07 Nov 2022
Measuring Progress on Scalable Oversight for Large Language Models Sam Bowman Jeeyoon Hyun Ethan Perez Edwin Chen Craig Pettit ... Tristan Hume Yuntao Bai Zac Hatfield-Dodds Benjamin Mann Jared Kaplan ALM ELM 28 123 0 04 Nov 2022
Reward Shaping Using Convolutional Neural Network Hani Sami Hadi Otrok Jamal Bentahar Azzam Mourad Ernesto Damiani 29 3 0 30 Oct 2022
Redefining Counterfactual Explanations for Reinforcement Learning: Overview, Challenges and Opportunities Jasmina Gajcin Ivana Dusparic CML OffRL 35 8 0 21 Oct 2022
Hierarchical Model-Based Imitation Learning for Planning in Autonomous Driving Eli Bronstein Mark Palatucci Dominik Notz Brandyn White Alex Kuefler ... Punit Shah Evan Racah Benjamin Frenkel Shimon Whiteson Drago Anguelov 47 58 0 18 Oct 2022
Learning Control Admissibility Models with Graph Neural Networks for Multi-Agent Navigation Chenning Yu Hong-Den Yu Sicun Gao 42 17 0 17 Oct 2022
Prompting GPT-3 To Be Reliable Chenglei Si Zhe Gan Zhengyuan Yang Shuohang Wang Jianfeng Wang Jordan L. Boyd-Graber Lijuan Wang KELM LRM 60 283 0 17 Oct 2022
Microscopy is All You Need Sergei V. Kalinin Rama K Vasudevan Yongtao Liu Ayana Ghosh Kevin M. Roccapriore M. Ziatdinov 30 0 0 12 Oct 2022
Robust Models are less Over-Confident Julia Grabinski Paul Gavrikov J. Keuper M. Keuper AAML 36 24 0 12 Oct 2022
Learning Control Policies for Stochastic Systems with Reach-avoid Guarantees Dorde Zikelic Mathias Lechner T. Henzinger K. Chatterjee 24 22 0 11 Oct 2022
Artificial virtuous agents in a multiagent tragedy of the commons Jakob Stenseke 32 6 0 06 Oct 2022
Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals Rohin Shah Vikrant Varma Ramana Kumar Mary Phuong Victoria Krakovna J. Uesato Zachary Kenton 40 68 0 04 Oct 2022
ROAD-R: The Autonomous Driving Dataset with Logical Requirements Eleonora Giunchiglia Mihaela C. Stoian Salman Khan Fabio Cuzzolin Thomas Lukasiewicz AI4TS 47 31 0 04 Oct 2022
Safe Reinforcement Learning From Pixels Using a Stochastic Latent Representation Yannick Hogewind T. D. Simão Tal Kachman N. Jansen 21 10 0 02 Oct 2022
GaIA: Graphical Information Gain based Attention Network for Weakly Supervised Point Cloud Semantic Segmentation Min Seok Lee Seok Woo Yang S. W. Han 3DPC 22 21 0 02 Oct 2022
Causal Proxy Models for Concept-Based Model Explanations Zhengxuan Wu Karel D'Oosterlinck Atticus Geiger Amir Zur Christopher Potts MILM 83 35 0 28 Sep 2022
Strong Transferable Adversarial Attacks via Ensembled Asymptotically Normal Distribution Learning Zhengwei Fang Rui Wang Tao Huang L. Jing AAML 40 5 0 24 Sep 2022
Extremely Simple Activation Shaping for Out-of-Distribution Detection Andrija Djurisic Nebojsa Bozanic Arjun Ashok Rosanne Liu OODD 172 151 0 20 Sep 2022
An information-theoretic perspective on intrinsic motivation in reinforcement learning: a survey A. Aubret L. Matignon S. Hassas 39 35 0 19 Sep 2022
A Unifying Framework for Online Optimization with Long-Term Constraints Matteo Castiglioni A. Celli A. Marchesi Giulia Romano N. Gatti 25 34 0 15 Sep 2022
The Alignment Problem from a Deep Learning Perspective Richard Ngo Lawrence Chan Sören Mindermann 68 183 0 30 Aug 2022
SAFE: Sensitivity-Aware Features for Out-of-Distribution Object Detection Samuel Wilson Tobias Fischer Feras Dayoub Dimity Miller Niko Sünderhauf OODD 31 29 0 29 Aug 2022
What Do NLP Researchers Believe? Results of the NLP Community Metasurvey Julian Michael Ari Holtzman Alicia Parrish Aaron Mueller Alex Jinpeng Wang ... Divyam Madaan Nikita Nangia Richard Yuanzhe Pang Jason Phang Sam Bowman 30 37 0 26 Aug 2022
Calibrated Selective Classification Adam Fisch Tommi Jaakkola Regina Barzilay 31 17 0 25 Aug 2022
Learning Task Automata for Reinforcement Learning using Hidden Markov Models Alessandro Abate Y. Almulla James Fox David Hyland Michael Wooldridge OffRL 30 6 0 25 Aug 2022
Towards Augmented Microscopy with Reinforcement Learning-Enhanced Workflows Michael Xu Abinash Kumar J. Lebeau 18 7 0 04 Aug 2022
Out-of-Distribution Detection with Semantic Mismatch under Masking Yijun Yang Ruiyuan Gao Qiang Xu OODD 24 27 0 31 Jul 2022
Sample-efficient Safe Learning for Online Nonlinear Control with Control Barrier Functions Wenhao Luo Wen Sun Ashish Kapoor OffRL 43 9 0 29 Jul 2022
Membership Inference Attacks via Adversarial Examples Hamid Jalalzai Elie Kadoche Rémi Leluc Vincent Plassier AAML FedML MIACV 45 7 0 27 Jul 2022
Active Exploration for Inverse Reinforcement Learning David Lindner Andreas Krause Giorgia Ramponi 29 24 0 18 Jul 2022
Reinforcement Learning For Survival, A Clinically Motivated Method For Critically Ill Patients Thesath Nanayakkara OOD OffRL 24 0 0 17 Jul 2022