Robust Feature-Level Adversaries are Interpretability Tools

7 October 2021

Stephen Casper

Max Nadeau

Dylan Hadfield-Menell

Gabriel Kreiman

AAML


Papers citing "Robust Feature-Level Adversaries are Interpretability Tools"

24 / 24 papers shown

A Survey of Adversarial Defenses in Vision-based Systems: Categorization, Methods and Challenges Nandish Chattopadhyay Abdul Basit B. Ouni Muhammad Shafique AAML 31 0 0 01 Mar 2025
MERIT: Multi-view evidential learning for reliable and interpretable liver fibrosis staging Yuanye Liu Zheyao Gao Nannan Shi Fuping Wu Yuxin Shi Qingchao Chen Xiahai Zhuang 27 2 0 05 May 2024
The SaTML '24 CNN Interpretability Competition: New Innovations for Concept-Level Interpretability Stephen Casper Jieun Yun Joonhyuk Baek Yeseong Jung Minhwan Kim ... A. Nicolson Arush Tagade Jessica Rumbelow Hieu Minh Nguyen Dylan Hadfield-Menell 19 2 0 03 Apr 2024
How to Train your Antivirus: RL-based Hardening through the Problem-Space Jacopo Cortellazzi Ilias Tsingenopoulos B. Bosanský Simone Aonzo Davy Preuveneers Wouter Joosen Fabio Pierazzi Lorenzo Cavallaro 21 2 0 29 Feb 2024
Exploring higher-order neural network node interactions with total correlation Thomas Kerby Teresa White Kevin Moon 22 0 0 06 Feb 2024
Transcending Adversarial Perturbations: Manifold-Aided Adversarial Examples with Legitimate Semantics Shuai Li Xiaoyu Jiang Xiaoguang Ma AAML 21 0 0 05 Feb 2024
Black-Box Access is Insufficient for Rigorous AI Audits Stephen Casper Carson Ezell Charlotte Siegmann Noam Kolt Taylor Lynn Curtis ... Michael Gerovitch David Bau Max Tegmark David M. Krueger Dylan Hadfield-Menell AAML 34 78 0 25 Jan 2024
DiG-IN: Diffusion Guidance for Investigating Networks -- Uncovering Classifier Differences, Neuron Visualisations, and Visual Counterfactual Explanations Maximilian Augustin Yannic Neuhaus Matthias Hein DiffM 37 4 0 29 Nov 2023
Adversarial Doodles: Interpretable and Human-drawable Attacks Provide Describable Insights Ryoya Nara Yusuke Matsui AAML 29 0 0 27 Nov 2023
Corrupting Neuron Explanations of Deep Visual Features Divyansh Srivastava Tuomas P. Oikarinen Tsui-Wei Weng FAtt AAML 17 2 0 25 Oct 2023
Investigating the Adversarial Robustness of Density Estimation Using the Probability Flow ODE Marius Arvinte Cory Cornelius Jason Martin N. Himayat DiffM 46 3 0 10 Oct 2023
Physical Adversarial Attacks For Camera-based Smart Systems: Current Trends, Categorization, Applications, Research Challenges, and Future Outlook Amira Guesmi Muhammad Abdullah Hanif B. Ouni Muhammad Shafique AAML 23 21 0 11 Aug 2023
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback Stephen Casper Xander Davies Claudia Shi T. Gilbert Jérémy Scheurer ... Erdem Biyik Anca Dragan David M. Krueger Dorsa Sadigh Dylan Hadfield-Menell ALM OffRL 47 472 0 27 Jul 2023
Red Teaming Deep Neural Networks with Feature Synthesis Tools Stephen Casper Yuxiao Li Jiawei Li Tong Bu Ke Zhang K. Hariharan Dylan Hadfield-Menell AAML 32 15 0 08 Feb 2023
Diagnostics for Deep Neural Networks with Automated Copy/Paste Attacks Stephen Casper K. Hariharan Dylan Hadfield-Menell AAML 18 11 0 18 Nov 2022
Physical Adversarial Attack meets Computer Vision: A Decade Survey Hui Wei Hao Tang Xuemei Jia Zhixiang Wang Han-Bing Yu Zhubo Li Shin'ichi Satoh Luc Van Gool Zheng Wang AAML 29 43 0 30 Sep 2022
A Survey on Physical Adversarial Attack in Computer Vision Donghua Wang Wen Yao Tingsong Jiang Guijian Tang Xiaoqian Chen AAML 56 38 0 28 Sep 2022
Discovering Bugs in Vision Models using Off-the-shelf Image Generation and Captioning Olivia Wiles Isabela Albuquerque Sven Gowal VLM 35 47 0 18 Aug 2022
Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks Tilman Räuker A. Ho Stephen Casper Dylan Hadfield-Menell AAML AI4CE 23 124 0 27 Jul 2022
Adversarial Training for High-Stakes Reliability Daniel M. Ziegler Seraphina Nix Lawrence Chan Tim Bauman Peter Schmidt-Nielsen ... Noa Nabeshima Benjamin Weinstein-Raun D. Haas Buck Shlegeris Nate Thomas AAML 30 59 0 03 May 2022
Adversarial Neon Beam: A Light-based Physical Attack to DNNs Chen-Hao Hu Weiwen Shi Wen Li AAML 35 8 0 02 Apr 2022
Natural Language Descriptions of Deep Visual Features Evan Hernandez Sarah Schwettmann David Bau Teona Bagashvili Antonio Torralba Jacob Andreas MILM 201 117 0 26 Jan 2022
Constructing Unrestricted Adversarial Examples with Generative Models Yang Song Rui Shu Nate Kushman Stefano Ermon GAN AAML 185 302 0 21 May 2018
Adversarial examples in the physical world Alexey Kurakin Ian Goodfellow Samy Bengio SILM AAML 287 5,837 0 08 Jul 2016