Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits

3 June 2024
Andis Draguns, Andrew Gritsevskiy, S. Motwani, Charlie Rogers-Smith, Jeffrey Ladish, Christian Schroeder de Witt
arXiv: 2406.02619

Papers citing "Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits" (30 papers)
1. Teams of LLM Agents can Exploit Zero-Day Vulnerabilities
   Richard Fang, Antony Kellermann, Akul Gupta, Qiusi Zhan, R. Bindu, Daniel Kang (02 Jun 2024) [LLMAG]
2. Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems
   David Dalrymple, Joar Skalse, Yoshua Bengio, Stuart J. Russell, Max Tegmark, ..., Clark Barrett, Ding Zhao, Zhi-Xuan Tan, Jeannette Wing, Joshua Tenenbaum (10 May 2024)
3. Mechanistic Interpretability for AI Safety -- A Review
   Leonard Bereska, E. Gavves (22 Apr 2024) [AI4CE]
4. Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs
   Javier Rando, Francesco Croce, Kryštof Mitka, Stepan Shabalin, Maksym Andriushchenko, Nicolas Flammarion, F. Tramèr (22 Apr 2024)
5. Defending Against Unforeseen Failure Modes with Latent Adversarial Training
   Stephen Casper, Lennart Schulze, Oam Patel, Dylan Hadfield-Menell (08 Mar 2024) [AAML]
6. Secret Collusion among Generative AI Agents: Multi-Agent Deception via Steganography
   S. Motwani, Mikhail Baranchuk, Martin Strohmeier, Vijay Bolina, Philip Torr, Lewis Hammond, Christian Schroeder de Witt (12 Feb 2024)
7. Black-Box Access is Insufficient for Rigorous AI Audits
   Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, ..., Michael Gerovitch, David Bau, Max Tegmark, David M. Krueger, Dylan Hadfield-Menell (25 Jan 2024) [AAML]
8. Setting the Trap: Capturing and Defeating Backdoors in Pretrained Language Models through Honeypots
   Ruixiang Tang, Jiayi Yuan, Yiming Li, Zirui Liu, Rui Chen, Helen Zhou (28 Oct 2023) [AAML]
9. Composite Backdoor Attacks Against Large Language Models
   Hai Huang, Zhengyu Zhao, Michael Backes, Yun Shen, Yang Zhang (11 Oct 2023) [AAML]
10. A Comprehensive Overview of Backdoor Attacks in Large Language Models within Communication Networks
   Haomiao Yang, Kunlan Xiang, Mengyu Ge, Hongwei Li, Rongxing Lu, Shui Yu (28 Aug 2023) [SILM]
11. Balancing Transparency and Risk: The Security and Privacy Risks of Open-Source Machine Learning Models
   Dominik Hintersdorf, Lukas Struppek, Kristian Kersting (18 Aug 2023) [SILM]
12. Universal and Transferable Adversarial Attacks on Aligned Language Models
   Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson (27 Jul 2023)
13. UNICORN: A Unified Backdoor Trigger Inversion Framework
   Zhenting Wang, Kai Mei, Juan Zhai, Shiqing Ma (05 Apr 2023) [LLMSV]
14. Eliciting Latent Predictions from Transformers with the Tuned Lens
   Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor V. Ostrovsky, Lev McKinney, Stella Biderman, Jacob Steinhardt (14 Mar 2023)
15. Selective Amnesia: On Efficient, High-Fidelity and Blind Suppression of Backdoor Effects in Trojaned Machine Learning Models
   Rui Zhu, Di Tang, Siyuan Tang, Xiaofeng Wang, Haixu Tang (09 Dec 2022) [AAML, FedML]
16. Discovering Latent Knowledge in Language Models Without Supervision
   Collin Burns, Haotian Ye, Dan Klein, Jacob Steinhardt (07 Dec 2022)
17. Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
   Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt (01 Nov 2022)
18. Perfectly Secure Steganography Using Minimum Entropy Coupling
   Christian Schroeder de Witt, Samuel Sokota, J. Zico Kolter, Jakob N. Foerster, Martin Strohmeier (24 Oct 2022)
19. Fine-mixing: Mitigating Backdoors in Fine-tuned Language Models
   Zhiyuan Zhang, Lingjuan Lyu, Xingjun Ma, Chenguang Wang, Xu Sun (18 Oct 2022) [AAML]
20. Illusory Attacks: Information-Theoretic Detectability Matters in Adversarial Attacks
   Tim Franzmeyer, Stephen McAleer, João F. Henriques, Jakob N. Foerster, Philip Torr, Adel Bibi, Christian Schroeder de Witt (20 Jul 2022) [AAML]
21. Verifying Neural Networks Against Backdoor Attacks
   Long H. Pham, Jun Sun (14 May 2022) [AAML]
22. Planting Undetectable Backdoors in Machine Learning Models
   S. Goldwasser, Michael P. Kim, Vinod Vaikuntanathan, Or Zamir (14 Apr 2022) [AAML]
23. Red Teaming Language Models with Language Models
   Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, G. Irving (07 Feb 2022) [AAML]
24. An Overview of Backdoor Attacks Against Deep Neural Networks and Possible Defences
   Wei Guo, B. Tondi, Mauro Barni (16 Nov 2021) [AAML]
25. Thinking Like Transformers
   Gail Weiss, Yoav Goldberg, Eran Yahav (13 Jun 2021) [AI4CE]
26. Handcrafted Backdoors in Deep Neural Networks
   Sanghyun Hong, Nicholas Carlini, Alexey Kurakin (08 Jun 2021)
27. Detecting Backdoor in Deep Neural Networks via Intentional Adversarial Perturbations
   Mingfu Xue, Yinghao Wu, Zhiyu Wu, Yushu Zhang, Jian Wang, Weiqiang Liu (29 May 2021) [AAML]
28. Bridging Mode Connectivity in Loss Landscapes and Adversarial Robustness
   Pu Zhao, Pin-Yu Chen, Payel Das, Karthikeyan N. Ramamurthy, Xue Lin (30 Apr 2020) [AAML]
29. Hijacking Malaria Simulators with Probabilistic Programming
   Bradley Gram-Hansen, Christian Schroeder de Witt, Tom Rainforth, Philip Torr, Yee Whye Teh, A. G. Baydin (29 May 2019)
30. Attention Is All You Need
   Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, Illia Polosukhin (12 Jun 2017) [3DV]