Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits

3 June 2024
Andis Draguns, Andrew Gritsevskiy, S. Motwani, Charlie Rogers-Smith, Jeffrey Ladish, Christian Schroeder de Witt
arXiv: 2406.02619

Papers citing "Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits" (30 papers)
1. Teams of LLM Agents can Exploit Zero-Day Vulnerabilities
   Richard Fang, Antony Kellermann, Akul Gupta, Qiusi Zhan, R. Bindu, Daniel Kang (02 Jun 2024) [LLMAG]
2. Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems
   David Dalrymple, Joar Skalse, Yoshua Bengio, Stuart J. Russell, Max Tegmark, ..., Clark Barrett, Ding Zhao, Zhi-Xuan Tan, Jeannette Wing, Joshua Tenenbaum (10 May 2024)
3. Mechanistic Interpretability for AI Safety -- A Review
   Leonard Bereska, E. Gavves (22 Apr 2024) [AI4CE]
4. Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs
   Javier Rando, Francesco Croce, Kryštof Mitka, Stepan Shabalin, Maksym Andriushchenko, Nicolas Flammarion, F. Tramèr (22 Apr 2024)
5. Defending Against Unforeseen Failure Modes with Latent Adversarial Training
   Stephen Casper, Lennart Schulze, Oam Patel, Dylan Hadfield-Menell (08 Mar 2024) [AAML]
6. Secret Collusion among Generative AI Agents: Multi-Agent Deception via Steganography
   S. Motwani, Mikhail Baranchuk, Martin Strohmeier, Vijay Bolina, Philip Torr, Lewis Hammond, Christian Schroeder de Witt (12 Feb 2024)
7. Black-Box Access is Insufficient for Rigorous AI Audits
   Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, ..., Michael Gerovitch, David Bau, Max Tegmark, David M. Krueger, Dylan Hadfield-Menell (25 Jan 2024) [AAML]
8. Setting the Trap: Capturing and Defeating Backdoors in Pretrained Language Models through Honeypots
   Ruixiang Tang, Jiayi Yuan, Yiming Li, Zirui Liu, Rui Chen, Helen Zhou (28 Oct 2023) [AAML]
9. Composite Backdoor Attacks Against Large Language Models
   Hai Huang, Zhengyu Zhao, Michael Backes, Yun Shen, Yang Zhang (11 Oct 2023) [AAML]
10. A Comprehensive Overview of Backdoor Attacks in Large Language Models within Communication Networks
   Haomiao Yang, Kunlan Xiang, Mengyu Ge, Hongwei Li, Rongxing Lu, Shui Yu (28 Aug 2023) [SILM]
11. Balancing Transparency and Risk: The Security and Privacy Risks of Open-Source Machine Learning Models
   Dominik Hintersdorf, Lukas Struppek, Kristian Kersting (18 Aug 2023) [SILM]
12. Universal and Transferable Adversarial Attacks on Aligned Language Models
   Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson (27 Jul 2023)
13. UNICORN: A Unified Backdoor Trigger Inversion Framework
   Zhenting Wang, Kai Mei, Juan Zhai, Shiqing Ma (05 Apr 2023) [LLMSV]
14. Eliciting Latent Predictions from Transformers with the Tuned Lens
   Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor V. Ostrovsky, Lev McKinney, Stella Biderman, Jacob Steinhardt (14 Mar 2023)
15. Selective Amnesia: On Efficient, High-Fidelity and Blind Suppression of Backdoor Effects in Trojaned Machine Learning Models
   Rui Zhu, Di Tang, Siyuan Tang, Xiaofeng Wang, Haixu Tang (09 Dec 2022) [AAML, FedML]
16. Discovering Latent Knowledge in Language Models Without Supervision
   Collin Burns, Haotian Ye, Dan Klein, Jacob Steinhardt (07 Dec 2022)
17. Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
   Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt (01 Nov 2022)
18. Perfectly Secure Steganography Using Minimum Entropy Coupling
   Christian Schroeder de Witt, Samuel Sokota, J. Zico Kolter, Jakob N. Foerster, Martin Strohmeier (24 Oct 2022)
19. Fine-mixing: Mitigating Backdoors in Fine-tuned Language Models
   Zhiyuan Zhang, Lingjuan Lyu, Xingjun Ma, Chenguang Wang, Xu Sun (18 Oct 2022) [AAML]
20. Illusory Attacks: Information-Theoretic Detectability Matters in Adversarial Attacks
   Tim Franzmeyer, Stephen McAleer, João F. Henriques, Jakob N. Foerster, Philip Torr, Adel Bibi, Christian Schroeder de Witt (20 Jul 2022) [AAML]
21. Verifying Neural Networks Against Backdoor Attacks
   Long H. Pham, Jun Sun (14 May 2022) [AAML]
22. Planting Undetectable Backdoors in Machine Learning Models
   S. Goldwasser, Michael P. Kim, Vinod Vaikuntanathan, Or Zamir (14 Apr 2022) [AAML]
23. Red Teaming Language Models with Language Models
   Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, G. Irving (07 Feb 2022) [AAML]
24. An Overview of Backdoor Attacks Against Deep Neural Networks and Possible Defences
   Wei Guo, B. Tondi, Mauro Barni (16 Nov 2021) [AAML]
25. Thinking Like Transformers
   Gail Weiss, Yoav Goldberg, Eran Yahav (13 Jun 2021) [AI4CE]
26. Handcrafted Backdoors in Deep Neural Networks
   Sanghyun Hong, Nicholas Carlini, Alexey Kurakin (08 Jun 2021)
27. Detecting Backdoor in Deep Neural Networks via Intentional Adversarial Perturbations
   Mingfu Xue, Yinghao Wu, Zhiyu Wu, Yushu Zhang, Jian Wang, Weiqiang Liu (29 May 2021) [AAML]
28. Bridging Mode Connectivity in Loss Landscapes and Adversarial Robustness
   Pu Zhao, Pin-Yu Chen, Payel Das, Karthikeyan N. Ramamurthy, Xue Lin (30 Apr 2020) [AAML]
29. Hijacking Malaria Simulators with Probabilistic Programming
   Bradley Gram-Hansen, Christian Schroeder de Witt, Tom Rainforth, Philip Torr, Yee Whye Teh, A. G. Baydin (29 May 2019)
30. Attention Is All You Need
   Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, Illia Polosukhin (12 Jun 2017) [3DV]