Subliminal Learning: Language models transmit behavioral traits via hidden signals in data

20 July 2025

Anna Sztyber-Betley

Papers citing "Subliminal Learning: Language models transmit behavioral traits via hidden signals in data"

Title | Authors | Tags | Counts | Date
Subliminal Corruption: Mechanisms, Thresholds, and Interpretability | Reya Vir, Sarvesh Bhatnagar | - | 60 0 0 | 22 Oct 2025
Pay Attention to the Triggers: Constructing Backdoors That Survive Distillation | Giovanni De Muri, Mark Vero, Robin Staab, Martin Vechev | - | 115 0 0 | 21 Oct 2025
Detecting Adversarial Fine-tuning with Auditing Agents | Sarah Egler, John Schulman, Nicholas Carlini | AAML, MLAU | 145 0 0 | 17 Oct 2025
Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time | Daniel Tan, Anders Woodruff, Niels Warncke, Arun Jose, Maxime Riché, David Demitri Africa, Mia Taylor | - | 287 0 0 | 05 Oct 2025
LLM Chemistry Estimation for Multi-LLM Recommendation | H. Sánchez, Briland Hitaj | - | 84 1 0 | 04 Oct 2025
Position: Privacy Is Not Just Memorization! | Niloofar Mireshghallah, Tianshi Li | PILM | 205 1 0 | 02 Oct 2025
Exploring System 1 and 2 communication for latent reasoning in LLMs | Julian Coda-Forno, Zhuokai Zhao, Qiang Zhang, Dipesh Tamboli, W. Li, Xiangjun Fan, Lizhu Zhang, Eric Schulz, Hsiao-Ping Tseng | LRM | 85 1 1 | 01 Oct 2025
Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer | Simon Schrodi, Elias Kempf, Fazl Barez, Thomas Brox | FedML | 92 0 0 | 28 Sep 2025
Regulating the Agency of LLM-based Agents | Seán Boddy, Joshua Joseph | ELM | 121 0 0 | 25 Sep 2025
Towards mitigating information leakage when evaluating safety monitors | Gerard Boxo, Aman Neelappa, Shivam Raval | AAML | 100 0 0 | 16 Sep 2025
Probing the Preferences of a Language Model: Integrating Verbal and Behavioral Tests of AI Welfare | Valen Tagliabue, Leonard Dung | - | 81 1 0 | 09 Sep 2025
School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs | Mia Taylor, James Chua, Jan Betley, Johannes Treutlein, Owain Evans | - | 84 5 0 | 24 Aug 2025