Papers citing 'Subliminal Learning: Language models transmit behavioral traits via hidden signals in data'

Title | Authors | Tags | | Date
Subliminal Corruption: Mechanisms, Thresholds, and Interpretability | Reya Vir, Sarvesh Bhatnagar | | 52 0 0 | 22 Oct 2025
Pay Attention to the Triggers: Constructing Backdoors That Survive Distillation | Giovanni De Muri, Mark Vero, Robin Staab, Martin Vechev | | 115 0 0 | 21 Oct 2025
Detecting Adversarial Fine-tuning with Auditing Agents | Sarah Egler, John Schulman, Nicholas Carlini | AAML, MLAU | 141 0 0 | 17 Oct 2025
Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time | Daniel Tan, Anders Woodruff, Niels Warncke, Arun Jose, Maxime Riché, David Demitri Africa, Mia Taylor | | 283 0 0 | 05 Oct 2025
LLM Chemistry Estimation for Multi-LLM Recommendation | H. Sánchez, Briland Hitaj | | 80 1 0 | 04 Oct 2025
Position: Privacy Is Not Just Memorization! | Niloofar Mireshghallah, Tianshi Li | PILM | 201 1 0 | 02 Oct 2025
Exploring System 1 and 2 communication for latent reasoning in LLMs | Julian Coda-Forno, Zhuokai Zhao, Qiang Zhang, Dipesh Tamboli, W. Li, Xiangjun Fan, Lizhu Zhang, Eric Schulz, Hsiao-Ping Tseng | LRM | 85 1 1 | 01 Oct 2025
Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer | Simon Schrodi, Elias Kempf, Fazl Barez, Thomas Brox | FedML | 92 0 0 | 28 Sep 2025
Regulating the Agency of LLM-based Agents | Seán Boddy, Joshua Joseph | ELM | 117 0 0 | 25 Sep 2025
Towards mitigating information leakage when evaluating safety monitors | Gerard Boxo, Aman Neelappa, Shivam Raval | AAML | 96 0 0 | 16 Sep 2025
Probing the Preferences of a Language Model: Integrating Verbal and Behavioral Tests of AI Welfare | Valen Tagliabue, Leonard Dung | | 81 1 0 | 09 Sep 2025
School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs | Mia Taylor, James Chua, Jan Betley, Johannes Treutlein, Owain Evans | | 84 5 0 | 24 Aug 2025