Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs

4 July 2024

Papers citing "Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs"


AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F. Brown, Francis Rhys Ward
ELM | 45 23 0 | 11 Jun 2024

Poisoning Language Models During Instruction Tuning
Alexander Wan, Eric Wallace, Sheng Shen, Dan Klein
SILM | 92 124 0 | 01 May 2023

Probing Classifiers: Promises, Shortcomings, and Advances
Yonatan Belinkov
226 405 0 | 24 Feb 2021