The Poison of Alignment

25 August 2023

Papers citing "The Poison of Alignment"

Title | Authors | Tags | Counts | Date
Narrative-of-Thought: Improving Temporal Reasoning of Large Language Models via Recounted Narratives | Xinliang Frederick Zhang, Nick Beauchamp, Lu Wang | LRM, AI4CE | 27 3 0 | 07 Oct 2024
PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing | Blazej Manczak, Eliott Zemour, Eric Lin, Vaikkunth Mugunthan | - | 26 2 0 | 23 Jul 2024
Would I Lie To You? Inference Time Alignment of Language Models using Direct Preference Heads | Avelina Asada Hadji-Kyriacou, Ognjen Arandjelović | - | 20 1 0 | 30 May 2024
Constructing Benchmarks and Interventions for Combating Hallucinations in LLMs | Adi Simhi, Jonathan Herzig, Idan Szpektor, Yonatan Belinkov | HILM | 48 10 0 | 15 Apr 2024
Nevermind: Instruction Override and Moderation in Large Language Models | Edward Kim | ALM | 18 0 0 | 05 Feb 2024
Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models | Xianjun Yang, Xiao Wang, Qi Zhang, Linda R. Petzold, William Yang Wang, Xun Zhao, Dahua Lin | - | 23 161 0 | 04 Oct 2023
Poisoning Language Models During Instruction Tuning | Alexander Wan, Eric Wallace, Sheng Shen, Dan Klein | SILM | 92 124 0 | 01 May 2023
Large Language Model Instruction Following: A Survey of Progresses and Challenges | Renze Lou, Kai Zhang, Wenpeng Yin | ALM, LRM | 29 20 0 | 18 Mar 2023
Training language models to follow instructions with human feedback | Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe | OSLM, ALM | 313 11,953 0 | 04 Mar 2022
Deduplicating Training Data Makes Language Models Better | Katherine Lee, Daphne Ippolito, A. Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, Nicholas Carlini | SyDa | 242 593 0 | 14 Jul 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling | Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, ..., Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy | AIMat | 256 1,996 0 | 31 Dec 2020
Scaling Laws for Neural Language Models | Jared Kaplan, Sam McCandlish, T. Henighan, Tom B. Brown, B. Chess, R. Child, Scott Gray, Alec Radford, Jeff Wu, Dario Amodei | - | 246 4,489 0 | 23 Jan 2020