Make Them Spill the Beans! Coercive Knowledge Extraction from
(Production) LLMs

Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs

8 December 2023

Papers citing "Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs"

12 / 12 papers shown

Title
Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates Kaifeng Lyu Haoyu Zhao Xinran Gu Dingli Yu Anirudh Goyal Sanjeev Arora ALM 82 44 0 20 Jan 2025
JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models Delong Ran Jinyuan Liu Yichen Gong Jingyi Zheng Xinlei He Tianshuo Cong Anyu Wang ELM 47 10 0 13 Jun 2024
The Mosaic Memory of Large Language Models Igor Shilov Matthieu Meeus Yves-Alexandre de Montjoye 47 3 0 24 May 2024
Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack M. Russinovich Ahmed Salem Ronen Eldan 56 77 0 02 Apr 2024
Alpaca against Vicuna: Using LLMs to Uncover Memorization of LLMs Aly M. Kassem Omar Mahmoud Niloofar Mireshghallah Hyunwoo J. Kim Yulia Tsvetkov Yejin Choi Sherif Saad Santu Rana 50 18 0 05 Mar 2024
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts Jiahao Yu Xingwei Lin Zheng Yu Xinyu Xing SILM 117 301 0 19 Sep 2023
Improving alignment of dialogue agents via targeted human judgements Amelia Glaese Nat McAleese Maja Trkebacz John Aslanides Vlad Firoiu ... John F. J. Mellor Demis Hassabis Koray Kavukcuoglu Lisa Anne Hendricks G. Irving ALM AAML 227 502 0 28 Sep 2022
Large Language Models are Zero-Shot Reasoners Takeshi Kojima S. Gu Machel Reid Yutaka Matsuo Yusuke Iwasawa ReLM LRM 328 4,077 0 24 May 2022
Training language models to follow instructions with human feedback Long Ouyang Jeff Wu Xu Jiang Diogo Almeida Carroll L. Wainwright ... Amanda Askell Peter Welinder Paul Christiano Jan Leike Ryan J. Lowe OSLM ALM 325 11,953 0 04 Mar 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models Jason W. Wei Xuezhi Wang Dale Schuurmans Maarten Bosma Brian Ichter F. Xia Ed H. Chi Quoc Le Denny Zhou LM&Ro LRM AI4CE ReLM 389 8,495 0 28 Jan 2022
Gradient-based Adversarial Attacks against Text Transformers Chuan Guo Alexandre Sablayrolles Hervé Jégou Douwe Kiela SILM 106 227 0 15 Apr 2021
A Decomposable Attention Model for Natural Language Inference Ankur P. Parikh Oscar Täckström Dipanjan Das Jakob Uszkoreit 213 1,367 0 06 Jun 2016