
Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space
Papers citing "Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space"
50 / 86 papers shown
Title |
---|
SOS! Soft Prompt Attack Against Open-Source Large Language Models. Ziqing Yang, Michael Backes, Yang Zhang, Ahmed Salem |
Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations. Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, ... Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, Madian Khabsa |
Mistral 7B. Albert Q. Jiang, Alexandre Sablayrolles, A. Mensch, Chris Bamford, Devendra Singh Chaplot, ... Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed |