An Embarrassingly Simple Defense Against LLM Abliteration Attacks

25 May 2025
Harethah Shairah
Hasan Hammoud
Bernard Ghanem
G. Turkiyyah

Papers citing "An Embarrassingly Simple Defense Against LLM Abliteration Attacks"

33 / 33 papers shown
Towards Safer Pretraining: Analyzing and Filtering Harmful Content in Webscale datasets for Responsible LLMs
Sai Krishna Mendu
Harish Yenala
Aditi Gulati
Shanu Kumar
Parag Agrawal
71
1
0
04 May 2025
RoleBreak: Character Hallucination as a Jailbreak Attack in Role-Playing Systems
Yihong Tang
Bo Wang
Xu Wang
Dongming Zhao
Jing Liu
Jijun Zhang
Ruifang He
Yuexian Hou
50
2
0
25 Sep 2024
A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More
Zhichao Wang
Bin Bi
Shiva K. Pentyala
Kiran Ramnath
Sougata Chaudhuri
...
Z. Zhu
Xiang-Bo Mao
S. Asur
Na Cheng
OffRL
51
48
0
23 Jul 2024
Jailbreak Attacks and Defenses Against Large Language Models: A Survey
Sibo Yi
Yule Liu
Zhen Sun
Tianshuo Cong
Xinlei He
Jiaxing Song
Ke Xu
Qi Li
AAML
77
101
0
05 Jul 2024
Model Merging and Safety Alignment: One Bad Model Spoils the Bunch
Hasan Hammoud
Umberto Michieli
Fabio Pizzati
Philip Torr
Adel Bibi
Guohao Li
Mete Ozay
MoMe
51
18
0
20 Jun 2024
Refusal in Language Models Is Mediated by a Single Direction
Andy Arditi
Oscar Obeso
Aaquib Syed
Daniel Paleka
Nina Panickssery
Wes Gurnee
Neel Nanda
70
168
0
17 Jun 2024
Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis
Yuping Lin
Pengfei He
Han Xu
Yue Xing
Makoto Yamada
Hui Liu
Jiliang Tang
51
14
0
16 Jun 2024
Safety Alignment Should Be Made More Than Just a Few Tokens Deep
Xiangyu Qi
Ashwinee Panda
Kaifeng Lyu
Xiao Ma
Subhrajit Roy
Ahmad Beirami
Prateek Mittal
Peter Henderson
78
99
0
10 Jun 2024
SimPO: Simple Preference Optimization with a Reference-Free Reward
Yu Meng
Mengzhou Xia
Danqi Chen
88
425
0
23 May 2024
A New Era in LLM Security: Exploring Security Concerns in Real-World LLM-based Systems
Fangzhou Wu
Ning Zhang
Somesh Jha
P. McDaniel
Chaowei Xiao
52
72
0
28 Feb 2024
Immunization against harmful fine-tuning attacks
Domenic Rosati
Jan Wehner
Kai Williams
Lukasz Bartoszcze
Jan Batzner
Hassan Sajjad
Frank Rudzicz
AAML
87
18
0
26 Feb 2024
SPML: A DSL for Defending Language Models Against Prompt Attacks
Reshabh K Sharma
Vinayak Gupta
Dan Grossman
AAML
67
19
0
19 Feb 2024
Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic
Rishabh Bhardwaj
Do Duc Anh
Soujanya Poria
MoMe
66
46
0
19 Feb 2024
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao
Peiyi Wang
Qihao Zhu
Runxin Xu
Jun-Mei Song
...
Haowei Zhang
Mingchuan Zhang
Yiming Li
Yu-Huan Wu
Daya Guo
ReLM
LRM
77
953
0
05 Feb 2024
KTO: Model Alignment as Prospect Theoretic Optimization
Kawin Ethayarajh
Winnie Xu
Niklas Muennighoff
Dan Jurafsky
Douwe Kiela
199
510
0
02 Feb 2024
Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs
Zhuo Zhang
Guangyu Shen
Guanhong Tao
Shuyang Cheng
Xiangyu Zhang
66
14
0
08 Dec 2023
Safer-Instruct: Aligning Language Models with Automated Preference Data
Taiwei Shi
Kai Chen
Jieyu Zhao
ALM
SyDa
46
25
0
15 Nov 2023
A Confederacy of Models: a Comprehensive Evaluation of LLMs on Creative Writing
Carlos Gómez-Rodríguez
Paul Williams
49
79
0
12 Oct 2023
Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations
Zeming Wei
Yifei Wang
Ang Li
Yichuan Mo
Yisen Wang
68
258
0
10 Oct 2023
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
Xiangyu Qi
Yi Zeng
Tinghao Xie
Pin-Yu Chen
Ruoxi Jia
Prateek Mittal
Peter Henderson
SILM
96
571
0
05 Oct 2023
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
Alexander Robey
Eric Wong
Hamed Hassani
George J. Pappas
AAML
73
243
0
05 Oct 2023
Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM
Bochuan Cao
Yu Cao
Lu Lin
Jinghui Chen
AAML
38
145
0
18 Sep 2023
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou
Zifan Wang
Nicholas Carlini
Milad Nasr
J. Zico Kolter
Matt Fredrikson
163
1,376
0
27 Jul 2023
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron
Louis Martin
Kevin R. Stone
Peter Albert
Amjad Almahairi
...
Sharan Narang
Aurelien Rodriguez
Robert Stojnic
Sergey Edunov
Thomas Scialom
AI4MH
ALM
206
11,636
0
18 Jul 2023
BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset
Jiaming Ji
Mickel Liu
Juntao Dai
Xuehai Pan
Chi Zhang
Ce Bian
Chi Zhang
Ruiyang Sun
Yizhou Wang
Yaodong Yang
ALM
64
460
0
10 Jul 2023
Jailbroken: How Does LLM Safety Training Fail?
Alexander Wei
Nika Haghtalab
Jacob Steinhardt
158
928
0
05 Jul 2023
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng
Wei-Lin Chiang
Ying Sheng
Siyuan Zhuang
Zhanghao Wu
...
Dacheng Li
Eric Xing
Haotong Zhang
Joseph E. Gonzalez
Ion Stoica
ALM
OSLM
ELM
236
4,186
0
09 Jun 2023
Prompt Injection attack against LLM-integrated Applications
Yi Liu
Gelei Deng
Yuekang Li
Kailong Wang
Zihao Wang
...
Tianwei Zhang
Yepang Liu
Haoyu Wang
Yanhong Zheng
Yang Liu
SILM
82
341
0
08 Jun 2023
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafael Rafailov
Archit Sharma
E. Mitchell
Stefano Ermon
Christopher D. Manning
Chelsea Finn
ALM
276
3,712
0
29 May 2023
Training Socially Aligned Language Models on Simulated Social Interactions
Ruibo Liu
Ruixin Yang
Chenyan Jia
Ge Zhang
Denny Zhou
Andrew M. Dai
Diyi Yang
Soroush Vosoughi
ALM
49
48
0
26 May 2023
LIMA: Less Is More for Alignment
Chunting Zhou
Pengfei Liu
Puxin Xu
Srini Iyer
Jiao Sun
...
Susan Zhang
Gargi Ghosh
M. Lewis
Luke Zettlemoyer
Omer Levy
ALM
71
819
0
18 May 2023
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
694
12,525
0
04 Mar 2022
Measuring Massive Multitask Language Understanding
Dan Hendrycks
Collin Burns
Steven Basart
Andy Zou
Mantas Mazeika
D. Song
Jacob Steinhardt
ELM
RALM
137
4,222
0
07 Sep 2020