arXiv:2505.18588
Safety Alignment via Constrained Knowledge Unlearning
24 May 2025
Zesheng Shi, Yucheng Zhou, Jing Li
Tags: MU, KELM, AAML
Papers citing "Safety Alignment via Constrained Knowledge Unlearning" (44 / 44 papers shown)

Adaptive Detoxification: Safeguarding General Capabilities of LLMs through Toxicity-Aware Knowledge Editing
Yifan Lu, Jing Li, Yigeng Zhou, Yihui Zhang, Wenya Wang, Xiucheng Li, Meishan Zhang, Fangming Liu, Jun-chen Yu, Min Zhang
Tags: KELM, CLL · 1 citation · 28 May 2025

Multi-objective Large Language Model Alignment with Hierarchical Experts
Zhuo Li, Guodong Du, Weiyang Guo, Yigeng Zhou, Xiucheng Li, ..., Fangming Liu, Yequan Wang, Deheng Ye, Min Zhang, Jing Li
Tags: ALM, MoE · 0 citations · 27 May 2025

ReaderLM-v2: Small Language Model for HTML to Markdown and JSON
Feng Wang, Zesheng Shi, Bo Wang, Nan Wang, Han Xiao
Tags: RALM · 3 citations · 03 Mar 2025

Knowledge Editing with Dynamic Knowledge Graphs for Multi-Hop Question Answering
Yaojie Lu, Yimiao Zhou, Junlin Li, Yanjie Wang, Xuebo Liu, Daojing He, Fengyuan Liu, Min Zhang
Tags: KELM · 3 citations · 18 Dec 2024

Jailbreak Attacks and Defenses Against Large Language Models: A Survey
Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, Qi Li
Tags: AAML · 101 citations · 05 Jul 2024

Knowledge Fusion By Evolving Weights of Language Models
Guodong Du, Yiyao Cao, Hanting Liu, Runhua Jiang, Shuyang Yu, Yifei Guo, Sim Kuan Goh, Jing Li
Tags: MoMe · 15 citations · 18 Jun 2024

Improving Alignment and Robustness with Circuit Breakers
Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, Zico Kolter, Matt Fredrikson, Dan Hendrycks
Tags: AAML · 97 citations · 06 Jun 2024

Multimodal Reasoning with Multimodal Knowledge Graph
Junlin Lee, Yequan Wang, Jing Li, Min Zhang
21 citations · 04 Jun 2024

Improved Techniques for Optimization-Based Jailbreaking on Large Language Models
Xiaojun Jia, Tianyu Pang, Chao Du, Yihao Huang, Jindong Gu, Yang Liu, Xiaochun Cao, Min Lin
Tags: AAML · 34 citations · 31 May 2024

Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge
Weikai Lu, Jianwei Wang, Zhengdong Lu, Zelin Chen, Huiping Zhuang, Cen Chen
Tags: MU, AAML, KELM · 28 citations · 08 Apr 2024

User Modeling Challenges in Interactive AI Assistant Systems
Megan Su, Yuwei Bao
2 citations · 29 Mar 2024

Improving the Robustness of Large Language Models via Consistency Alignment
Yukun Zhao, Lingyong Yan, Weiwei Sun, Guoliang Xing, Shuaiqiang Wang, Chong Meng, Zhicong Cheng, Zhaochun Ren, Dawei Yin
21 citations · 21 Mar 2024

Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, Peter Henderson
Tags: AAML · 96 citations · 07 Feb 2024

How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, Weiyan Shi
284 citations · 12 Jan 2024

Understanding the Effect of Model Compression on Social Bias in Large Language Models
Gustavo Gonçalves, Emma Strubell
11 citations · 09 Dec 2023

Fake Alignment: Are LLMs Really Aligned Well?
Yixu Wang, Yan Teng, Kexin Huang, Chengqi Lyu, Songyang Zhang, Wenwei Zhang, Xingjun Ma, Yu-Gang Jiang, Yu Qiao, Yingchun Wang
21 citations · 10 Nov 2023

Unlearn What You Want to Forget: Efficient Unlearning for LLMs
Jiaao Chen, Diyi Yang
Tags: MU · 146 citations · 31 Oct 2023

Safe RLHF: Safe Reinforcement Learning from Human Feedback
Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, Yaodong Yang
330 citations · 19 Oct 2023

Explore-Instruct: Enhancing Domain-Specific Instruction Coverage through Active Exploration
Fanqi Wan, Xinting Huang, Tao Yang, Xiaojun Quan, Wei Bi, Shuming Shi
Tags: ALM · 19 citations · 13 Oct 2023

Jailbreaking Black Box Large Language Models in Twenty Queries
Patrick Chao, Alexander Robey, Yan Sun, Hamed Hassani, George J. Pappas, Eric Wong
Tags: AAML · 642 citations · 12 Oct 2023

Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, Danqi Chen
Tags: AAML · 293 citations · 10 Oct 2023

Understanding the Effects of RLHF on LLM Generalisation and Diversity
Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, Edward Grefenstette, Roberta Raileanu
Tags: AI4CE, ALM · 135 citations · 10 Oct 2023

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, Peter Henderson
Tags: SILM · 571 citations · 05 Oct 2023

Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models
Erfan Shayegani, Yue Dong, Nael B. Abu-Ghazaleh
137 citations · 26 Jul 2023

BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset
Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Ruiyang Sun, Yizhou Wang, Yaodong Yang
Tags: ALM · 460 citations · 10 Jul 2023

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, ..., Dacheng Li, Eric Xing, Haotong Zhang, Joseph E. Gonzalez, Ion Stoica
Tags: ALM, OSLM, ELM · 4,186 citations · 09 Jun 2023

Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn
Tags: ALM · 3,712 citations · 29 May 2023

Topic-Selective Graph Network for Topic-Focused Summarization
Zesheng Shi, Yucheng Zhou
4 citations · 25 Feb 2023

Debiasing isn't enough! -- On the Effectiveness of Debiasing MLMs and their Social Biases in Downstream Tasks
Masahiro Kaneko, Danushka Bollegala, Naoaki Okazaki
43 citations · 06 Oct 2022

Knowledge Unlearning for Mitigating Privacy Risks in Language Models
Joel Jang, Dongkeun Yoon, Sohee Yang, Sungmin Cha, Moontae Lee, Lajanugen Logeswaran, Minjoon Seo
Tags: KELM, PILM, MU · 206 citations · 04 Oct 2022

A Holistic Approach to Undesired Content Detection in the Real World
Todor Markov, Chong Zhang, Sandhini Agarwal, Tyna Eloundou, Teddy Lee, Steven Adler, Angela Jiang, Lilian Weng
228 citations · 05 Aug 2022

Quark: Controllable Text Generation with Reinforced Unlearning
Ximing Lu, Sean Welleck, Jack Hessel, Liwei Jiang, Lianhui Qin, Peter West, Prithviraj Ammanabrolu, Yejin Choi
Tags: MU · 213 citations · 26 May 2022

Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe
Tags: OSLM, ALM · 12,525 citations · 04 Mar 2022

A Simple but Effective Bidirectional Framework for Relational Triple Extraction
Feiliang Ren, Longhui Zhang, Xiaofeng Zhao, Shujuan Yin, Shilei Liu, Bochao Li
54 citations · 09 Dec 2021

A Novel Global Feature-Oriented Relational Triple Extraction Model based on Table Filling
Feiliang Ren, Longhui Zhang, Shujuan Yin, Xiaofeng Zhao, Shilei Liu, Bochao Li, Yaduo Liu
73 citations · 14 Sep 2021

Finetuned Language Models Are Zero-Shot Learners
Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, Quoc V. Le
Tags: ALM, UQCV · 3,678 citations · 03 Sep 2021

Transformer Feed-Forward Layers Are Key-Value Memories
Mor Geva, Roei Schuster, Jonathan Berant, Omer Levy
Tags: KELM · 792 citations · 29 Dec 2020

Machine Unlearning
Lucas Bourtoule, Varun Chandrasekaran, Christopher A. Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, Nicolas Papernot
Tags: MU · 830 citations · 09 Dec 2019

Are Sixteen Heads Really Better than One?
Paul Michel, Omer Levy, Graham Neubig
Tags: MoE · 1,051 citations · 25 May 2019

HellaSwag: Can a Machine Really Finish Your Sentence?
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, Yejin Choi
2,373 citations · 19 May 2019

CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge
Alon Talmor, Jonathan Herzig, Nicholas Lourie, Jonathan Berant
Tags: RALM · 1,677 citations · 02 Nov 2018

SNIP: Single-shot Network Pruning based on Connection Sensitivity
Namhoon Lee, Thalaiyasingam Ajanthan, Philip Torr
Tags: VLM · 1,190 citations · 04 Oct 2018

Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering
Todor Mihaylov, Peter Clark, Tushar Khot, Ashish Sabharwal
1,475 citations · 08 Sep 2018

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman
Tags: ELM · 7,080 citations · 20 Apr 2018