ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2402.11746
  4. Cited By
Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned
  Language Models through Task Arithmetic

Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic

19 February 2024
Rishabh Bhardwaj
Do Duc Anh
Soujanya Poria
    MoMe
ArXivPDFHTML

Papers citing "Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic"

19 / 19 papers shown
Title
Chain-of-Lure: A Synthetic Narrative-Driven Approach to Compromise Large Language Models
Chain-of-Lure: A Synthetic Narrative-Driven Approach to Compromise Large Language Models
Wenhan Chang
Tianqing Zhu
Yu Zhao
Shuangyong Song
Ping Xiong
Wanlei Zhou
Yongxiang Li
63
0
0
23 May 2025
Safety Subspaces are Not Distinct: A Fine-Tuning Case Study
Safety Subspaces are Not Distinct: A Fine-Tuning Case Study
Kaustubh Ponkshe
Shaan Shah
Raghav Singhal
Praneeth Vepakomma
87
0
0
20 May 2025
Multi-Task Model Merging via Adaptive Weight Disentanglement
Multi-Task Model Merging via Adaptive Weight Disentanglement
Feng Xiong
Runxi Cheng
Wang Chen
Zhanqiu Zhang
Yiwen Guo
Chun Yuan
Ruifeng Xu
MoMe
170
6
0
10 Jan 2025
JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit
Zeqing He
Peng Kuang
Zhixuan Chu
Huiyu Xu
Rui Zheng
Kui Ren
Chun Chen
97
7
0
17 Nov 2024
SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety
SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety
Paul Röttger
Fabio Pernisi
Bertie Vidgen
Dirk Hovy
ELM
KELM
138
38
0
08 Apr 2024
OpinionGPT: Modelling Explicit Biases in Instruction-Tuned LLMs
OpinionGPT: Modelling Explicit Biases in Instruction-Tuned LLMs
Patrick Haller
Ansar Aynetdinov
Alan Akbik
63
25
0
07 Sep 2023
Universal and Transferable Adversarial Attacks on Aligned Language
  Models
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou
Zifan Wang
Nicholas Carlini
Milad Nasr
J. Zico Kolter
Matt Fredrikson
282
1,436
0
27 Jul 2023
On the Exploitability of Instruction Tuning
On the Exploitability of Instruction Tuning
Manli Shu
Jiong Wang
Chen Zhu
Jonas Geiping
Chaowei Xiao
Tom Goldstein
SILM
89
97
0
28 Jun 2023
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng
Wei-Lin Chiang
Ying Sheng
Siyuan Zhuang
Zhanghao Wu
...
Dacheng Li
Eric Xing
Haotong Zhang
Joseph E. Gonzalez
Ion Stoica
ALM
OSLM
ELM
312
4,253
0
09 Jun 2023
Patching open-vocabulary models by interpolating weights
Patching open-vocabulary models by interpolating weights
Gabriel Ilharco
Mitchell Wortsman
S. Gadre
Shuran Song
Hannaneh Hajishirzi
Simon Kornblith
Ali Farhadi
Ludwig Schmidt
VLM
KELM
84
175
0
10 Aug 2022
Fusing finetuned models for better pretraining
Fusing finetuned models for better pretraining
Leshem Choshen
Elad Venezian
Noam Slonim
Yoav Katz
FedML
AI4CE
MoMe
103
94
0
06 Apr 2022
KNOT: Knowledge Distillation using Optimal Transport for Solving NLP
  Tasks
KNOT: Knowledge Distillation using Optimal Transport for Solving NLP Tasks
Rishabh Bhardwaj
Tushar Vaidya
Soujanya Poria
OT
FedML
86
7
0
06 Oct 2021
Robust fine-tuning of zero-shot models
Robust fine-tuning of zero-shot models
Mitchell Wortsman
Gabriel Ilharco
Jong Wook Kim
Mike Li
Simon Kornblith
...
Raphael Gontijo-Lopes
Hannaneh Hajishirzi
Ali Farhadi
Hongseok Namkoong
Ludwig Schmidt
VLM
114
721
0
04 Sep 2021
Evaluating Large Language Models Trained on Code
Evaluating Large Language Models Trained on Code
Mark Chen
Jerry Tworek
Heewoo Jun
Qiming Yuan
Henrique Pondé
...
Bob McGrew
Dario Amodei
Sam McCandlish
Ilya Sutskever
Wojciech Zaremba
ELM
ALM
205
5,454
0
07 Jul 2021
BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based
  Masked Language-models
BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models
Elad Ben-Zaken
Shauli Ravfogel
Yoav Goldberg
158
1,216
0
18 Jun 2021
The Power of Scale for Parameter-Efficient Prompt Tuning
The Power of Scale for Parameter-Efficient Prompt Tuning
Brian Lester
Rami Al-Rfou
Noah Constant
VPVLM
512
4,021
0
18 Apr 2021
Deep learning versus kernel learning: an empirical study of loss
  landscape geometry and the time evolution of the Neural Tangent Kernel
Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the Neural Tangent Kernel
Stanislav Fort
Gintare Karolina Dziugaite
Mansheej Paul
Sepideh Kharaghani
Daniel M. Roy
Surya Ganguli
95
189
0
28 Oct 2020
What is being transferred in transfer learning?
What is being transferred in transfer learning?
Behnam Neyshabur
Hanie Sedghi
Chiyuan Zhang
83
519
0
26 Aug 2020
XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning
XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning
Edoardo Ponti
Goran Glavaš
Olga Majewska
Qianchu Liu
Ivan Vulić
Anna Korhonen
LRM
61
320
0
01 May 2020
1