BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B

31 October 2023
Pranav M. Gade, Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish
arXiv:2311.00117

Papers citing "BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B"

25 papers
RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors?
Rohan Gupta, Erik Jenner · 17 Jun 2025

From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment
Kyubyung Chae, Hyunbin Jin, Taesup Kim · 07 Jun 2025

Benchmarking Misuse Mitigation Against Covert Adversaries
Davis Brown, Mahdi Sabbaghi, Luze Sun, Alexander Robey, George Pappas, Eric Wong, Hamed Hassani · 06 Jun 2025

A Red Teaming Roadmap Towards System-Level Safety
Zifan Wang, Christina Q. Knight, Jeremy Kritz, Willow Primack, Julian Michael · 30 May 2025

Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety
Zihan Guan, Mengxuan Hu, Ronghang Zhu, Sheng Li, Anil Vullikanti · 11 May 2025

AI Companies Should Report Pre- and Post-Mitigation Safety Evaluations
Dillon Bowen, Ann-Kathrin Dombrowski, Adam Gleave, Chris Cundy · 17 Mar 2025

DarkBench: Benchmarking Dark Patterns in Large Language Models
Esben Kran, Hieu Minh "Jord" Nguyen, Akash Kundu, Sami Jawhar, Jinsuk Park, Mateusz Maria Jurewicz · 13 Mar 2025

The Pitfalls of "Security by Obscurity" And What They Mean for Transparent AI
Peter Hall, Olivia Mundahl, Sunoo Park · 30 Jan 2025

Toxicity of the Commons: Curating Open-Source Pre-Training Data
Catherine Arnett, Eliot Jones, Ivan P. Yamshchikov, Pierre-Carl Langlais · 29 Oct 2024

SMILES-Prompting: A Novel Approach to LLM Jailbreak Attacks in Chemical Synthesis
Aidan Wong, He Cao, Zijing Liu, Yu-Feng Li · 21 Oct 2024

Locking Down the Finetuned LLMs Safety
Minjun Zhu, Linyi Yang, Yifan Wei, Ningyu Zhang, Yue Zhang · 14 Oct 2024

Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates
Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, Min Lin · 09 Oct 2024

Applying Refusal-Vector Ablation to Llama 3.1 70B Agents
Simon Lermen, Mateusz Dziemian, Govind Pimpale · 08 Oct 2024

Recent Advances in Attack and Defense Approaches of Large Language Models
Jing Cui, Yishi Xu, Zhewei Huang, Shuchang Zhou, Jianbin Jiao, Junge Zhang · 05 Sep 2024

BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models
Yi Zeng, Weiyu Sun, Tran Ngoc Huynh, Dawn Song, Bo Li, Ruoxi Jia · 24 Jun 2024

Safety Alignment Should Be Made More Than Just a Few Tokens Deep
Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, Peter Henderson · 10 Jun 2024

Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses
Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, Min Lin · 03 Jun 2024

Societal Adaptation to Advanced AI
Jamie Bernardi, Gabriel Mukobi, Hilary Greaves, Lennart Heim, Markus Anderljung · 16 May 2024

The Necessity of AI Audit Standards Boards
David Manheim, Sammy Martin, Mark Bailey, Mikhail Samin, Ross Greutzmacher · 11 Apr 2024

Immunization against harmful fine-tuning attacks
Domenic Rosati, Jan Wehner, Kai Williams, Lukasz Bartoszcze, Jan Batzner, Hassan Sajjad, Frank Rudzicz · 26 Feb 2024

A Chinese Dataset for Evaluating the Safeguards in Large Language Models
Yuxia Wang, Zenan Zhai, Haonan Li, Xudong Han, Lizhi Lin, Zhenxuan Zhang, Jingru Zhao, Preslav Nakov, Timothy Baldwin · 19 Feb 2024

Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey
Zhichen Dong, Zhanhui Zhou, Chao Yang, Jing Shao, Yu Qiao · 14 Feb 2024

Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast
Xiangming Gu, Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Ye Wang, Jing Jiang, Min Lin · 13 Feb 2024

R-Judge: Benchmarking Safety Risk Awareness for LLM Agents
Tongxin Yuan, Zhiwei He, Lingzhong Dong, Yiming Wang, Ruijie Zhao, ..., Binglin Zhou, Fangqi Li, Zhuosheng Zhang, Rui Wang, Gongshen Liu · 18 Jan 2024

Hazards from Increasingly Accessible Fine-Tuning of Downloadable Foundation Models
Alan Chan, Ben Bucknall, Herbie Bradley, David M. Krueger · 22 Dec 2023