ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2406.18118
  4. Cited By
SafeAligner: Safety Alignment against Jailbreak Attacks via Response
  Disparity Guidance
v1v2 (latest)

SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance

26 June 2024
Caishuang Huang
Wanxu Zhao
Rui Zheng
Huijie Lv
Shihan Dou
Sixian Li
Xiao Wang
Enyu Zhou
Junjie Ye
Yuming Yang
Tao Gui
Qi Zhang
Xuanjing Huang
    LLMSVAAML
ArXiv (abs)PDFHTML

Papers citing "SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance"

15 / 15 papers shown
Title
Bridging Today and the Future of Humanity: AI Safety in 2024 and Beyond
Bridging Today and the Future of Humanity: AI Safety in 2024 and Beyond
Shanshan Han
161
1
0
09 Oct 2024
Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level
Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level
Xinyi Zeng
Yuying Shang
Yutao Zhu
Jingyuan Zhang
Yu Tian
AAML
482
4
0
09 Oct 2024
InferAligner: Inference-Time Alignment for Harmlessness through
  Cross-Model Guidance
InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance
Pengyu Wang
Dong Zhang
Linyang Li
Chenkun Tan
Xinghao Wang
Ke Ren
Botian Jiang
Xipeng Qiu
LLMSV
91
49
0
20 Jan 2024
How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to
  Challenge AI Safety by Humanizing LLMs
How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
Yi Zeng
Hongpeng Lin
Jingwen Zhang
Diyi Yang
Ruoxi Jia
Weiyan Shi
95
317
0
12 Jan 2024
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
Alexander Robey
Eric Wong
Hamed Hassani
George J. Pappas
AAML
126
257
0
05 Oct 2023
Qwen Technical Report
Qwen Technical Report
Jinze Bai
Shuai Bai
Yunfei Chu
Zeyu Cui
Kai Dang
...
Zhenru Zhang
Chang Zhou
Jingren Zhou
Xiaohuan Zhou
Tianhang Zhu
OSLM
268
1,908
0
28 Sep 2023
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated
  Jailbreak Prompts
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
Jiahao Yu
Xingwei Lin
Zheng Yu
Xinyu Xing
SILM
217
352
0
19 Sep 2023
Universal and Transferable Adversarial Attacks on Aligned Language
  Models
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou
Zifan Wang
Nicholas Carlini
Milad Nasr
J. Zico Kolter
Matt Fredrikson
295
1,518
0
27 Jul 2023
MasterKey: Automated Jailbreak Across Multiple Large Language Model
  Chatbots
MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots
Gelei Deng
Yi Liu
Yuekang Li
Kailong Wang
Ying Zhang
Zefeng Li
Haoyu Wang
Tianwei Zhang
Yang Liu
SILM
87
134
0
16 Jul 2023
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng
Wei-Lin Chiang
Ying Sheng
Siyuan Zhuang
Zhanghao Wu
...
Dacheng Li
Eric Xing
Haotong Zhang
Joseph E. Gonzalez
Ion Stoica
ALMOSLMELM
458
4,444
0
09 Jun 2023
Direct Preference Optimization: Your Language Model is Secretly a Reward
  Model
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafael Rafailov
Archit Sharma
E. Mitchell
Stefano Ermon
Christopher D. Manning
Chelsea Finn
ALM
389
4,169
0
29 May 2023
GPT-4 Technical Report
GPT-4 Technical Report
OpenAI OpenAI
OpenAI Josh Achiam
Steven Adler
Sandhini Agarwal
Lama Ahmad
...
Shengjia Zhao
Tianhao Zheng
Juntang Zhuang
William Zhuk
Barret Zoph
LLMAGMLLM
1.5K
14,761
0
15 Mar 2023
LoRA: Low-Rank Adaptation of Large Language Models
LoRA: Low-Rank Adaptation of Large Language Models
J. E. Hu
Yelong Shen
Phillip Wallis
Zeyuan Allen-Zhu
Yuanzhi Li
Shean Wang
Lu Wang
Weizhu Chen
OffRLAI4TSAI4CEALMAIMat
502
10,526
0
17 Jun 2021
Hierarchical Neural Story Generation
Hierarchical Neural Story Generation
Angela Fan
M. Lewis
Yann N. Dauphin
DiffM
183
1,628
0
13 May 2018
Proximal Policy Optimization Algorithms
Proximal Policy Optimization Algorithms
John Schulman
Filip Wolski
Prafulla Dhariwal
Alec Radford
Oleg Klimov
OffRL
580
19,315
0
20 Jul 2017
1