More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment

3 April 2025

Papers citing "More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment"


Title
Universal and Transferable Adversarial Attacks on Aligned Language Models; Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson; 287; 1,449; 0; 27 Jul 2023
Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models; Erfan Shayegani, Yue Dong, Nael B. Abu-Ghazaleh; 85; 145; 0; 26 Jul 2023
Proximal Policy Optimization Algorithms; John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov; OffRL; 463; 19,006; 0; 20 Jul 2017