Robust Preference Optimization via Dynamic Target Margins

4 June 2025
Jie Sun
Junkang Wu
Jiancan Wu
Zhibo Zhu
Xingyu Lu
Jun Zhou
Lintao Ma
Xiang Wang
Main: 9 pages · 6 figures · 11 tables · Bibliography: 4 pages · Appendix: 5 pages
Abstract

The alignment of Large Language Models (LLMs) is crucial for ensuring their safety and reliability in practical applications. Direct Preference Optimization (DPO) has emerged as an efficient method that directly optimizes models using preference pairs, significantly reducing resource demands. However, the effectiveness of DPO depends heavily on data quality, which is frequently compromised by noise. In this work, we propose γ-PO, a dynamic target margin preference optimization algorithm that adjusts reward margins at the pairwise level. By introducing instance-specific margin calibration, γ-PO strategically prioritizes high-confidence pairs (those demonstrating higher reward margins) while suppressing potential noise from ambiguous pairs. Moreover, γ-PO is a plug-and-play method, compatible with variants of DPO that rely on the reward margin between preference pairs. Across benchmarks such as AlpacaEval2 and Arena-Hard, γ-PO achieves an average 4.4% improvement over other baselines, setting a new state of the art. Additionally, γ-PO requires minimal code changes and has a negligible impact on training efficiency, making it a robust solution for enhancing LLM alignment. Our code is available at this https URL.
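To make the idea of a pairwise dynamic target margin concrete, below is a minimal illustrative sketch in PyTorch, not the authors' exact γ-PO formulation. It starts from the standard DPO implicit reward margin and subtracts a per-pair target margin derived from the batch; the specific softmax-based calibration, the `gamma_scale` knob, and the function name are assumptions made for illustration only.

```python
import torch
import torch.nn.functional as F


def dpo_loss_with_dynamic_margin(
    policy_chosen_logps: torch.Tensor,    # log p_theta(y_w | x), shape (B,)
    policy_rejected_logps: torch.Tensor,  # log p_theta(y_l | x), shape (B,)
    ref_chosen_logps: torch.Tensor,       # reference-model log-probs, shape (B,)
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,
    gamma_scale: float = 1.0,             # hypothetical knob for margin strength
):
    """DPO-style loss with an instance-specific (dynamic) target margin.

    Standard DPO minimizes -log sigmoid(beta * reward_margin). Here each
    preference pair additionally receives its own target margin gamma_i,
    so pairs with larger implicit reward margins (treated as higher
    confidence) are pushed harder, while ambiguous pairs, which are more
    likely to be noisy, contribute a weaker gradient.
    """
    # Implicit reward margins, as in DPO.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    margins = chosen_rewards - rejected_rewards  # shape (B,)

    # One simple (assumed) calibration: allocate larger target margins to
    # pairs whose current reward margin is already large. Detached so the
    # target itself is not back-propagated through.
    with torch.no_grad():
        gamma = gamma_scale * torch.softmax(margins, dim=0) * margins.numel()

    # Pairwise loss with the dynamic target margin subtracted.
    loss = -F.logsigmoid(margins - gamma)
    return loss.mean(), margins.detach()


if __name__ == "__main__":
    # Toy usage with random log-probabilities for a batch of 4 pairs.
    B = 4
    loss, margins = dpo_loss_with_dynamic_margin(
        policy_chosen_logps=torch.randn(B),
        policy_rejected_logps=torch.randn(B),
        ref_chosen_logps=torch.randn(B),
        ref_rejected_logps=torch.randn(B),
    )
    print(loss.item(), margins)
```

Because the only change relative to a standard DPO implementation is the per-pair `gamma` term inside the sigmoid, the same pattern could be dropped into other margin-based DPO variants with minimal code changes, consistent with the plug-and-play claim in the abstract.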

@article{sun2025_2506.03690,
  title={Robust Preference Optimization via Dynamic Target Margins},
  author={Jie Sun and Junkang Wu and Jiancan Wu and Zhibo Zhu and Xingyu Lu and Jun Zhou and Lintao Ma and Xiang Wang},
  journal={arXiv preprint arXiv:2506.03690},
  year={2025}
}