
The Crucial Role of Samplers in Online Direct Preference Optimization

Abstract

Direct Preference Optimization (DPO) has emerged as a stable, scalable, and efficient solution for language model alignment. Despite its empirical success, its optimization properties, particularly the impact of samplers on its convergence rates, remain under-explored. In this paper, we provide a rigorous analysis of DPO's convergence rates with different sampling strategies under the exact gradient setting, revealing a surprising separation: uniform sampling achieves linear convergence, while our proposed online sampler achieves quadratic convergence. We further adapt the sampler to practical settings by incorporating posterior distributions and logit mixing, demonstrating improvements over previous methods. For example, it outperforms vanilla DPO by over 7.4% on the Safe-RLHF dataset. Our results not only offer insights into the theoretical understanding of DPO but also pave the way for further algorithm designs.
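For context, the standard DPO objective that the sampler analysis builds on is sketched below (notation is ours, following Rafailov et al., 2023, and not taken from this abstract): $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ the reference policy, $\beta$ the KL-regularization strength, and $\sigma$ the logistic function. In the online setting, the sampler governs the distribution from which the response pair $(y_w, y_l)$ is drawn for each prompt $x$, which is the quantity the paper varies.

% Standard DPO objective; the sampler controls how (y_w, y_l) are drawn online.
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta)
  = -\,\mathbb{E}_{x,\;(y_w, y_l)\,\sim\,\text{sampler}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]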

@article{shi2025_2409.19605,
  title={The Crucial Role of Samplers in Online Direct Preference Optimization},
  author={Ruizhe Shi and Runlong Zhou and Simon S. Du},
  journal={arXiv preprint arXiv:2409.19605},
  year={2025}
}