Distortion of AI Alignment: Does Preference Optimization Optimize for Preferences?

29 May 2025
Paul Gölz
Nika Haghtalab
Kunhe Yang
Main: 5 pages, 4 figures, 1 table
Bibliography: 1 page
Appendix: 32 pages
Abstract

After pre-training, large language models are aligned with human preferences based on pairwise comparisons. State-of-the-art alignment methods (such as PPO-based RLHF and DPO) are built on the assumption of aligning with a single preference model, despite being deployed in settings where users have diverse preferences. As a result, it is not even clear that these alignment methods produce models that satisfy users on average -- a minimal requirement for pluralistic alignment. Drawing on social choice theory and modeling users' comparisons through individual Bradley-Terry (BT) models, we introduce an alignment method's distortion: the worst-case ratio between the optimal achievable average utility and the average utility of the learned policy. The notion of distortion helps draw sharp distinctions between alignment methods: Nash Learning from Human Feedback achieves the minimax optimal distortion of $(\frac{1}{2} + o(1)) \cdot \beta$ (for the BT temperature $\beta$), robustly across utility distributions, distributions of comparison pairs, and permissible KL divergences from the reference policy. RLHF and DPO, by contrast, suffer $\geq (1 - o(1)) \cdot \beta$ distortion already without a KL constraint, and $e^{\Omega(\beta)}$ or even unbounded distortion in the full setting, depending on how comparison pairs are sampled.
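To make the quantities in the abstract concrete, the sketch below illustrates the Bradley-Terry comparison probability and the per-instance ratio underlying distortion in a simplified finite setting (users with explicit utilities over a small set of responses, no KL constraint). This is not code from the paper; the function names, the toy setting, and the omission of the reference-policy constraint are assumptions made for illustration only.

```python
import numpy as np

def bt_preference_prob(u_a, u_b, beta):
    """Bradley-Terry probability that a user with utilities u_a and u_b
    prefers response a over response b, at temperature beta."""
    return 1.0 / (1.0 + np.exp(-beta * (u_a - u_b)))

def average_utility(utilities, policy):
    """Average (over users) expected utility of a randomized policy.
    utilities: (n_users, n_responses) array; policy: length-n_responses distribution."""
    return float(np.mean(utilities @ policy))

def distortion_on_instance(utilities, learned_policy):
    """Ratio of the best achievable average utility to the learned policy's
    average utility on one instance; the paper's distortion is the worst case
    of this ratio over instances (utility profiles, comparison distributions, etc.)."""
    best_avg = utilities.mean(axis=0).max()  # optimal policy concentrates on the response with highest average utility
    return best_avg / average_utility(utilities, learned_policy)

# Toy example: two users with opposing preferences over two responses.
utilities = np.array([[1.0, 0.0],
                      [0.4, 0.6]])
uniform_policy = np.array([0.5, 0.5])
print(bt_preference_prob(1.0, 0.0, beta=2.0))             # ~0.88: user 1 usually prefers response 0
print(distortion_on_instance(utilities, uniform_policy))  # 1.4: best average utility 0.7 vs 0.5 under the uniform policy
```

In this toy instance the ratio is small; the paper's bounds concern the worst case over all such instances, where the learned policy comes from a specific alignment method (RLHF, DPO, or Nash Learning from Human Feedback) trained on BT comparisons.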

@article{gölz2025_2505.23749,
  title={Distortion of AI Alignment: Does Preference Optimization Optimize for Preferences?},
  author={Paul Gölz and Nika Haghtalab and Kunhe Yang},
  journal={arXiv preprint arXiv:2505.23749},
  year={2025}
}