Theoretical guarantees on the best-of-n alignment policy

3 January 2024
Ahmad Beirami
Alekh Agarwal
Jonathan Berant
Alex D'Amour
Jacob Eisenstein
Chirag Nagpal
Ananda Theertha Suresh
arXiv:2401.01879
Abstract

A simple and effective method for the inference-time alignment of generative models is the best-of-$n$ policy, where $n$ samples are drawn from a reference policy, ranked according to a reward function, and the highest-ranking one is selected. A commonly used analytical expression in the literature claims that the KL divergence between the best-of-$n$ policy and the reference policy is equal to $\log(n) - (n-1)/n$. We disprove this claim and show that it is instead an upper bound on the actual KL divergence. We explore the tightness of this upper bound in different regimes, propose a new estimator for the KL divergence, and empirically show that it provides a tight approximation. We also show that the win rate of the best-of-$n$ policy against the reference policy is upper bounded by $n/(n+1)$, and derive bounds on the tightness of this characterization. We conclude by analyzing the tradeoffs between the win rate and KL divergence of the best-of-$n$ alignment policy, which demonstrate that very good tradeoffs are achievable with $n < 1000$.
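
To make the KL comparison concrete, here is a minimal Python sketch (not the paper's proposed estimator): for a small hypothetical discrete reference policy with distinct rewards, it computes the exact best-of-$n$ distribution and its KL divergence to the reference, then compares against the analytical upper bound $\log(n) - (n-1)/n$ quoted in the abstract. The probabilities, outcome ordering, and helper names below are illustrative assumptions, not taken from the paper.

import numpy as np

def best_of_n_distribution(p, n):
    # p: reference probabilities, assumed already sorted by increasing reward,
    # with all rewards distinct (ties would need a tie-breaking rule not modeled here).
    cdf = np.cumsum(p)
    cdf_prev = np.concatenate(([0.0], cdf[:-1]))
    # Outcome i is selected exactly when the highest reward among n i.i.d. draws lands on i.
    return cdf ** n - cdf_prev ** n

def kl_divergence(q, p):
    # KL(q || p) over a shared finite support, skipping zero-probability outcomes of q.
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

p = np.array([0.2, 0.3, 0.1, 0.4])  # hypothetical reference policy, reward-sorted
for n in (2, 4, 16, 64):
    q = best_of_n_distribution(p, n)
    bound = np.log(n) - (n - 1) / n
    print(f"n={n:3d}  KL={kl_divergence(q, p):.4f}  log(n)-(n-1)/n={bound:.4f}")

On a discrete support like this toy example, the computed KL stays strictly below the analytical expression, consistent with the abstract's claim that the formula is an upper bound rather than an equality.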
