Reinforcement Learning from Human Feedback with Active Queries

14 February 2024
Kaixuan Ji, Jiafan He, Quanquan Gu
Abstract

Aligning large language models (LLMs) with human preferences plays a key role in building modern generative models and can be achieved by reinforcement learning from human feedback (RLHF). Despite their superior performance, current RLHF approaches often require a large amount of human-labelled preference data, which is expensive to collect. In this paper, inspired by the success of active learning, we address this problem by proposing query-efficient RLHF methods. We first formalize the alignment problem as a contextual dueling bandit problem and design an active-query-based proximal policy optimization (APPO) algorithm with an $\tilde{O}(d^2/\Delta)$ instance-dependent regret bound and an $\tilde{O}(d^2/\Delta^2)$ query complexity, where $d$ is the dimension of the feature space and $\Delta$ is the sub-optimality gap over all contexts. We then propose ADPO, a practical version of our algorithm based on direct preference optimization (DPO), and apply it to fine-tuning LLMs. Our experiments show that ADPO, while making only about half the number of queries for human preference, matches the performance of the state-of-the-art DPO method.
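The abstract does not spell out ADPO's query-selection rule, so the sketch below is only an illustration of the general active-query idea it describes: compute the DPO implicit reward margin for each response pair and ask a human annotator for a preference label only when the current policy is uncertain about which response wins. The function names (select_queries, dpo_loss_on_queried), the margin-threshold criterion, and the hyperparameters beta and margin_threshold are illustrative assumptions, not the paper's actual method.

import torch
import torch.nn.functional as F

def dpo_implicit_margin(policy_logps, ref_logps):
    """Implicit reward margin between the two responses of a pair,
    computed from sequence log-probabilities under the policy and a
    frozen reference model (the standard DPO quantities).
    policy_logps / ref_logps: tensors of shape (batch, 2) holding
    log p(response | prompt) for the two candidate responses."""
    policy_diff = policy_logps[:, 0] - policy_logps[:, 1]
    ref_diff = ref_logps[:, 0] - ref_logps[:, 1]
    return policy_diff - ref_diff  # proportional to the implicit reward gap

def select_queries(policy_logps, ref_logps, beta=0.1, margin_threshold=1.0):
    """Hypothetical active-query rule: query the annotator only for pairs
    where the model's implicit reward margin is small, i.e. where the
    current policy is uncertain about the preference."""
    margin = beta * dpo_implicit_margin(policy_logps, ref_logps)
    return margin.abs() < margin_threshold  # True -> send pair for labelling

def dpo_loss_on_queried(policy_logps, ref_logps, human_prefers_first, beta=0.1):
    """Standard DPO loss, applied only to the pairs that were actually
    queried. human_prefers_first is a boolean tensor for that subset."""
    margin = beta * dpo_implicit_margin(policy_logps, ref_logps)
    # Flip the sign of the margin when the second response is preferred.
    signed = torch.where(human_prefers_first, margin, -margin)
    return -F.logsigmoid(signed).mean()

Under this reading, margin_threshold controls the trade-off between annotation cost and alignment quality; the paper's experiments suggest that roughly half the usual number of queries suffices to match standard DPO.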

@article{ji2025_2402.09401,
  title={Reinforcement Learning from Human Feedback with Active Queries},
  author={Kaixuan Ji and Jiafan He and Quanquan Gu},
  journal={arXiv preprint arXiv:2402.09401},
  year={2025}
}