
Reinforcement Learning from Human Feedback with Active Queries

Main: 12 pages · Appendix: 20 pages · Bibliography: 2 pages · 3 figures · 7 tables
Abstract

Aligning large language models (LLMs) with human preferences plays a key role in building modern generative models and can be achieved by reinforcement learning from human feedback (RLHF). Despite their superior performance, current RLHF approaches often require a large amount of human-labelled preference data, which is expensive to collect. In this paper, inspired by the success of active learning, we address this problem by proposing query-efficient RLHF methods. We first formalize the alignment problem as a contextual dueling bandit problem and design an active-query-based proximal policy optimization (APPO) algorithm with an $\tilde{O}(d^2/\Delta)$ instance-dependent regret bound and an $\tilde{O}(d^2/\Delta^2)$ query complexity, where $d$ is the dimension of the feature space and $\Delta$ is the sub-optimality gap over all contexts. We then propose ADPO, a practical version of our algorithm based on direct preference optimization (DPO), and apply it to fine-tuning LLMs. Our experiments show that ADPO, while making only about half as many queries for human preference, matches the performance of the state-of-the-art DPO method.
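To make the active-query idea concrete, the sketch below shows one plausible selection rule in the spirit of ADPO: request a human preference label only for response pairs where the policy's DPO-style implicit reward margin is small (i.e., the model is still uncertain which response is better). This is a minimal illustrative sketch, not the paper's implementation; the function names and the `beta` and `threshold` values are assumptions chosen for demonstration.

    import torch

    def implicit_reward_margin(policy_logps_a, policy_logps_b,
                               ref_logps_a, ref_logps_b, beta=0.1):
        # DPO's implicit reward of a response is beta * (log pi(y|x) - log pi_ref(y|x));
        # the absolute margin between two candidates measures how confidently
        # the current policy already prefers one response over the other.
        r_a = beta * (policy_logps_a - ref_logps_a)
        r_b = beta * (policy_logps_b - ref_logps_b)
        return (r_a - r_b).abs()

    def select_queries(margins, threshold=0.5):
        # Active-query rule (hypothetical threshold form): ask the annotator
        # only where the predicted margin is small; confident pairs are
        # skipped, which is what saves preference labels.
        return margins < threshold

    # Toy usage: per-response sequence log-probabilities for a batch of 4 pairs.
    policy_logps_a = torch.tensor([-12.0, -15.0, -9.0, -20.0])
    policy_logps_b = torch.tensor([-14.0, -15.2, -30.0, -19.5])
    ref_logps_a = torch.tensor([-13.0, -15.1, -25.0, -20.2])
    ref_logps_b = torch.tensor([-13.5, -15.0, -26.0, -19.8])

    margins = implicit_reward_margin(policy_logps_a, policy_logps_b,
                                     ref_logps_a, ref_logps_b)
    ask_human = select_queries(margins)  # True where a human label is requested
    print(margins, ask_human)

Under this rule, the labelled subset concentrates on the pairs the policy finds hardest, which is how an active-learning criterion can roughly halve the number of human queries while preserving final performance.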
