Fake it till You Make it: Reward Modeling as Discriminative Prediction

An effective reward model plays a pivotal role in reinforcement learning for post-training enhancement of visual generative models. However, current approaches to reward modeling suffer from implementation complexity due to their reliance on extensive human-annotated preference data or on meticulously engineered quality dimensions that are often incomplete and labor-intensive. Inspired by adversarial training in generative adversarial networks (GANs), this paper proposes GAN-RM, an efficient reward modeling framework that eliminates manual preference annotation and explicit quality dimension engineering. Our method trains the reward model through discrimination between a small set of representative, unpaired target samples (denoted as Preference Proxy Data) and model-generated ordinary outputs, requiring only a few hundred target samples. Comprehensive experiments demonstrate GAN-RM's effectiveness across multiple key applications, including test-time scaling implemented as Best-of-N sample filtering, and post-training approaches such as Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO).
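
Below is a minimal sketch, not the authors' implementation, of the idea the abstract describes: train a binary discriminator to separate a small set of Preference Proxy Data samples from ordinary model generations, then reuse its target-class score as a reward for Best-of-N filtering. All names (RewardHead, feature_dim, the random stand-in features) are illustrative assumptions.

```python
# Hypothetical sketch of discriminative reward modeling + Best-of-N filtering.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardHead(nn.Module):
    """Binary discriminator over pre-extracted sample features (assumed inputs)."""

    def __init__(self, feature_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 256), nn.ReLU(),
            nn.Linear(256, 1),  # logit: "looks like the proxy/target data"
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.net(feats).squeeze(-1)


def train_reward_model(proxy_feats, generated_feats, epochs=10, lr=1e-4):
    """Fit the discriminator: proxy samples -> label 1, model generations -> label 0."""
    model = RewardHead(proxy_feats.shape[-1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    feats = torch.cat([proxy_feats, generated_feats])
    labels = torch.cat([torch.ones(len(proxy_feats)),
                        torch.zeros(len(generated_feats))])
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.binary_cross_entropy_with_logits(model(feats), labels)
        loss.backward()
        opt.step()
    return model


@torch.no_grad()
def best_of_n(reward_model, candidate_feats, k=1):
    """Test-time scaling: keep the k candidates the reward model scores highest."""
    scores = torch.sigmoid(reward_model(candidate_feats))
    return scores.topk(k).indices


# Usage with random stand-in features: a few hundred proxy samples vs. generations.
proxy = torch.randn(300, 512)       # Preference Proxy Data features
generated = torch.randn(2000, 512)  # ordinary model outputs
rm = train_reward_model(proxy, generated)
best = best_of_n(rm, torch.randn(8, 512), k=2)  # indices of top 2 of 8 candidates
```

The same scores could, under the paper's framing, also be used to rank or label samples for post-training recipes such as SFT or DPO; the details of how the authors construct those preference pairs are not given in the abstract.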
@article{liu2025_2506.13846,
  title={Fake it till You Make it: Reward Modeling as Discriminative Prediction},
  author={Runtao Liu and Jiahao Zhan and Yingqing He and Chen Wei and Alan Yuille and Qifeng Chen},
  journal={arXiv preprint arXiv:2506.13846},
  year={2025}
}