Fake it till You Make it: Reward Modeling as Discriminative Prediction

An effective reward model plays a pivotal role in reinforcement learning for post-training enhancement of visual generative models. However, current approaches to reward modeling suffer from implementation complexity due to their reliance on extensive human-annotated preference data or on meticulously engineered quality dimensions that are often incomplete and labor-intensive. Inspired by adversarial training in generative adversarial networks (GANs), this paper proposes GAN-RM, an efficient reward modeling framework that eliminates manual preference annotation and explicit quality dimension engineering. Our method trains the reward model through discrimination between a small set of representative, unpaired target samples (denoted as Preference Proxy Data) and model-generated ordinary outputs, requiring only a few hundred target samples. Comprehensive experiments demonstrate GAN-RM's effectiveness across multiple key applications, including test-time scaling implemented as Best-of-N sample filtering, and post-training approaches such as Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO).
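
Below is a minimal sketch, not the authors' implementation, of the idea the abstract describes: train a binary discriminator to separate a small set of Preference Proxy Data samples from ordinary model generations, then reuse its target-class score as a reward for Best-of-N filtering. All names (RewardHead, feature_dim, the random stand-in features) are illustrative assumptions.

```python
# Hypothetical sketch of discriminative reward modeling + Best-of-N filtering.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardHead(nn.Module):
    """Binary discriminator over pre-extracted sample features (assumed inputs)."""

    def __init__(self, feature_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 256), nn.ReLU(),
            nn.Linear(256, 1),  # logit: "looks like the proxy/target data"
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.net(feats).squeeze(-1)


def train_reward_model(proxy_feats, generated_feats, epochs=10, lr=1e-4):
    """Fit the discriminator: proxy samples -> label 1, model generations -> label 0."""
    model = RewardHead(proxy_feats.shape[-1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    feats = torch.cat([proxy_feats, generated_feats])
    labels = torch.cat([torch.ones(len(proxy_feats)),
                        torch.zeros(len(generated_feats))])
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.binary_cross_entropy_with_logits(model(feats), labels)
        loss.backward()
        opt.step()
    return model


@torch.no_grad()
def best_of_n(reward_model, candidate_feats, k=1):
    """Test-time scaling: keep the k candidates the reward model scores highest."""
    scores = torch.sigmoid(reward_model(candidate_feats))
    return scores.topk(k).indices


# Usage with random stand-in features: a few hundred proxy samples vs. generations.
proxy = torch.randn(300, 512)       # Preference Proxy Data features
generated = torch.randn(2000, 512)  # ordinary model outputs
rm = train_reward_model(proxy, generated)
best = best_of_n(rm, torch.randn(8, 512), k=2)  # indices of top 2 of 8 candidates
```

The same scores could, under the paper's framing, also be used to rank or label samples for post-training recipes such as SFT or DPO; the details of how the authors construct those preference pairs are not given in the abstract.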
@article{liu2025_2506.13846,
  title={Fake it till You Make it: Reward Modeling as Discriminative Prediction},
  author={Runtao Liu and Jiahao Zhan and Yingqing He and Chen Wei and Alan Yuille and Qifeng Chen},
  journal={arXiv preprint arXiv:2506.13846},
  year={2025}
}