
Reward Model Interpretability via Optimal and Pessimal Tokens

10 pages (main), 1 page appendix, 1 page bibliography; 5 figures, 5 tables
Abstract

Reward modeling has emerged as a crucial component in aligning large language models with human values. Significant attention has focused on using reward models as a means of fine-tuning generative models. However, the reward models themselves -- which directly encode human value judgments by turning prompt-response pairs into scalar rewards -- remain relatively understudied. We present a novel approach to reward model interpretability through exhaustive analysis of their responses across their entire vocabulary space. By examining how different reward models score every possible single-token response to value-laden prompts, we uncover several striking findings: (i) substantial heterogeneity between models trained on similar objectives, (ii) systematic asymmetries in how models encode high- versus low-scoring tokens, (iii) significant sensitivity to prompt framing that mirrors human cognitive biases, and (iv) overvaluation of more frequent tokens. We demonstrate these effects across ten recent open-source reward models of varying parameter counts and architectures. Our results challenge assumptions about the interchangeability of reward models, as well as their suitability as proxies of complex and context-dependent human values. We find that these models can encode concerning biases toward certain identity groups, which may emerge as unintended consequences of harmlessness training -- distortions that risk propagating through the downstream large language models now deployed to millions.
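
The core procedure described in the abstract is an exhaustive vocabulary scan: hold a value-laden prompt fixed, score every possible single-token response with the reward model, and inspect the extremes. The sketch below illustrates one way such a scan could be run with the Hugging Face transformers API; the model name and prompt are illustrative assumptions rather than the paper's actual choices, and the loop is left unbatched for clarity.

# Minimal sketch (not the authors' code): score every single-token response
# to one fixed prompt with an open-source reward model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed example model; the paper evaluates ten open-source reward models,
# not necessarily this one.
MODEL_NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

# Illustrative value-laden prompt (not taken from the paper).
prompt = "What, in one word, is the most important human value?"

scores = {}
with torch.no_grad():
    for token_str in tokenizer.get_vocab():
        # Decode the single token back into surface text; distinct tokens that
        # decode to the same text simply overwrite one another here.
        response = tokenizer.convert_tokens_to_string([token_str]).strip()
        if not response:
            continue
        # This reward model scores (prompt, response) pairs and returns a
        # scalar logit. Batching over the vocabulary would be needed for the
        # full scan to run in reasonable time.
        inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
        scores[response] = model(**inputs).logits[0].item()

# "Optimal" and "pessimal" tokens: the highest- and lowest-scoring
# single-token responses to this prompt.
ranked = sorted(scores, key=scores.get)
print("pessimal tokens:", ranked[:10])
print("optimal tokens:", ranked[-10:][::-1])

Repeating this scan across models and across reworded prompts is what would surface the heterogeneity, asymmetry, and framing-sensitivity effects the abstract reports.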

@article{christian2025_2506.07326,
  title={Reward Model Interpretability via Optimal and Pessimal Tokens},
  author={Brian Christian and Hannah Rose Kirk and Jessica A.F. Thompson and Christopher Summerfield and Tsvetomira Dumbalska},
  journal={arXiv preprint arXiv:2506.07326},
  year={2025}
}