
Interpreting Language Reward Models via Contrastive Explanations
Papers citing "Interpreting Language Reward Models via Contrastive Explanations"
37 / 37 papers shown
Title |
---|
RewardBench: Evaluating Reward Models for Language Modeling. Nathan Lambert, Valentina Pyatkin, Jacob Morrison, Lester James V. Miranda, Bill Yuchen Lin, ..., Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, Hanna Hajishirzi |