Title |
---|
Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback. Hamish Ivison, Yizhong Wang, Jiacheng Liu, Zeqiu Wu, Valentina Pyatkin, Nathan Lambert, Noah A. Smith, Yejin Choi, Hannaneh Hajishirzi |
RewardBench: Evaluating Reward Models for Language Modeling. Nathan Lambert, Valentina Pyatkin, Jacob Morrison, Lester James V. Miranda, Bill Yuchen Lin, ... Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, Hanna Hajishirzi |