Visualising Policy-Reward Interplay to Inform Zeroth-Order Preference Optimisation of Large Language Models
Papers citing "Visualising Policy-Reward Interplay to Inform Zeroth-Order Preference Optimisation of Large Language Models"
11 / 11 papers shown
Title |
---|
Secrets of RLHF in Large Language Models Part II: Reward Modeling. Bing Wang, Rui Zheng, Luyao Chen, Yan Liu, Shihan Dou, ... Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, Yuanyuan Jiang |