Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference

Papers citing "Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference"

KTO: Model Alignment as Prospect Theoretic Optimization
Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, Douwe Kiela
02 Feb 2024