Trust Region Preference Approximation: A simple and stable reinforcement learning algorithm for LLM reasoning

Papers citing "Trust Region Preference Approximation: A simple and stable reinforcement learning algorithm for LLM reasoning"