
Trust Region Preference Approximation: A simple and stable reinforcement learning algorithm for LLM reasoning
Papers citing "Trust Region Preference Approximation: A simple and stable reinforcement learning algorithm for LLM reasoning"
32 / 32 papers shown
Title |
---|
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models. Bofei Gao, Feifan Song, Zhiyong Yang, Zefan Cai, Yibo Miao, ... Lei Sha, Yichang Zhang, Xuancheng Ren, Tianyu Liu, Baobao Chang |