Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
1704.06440
Cited By
v1
v2
v3
v4 (latest)
Equivalence Between Policy Gradients and Soft Q-Learning
21 April 2017
John Schulman
Xi Chen
Pieter Abbeel
OffRL
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"Equivalence Between Policy Gradients and Soft Q-Learning"
19 / 19 papers shown
Title
QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning
Fanqi Wan
Weizhou Shen
Shengyi Liao
Yingcheng Shi
Chenliang Li
Ziyi Yang
Ji Zhang
Fei Huang
Jingren Zhou
Ming Yan
OffRL
LLMAG
ReLM
LRM
90
0
0
23 May 2025
Efficient Reinforcement Finetuning via Adaptive Curriculum Learning
Taiwei Shi
Yiyang Wu
Linxin Song
Dinesh Manocha
Jieyu Zhao
LRM
141
12
0
07 Apr 2025
Divergence-Augmented Policy Optimization
Qing Wang
Yingru Li
Jiechao Xiong
Tong Zhang
OffRL
155
16
0
28 Jan 2025
Sharp Analysis for KL-Regularized Contextual Bandits and RLHF
Heyang Zhao
Chenlu Ye
Quanquan Gu
Tong Zhang
OffRL
216
6
0
07 Nov 2024
Value Improved Actor Critic Algorithms
Yaniv Oren
Moritz A. Zanger
Pascal R. van der Vaart
M. Spaan
Wendelin Bohmer
Wendelin Bohmer
OffRL
77
0
0
03 Jun 2024
Learning diverse attacks on large language models for robust red-teaming and safety tuning
Seanie Lee
Minsu Kim
Lynn Cherif
David Dobre
Juho Lee
...
Kenji Kawaguchi
Gauthier Gidel
Yoshua Bengio
Nikolay Malkin
Moksh Jain
AAML
116
19
0
28 May 2024
A Dual Perspective of Reinforcement Learning for Imposing Policy Constraints
Bram De Cooman
Johan A. K. Suykens
64
0
0
25 Apr 2024
Imitation-regularized Optimal Transport on Networks: Provable Robustness and Application to Logistics Planning
Koshi Oishi
Yota Hashizume
Tomohiko Jimbo
Hirotaka Kaji
Kenji Kashima
OOD
76
2
0
28 Feb 2024
q-Learning in Continuous Time
Yanwei Jia
X. Zhou
OffRL
90
76
0
02 Jul 2022
Discrete Sequential Prediction of Continuous Actions for Deep RL
Luke Metz
Julian Ibarz
Navdeep Jaitly
James Davidson
BDL
OffRL
74
120
0
14 May 2017
Bridging the Gap Between Value and Policy Based Reinforcement Learning
Ofir Nachum
Mohammad Norouzi
Kelvin Xu
Dale Schuurmans
158
472
0
28 Feb 2017
Reinforcement Learning with Deep Energy-Based Policies
Tuomas Haarnoja
Haoran Tang
Pieter Abbeel
Sergey Levine
108
1,340
0
27 Feb 2017
Combining policy gradient and Q-learning
Brendan O'Donoghue
Rémi Munos
Koray Kavukcuoglu
Volodymyr Mnih
OffRL
OnRL
76
139
0
05 Nov 2016
Asynchronous Methods for Deep Reinforcement Learning
Volodymyr Mnih
Adria Puigdomenech Badia
M. Berk Mirza
Alex Graves
Timothy Lillicrap
Tim Harley
David Silver
Koray Kavukcuoglu
199
8,859
0
04 Feb 2016
Taming the Noise in Reinforcement Learning via Soft Updates
Roy Fox
Ari Pakman
Naftali Tishby
75
338
0
28 Dec 2015
Dueling Network Architectures for Deep Reinforcement Learning
Ziyun Wang
Tom Schaul
Matteo Hessel
H. V. Hasselt
Marc Lanctot
Nando de Freitas
OffRL
91
3,755
0
20 Nov 2015
Gradient Estimation Using Stochastic Computation Graphs
John Schulman
N. Heess
T. Weber
Pieter Abbeel
OffRL
136
393
0
17 Jun 2015
High-Dimensional Continuous Control Using Generalized Advantage Estimation
John Schulman
Philipp Moritz
Sergey Levine
Michael I. Jordan
Pieter Abbeel
OffRL
99
3,414
0
08 Jun 2015
Should one compute the Temporal Difference fix point or minimize the Bellman Residual? The unified oblique projection view
B. Scherrer
82
102
0
19 Nov 2010
1