arXiv:2305.16877 (v4, latest of four versions)

Distributional Reinforcement Learning with Dual Expectile-Quantile Regression

26 May 2023
Sami Jullien
Romain Deffayet
J. Renders
Paul T. Groth
Maarten de Rijke
    OOD
Main: 8 pages; Bibliography: 2 pages; Appendix: 4 pages; 4 figures; 3 tables
Abstract

Distributional reinforcement learning (RL) has proven useful in multiple benchmarks, as it enables approximating the full distribution of returns and makes better use of environment samples. The commonly used quantile regression approach to distributional RL, based on asymmetric $L_1$ losses, provides a flexible and effective way of learning arbitrary return distributions. In practice, it is often improved by using a more efficient, hybrid asymmetric $L_1$-$L_2$ Huber loss for quantile regression. However, by doing so, distributional estimation guarantees vanish, and we empirically observe that the estimated distribution rapidly collapses to its mean. Indeed, asymmetric $L_2$ losses, corresponding to expectile regression, cannot be readily used for distributional temporal difference learning. Motivated by the efficiency of $L_2$-based learning, we propose to jointly learn expectiles and quantiles of the return distribution in a way that allows efficient learning while keeping an estimate of the full distribution of returns. We prove that our approach approximately learns the correct return distribution, and we benchmark a practical implementation on a toy example and at scale. On the Atari benchmark, our approach matches the performance of the Huber-based IQN-1 baseline after 200M training frames but avoids distributional collapse and keeps estimates of the full distribution of returns.
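
The three loss families named in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the function names, the threshold `kappa`, the sample values, and the grid search are all assumptions made for illustration, and the Huber variant shown is one common textbook form of the asymmetric Huber loss. The sketch only shows that, on a fixed set of samples, the asymmetric $L_1$ loss is minimized at a quantile while the asymmetric $L_2$ loss is minimized at an expectile (the mean when tau = 0.5).

```python
import numpy as np

def asym_weight(u, tau):
    """Weight |tau - 1{u < 0}| applied to a residual u = target - estimate."""
    return np.abs(tau - (u < 0).astype(float))

def quantile_loss(u, tau):
    """Asymmetric L1 (pinball) loss, used in quantile regression."""
    return asym_weight(u, tau) * np.abs(u)

def expectile_loss(u, tau):
    """Asymmetric L2 loss, used in expectile regression."""
    return asym_weight(u, tau) * u ** 2

def quantile_huber_loss(u, tau, kappa=1.0):
    """Hybrid loss: quadratic for |u| <= kappa, linear beyond (kappa is assumed)."""
    abs_u = np.abs(u)
    huber = np.where(abs_u <= kappa, 0.5 * u ** 2, kappa * (abs_u - 0.5 * kappa))
    return asym_weight(u, tau) * huber

def argmin_statistic(loss_fn, samples, tau, grid):
    """Grid-search the scalar estimate that minimizes the mean loss on samples."""
    totals = [loss_fn(samples - q, tau).mean() for q in grid]
    return grid[int(np.argmin(totals))]

# Hypothetical skewed sample of returns: median is 2.0, mean is 3.2.
samples = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
grid = np.linspace(0.0, 10.0, 1001)

median_est = argmin_statistic(quantile_loss, samples, 0.5, grid)
mean_est = argmin_statistic(expectile_loss, samples, 0.5, grid)
print(median_est, mean_est)
```

At tau = 0.5 the two minimizers differ on skewed data, which is why quantiles and expectiles are distinct summaries of the same return distribution.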
