Risk management is critical in decision-making, and the mean-variance (MV) trade-off is one of the most common criteria. However, in reinforcement learning (RL) under a dynamic environment, MV control is considerably harder than in a static environment owing to computational difficulties. For MV-controlled RL, this paper proposes direct expected quadratic utility maximization (EQUM), whose solution yields a mean-variance efficient agent. This approach not only avoids the computational difficulties but also improves empirical performance. In experiments, we demonstrate the effectiveness of the proposed EQUM under benchmark settings.
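As a minimal sketch of why maximizing an expected quadratic utility recovers the MV trade-off, assume a utility of the form u(y) = y - (λ/2) y² with risk-aversion parameter λ > 0 (the paper's exact parameterization may differ). For the return Y_π of a policy π, the identity E[Y_π²] = Var(Y_π) + E[Y_π]² gives

\max_{\pi} \, \mathbb{E}\!\left[ Y_\pi - \tfrac{\lambda}{2} Y_\pi^{2} \right] = \max_{\pi} \left( \mathbb{E}[Y_\pi] - \tfrac{\lambda}{2}\,\mathrm{Var}(Y_\pi) - \tfrac{\lambda}{2}\,\mathbb{E}[Y_\pi]^{2} \right),

so any maximizer trades off mean against variance directly, while the objective itself remains an ordinary expected utility that standard RL methods can optimize, avoiding the nested estimation of the variance term that complicates direct MV control in a dynamic environment.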