
Improved and Generalized Upper Bounds on the Complexity of Policy Iteration

Mathematics of Operations Research (MOR), 2013
Abstract

Given a Markov Decision Process (MDP) with $n$ states and $m$ actions per state, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge. We consider two variations of PI: Howard's PI that changes all the actions with a positive advantage, and Simplex-PI that only changes one action with maximal advantage. We show that Howard's PI terminates after at most $n(m-1) \left\lceil \frac{1}{1-\gamma} \log \left( \frac{1}{1-\gamma} \right) \right\rceil$ iterations, improving by a factor $O(\log n)$ a result by Hansen et al. (2013), while Simplex-PI terminates after at most $n(m-1) \left\lceil \frac{n}{1-\gamma} \log \left( \frac{n}{1-\gamma} \right) \right\rceil$ iterations, improving by a factor 2 a result by Ye (2011). Under some structural assumptions of the MDP, we then consider bounds that are independent of the discount factor $\gamma$. When the MDP is deterministic, we show that Simplex-PI terminates after at most $2 n^2 m (m-1) \lceil 2 (n-1) \log n \rceil \lceil 2 n \log n \rceil = O(n^4 m^2 \log^2 n)$ iterations, improving by a factor $O(n)$ a bound obtained by Post and Ye (2012). We generalize this result to stochastic MDPs: given a measure of the maximal transient time $\tau_t$ and the maximal time $\tau_r$ to revisit states in recurrent classes under all policies, we show that Simplex-PI terminates after at most $n^2 m (m-1) \left( \lceil \tau_r \log (n \tau_r) \rceil + \lceil \tau_r \log (n \tau_t) \rceil \right) \lceil \tau_t \log (n (\tau_t+1)) \rceil = \tilde O(n^2 \tau_t \tau_r m^2)$ iterations. We explain why similar results seem hard to derive for Howard's PI. Finally, under the additional (restrictive) assumption that the MDP is weakly-communicating, we show that Simplex-PI and Howard's PI terminate after at most $n(m-1) \left( \lceil \tau_t \log n \tau_t \rceil + \lceil \tau_r \log n \tau_r \rceil \right) = \tilde O(nm (\tau_t+\tau_r))$ iterations.
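The two PI variants and the two discount-dependent bounds can be illustrated with a minimal Python sketch. It is not taken from the paper and rests on several assumptions: rewards are maximized, $\log$ is the natural logarithm, Howard's PI switches to the greedy action in every state that has a positive advantage, and ties are broken by `argmax`. The array layout (`P[s, a, s']`, `r[s, a]`), the tolerance `tol`, and all function names, including the helper `random_mdp`, are hypothetical.

```python
import math
import numpy as np


def random_mdp(n, m, seed=0):
    """Hypothetical helper: a random MDP with n states and m actions per state."""
    rng = np.random.default_rng(seed)
    P = rng.random((n, m, n))
    P /= P.sum(axis=2, keepdims=True)   # each P[s, a, :] is a distribution
    r = rng.random((n, m))
    return P, r


def evaluate(P, r, policy, gamma):
    """Solve v = r_pi + gamma * P_pi v for a deterministic policy."""
    n = r.shape[0]
    idx = np.arange(n)
    P_pi = P[idx, policy]               # shape (n, n)
    r_pi = r[idx, policy]               # shape (n,)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)


def policy_iteration(P, r, gamma, variant="howard", tol=1e-10):
    """Run Howard's PI or Simplex-PI; return (policy, value, policy updates)."""
    n, m = r.shape
    policy = np.zeros(n, dtype=int)
    iterations = 0
    while True:
        v = evaluate(P, r, policy, gamma)
        q = r + gamma * (P @ v)         # action values, shape (n, m)
        adv = q - v[:, None]            # advantage of each action
        if adv.max() <= tol:            # no improving action: stop
            return policy, v, iterations
        if variant == "howard":
            # Switch in every state that has a positive advantage.
            improvable = adv.max(axis=1) > tol
            policy = np.where(improvable, adv.argmax(axis=1), policy)
        else:
            # Simplex-PI: switch only the single action of maximal advantage.
            s, a = np.unravel_index(adv.argmax(), adv.shape)
            policy = policy.copy()
            policy[s] = a
        iterations += 1


def howard_bound(n, m, gamma):
    """n(m-1) * ceil( 1/(1-gamma) * log(1/(1-gamma)) )."""
    h = 1.0 / (1.0 - gamma)
    return n * (m - 1) * math.ceil(h * math.log(h))


def simplex_bound(n, m, gamma):
    """n(m-1) * ceil( n/(1-gamma) * log(n/(1-gamma)) )."""
    h = n / (1.0 - gamma)
    return n * (m - 1) * math.ceil(h * math.log(h))
```

Calling `policy_iteration(P, r, 0.9, "howard")` and `policy_iteration(P, r, 0.9, "simplex")` on the same `random_mdp` instance lets the observed update counts be compared against `howard_bound` and `simplex_bound`; on small random instances the counts are typically far below these worst-case bounds.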
