Learning in structured MDPs with convex cost function: improved regret bounds for inventory management, Shipra Agrawal; iDS2
From Oluwasanmi Koyejo
views
comments
From Oluwasanmi Koyejo
Abstract: The stochastic inventory control problem under censored demands is a fundamental problem in revenue and supply chain management. A simple class of policies called ``base-stock policies'' is known to be optimal for this problem, and further, the convexity of long-run average-cost under such policies has been established. In this work, we present a learning algorithm for the stochastic inventory control problem under lost sales penalty and positive lead times, when the demand distribution is a priori unknown. Our main result is a regret bound of O(L\sqrt{T}+D) for the algorithm, where T is the time horizon, L is the fixed and known lead time, and D is an unknown parameter of the demand distribution described roughly as the number of time steps needed to generate enough demand for depleting one unit of inventory. Our results significantly improve the existing regret bounds for this problem. Notably, even though the state space of the underlying Markov Decision Process (MDP) in this problem is continuous and L-dimensional, our regret bounds depend linearly on L. Our techniques utilize convexity of the long-run average cost and a newly derived bound on `bias' of base-stock policies, to establish an almost blackbox connection between the problem of learning and optimization in such MDPs and stochastic convex bandit optimization. The techniques presented here may be of independent interest for other settings that involve large structured MDPs but with convex cost functions.