Reinforcement Learning for Multi-Objective Optimization of Online Decisions in High-Dimensional Systems
Abstract
This paper describes a purely data-driven solution to a class of sequential decision-making problems with a large number of concurrent online decisions, with applications to computing systems and operations research. We assume that while the micro-level behaviour of the system can be broadly captured by analytical expressions or simulation, the macro-level or emergent behaviour is complicated by non-linearity, constraints, and stochasticity. If we represent the set of concurrent decisions to be computed as a vector, each element of the vector is assumed to be a continuous variable, and the number of such elements is arbitrarily large and variable from one problem instance to another. We first formulate the decision-making problem as a canonical reinforcement learning (RL) problem, which can be solved using purely data-driven techniques. We modify a standard approach known as advantage actor critic (A2C) to ensure its suitability to the problem at hand, and compare its performance to that of baseline approaches on the specific instance of a multi-product inventory management task. The key modifications include a parallelised formulation of the decision-making task, and a training procedure that explicitly recognises the quantitative relationship between different decisions. We also present experimental results probing the learned policies, and their robustness to variations in the data.
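The advantage actor-critic scheme the abstract refers to can be sketched as follows. This is a generic, minimal illustration of A2C with a vector of concurrent continuous decisions, using linear function approximators, a fixed-variance Gaussian policy, and a made-up toy environment; the dimensions, learning rates, and reward are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: a state summarising the system, and a vector of
# concurrent continuous decisions (e.g. one per product).
STATE_DIM, N_DECISIONS = 4, 6

# Linear critic V(s) = w.s and a Gaussian actor with mean M @ s and
# fixed std; both are illustrative stand-ins for learned networks.
w = np.zeros(STATE_DIM)                 # critic weights
M = np.zeros((N_DECISIONS, STATE_DIM))  # actor mean weights
SIGMA, GAMMA, LR = 0.1, 0.99, 0.01

def step(s, a):
    # Hypothetical environment: reward penalises deviation of each
    # decision from a state-dependent target.
    target = 0.5 * s.sum()
    r = -np.mean((a - target) ** 2)
    s_next = rng.normal(size=STATE_DIM)
    return r, s_next

s = rng.normal(size=STATE_DIM)
for _ in range(200):
    mu = M @ s
    a = mu + SIGMA * rng.normal(size=N_DECISIONS)  # sample vector action

    r, s_next = step(s, a)

    # One-step TD error as the advantage estimate:
    # A = r + gamma * V(s') - V(s)
    adv = r + GAMMA * (w @ s_next) - (w @ s)

    # Critic: TD(0) update.  Actor: Gaussian policy gradient, where
    # grad log pi(a|s) w.r.t. the mean weights is ((a - mu)/sigma^2) outer s.
    w += LR * adv * s
    M += LR * adv * np.outer((a - mu) / SIGMA**2, s)
    s = s_next
```

Note that all elements of the decision vector are sampled and updated in a single pass, which is the sense in which such a formulation handles many concurrent decisions at once; the paper's parallelised formulation and decision-coupling procedure go beyond this baseline.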
 Publication:

arXiv e-prints
 Pub Date:
 October 2019
 arXiv:
 arXiv:1910.00211
 Bibcode:
 2019arXiv191000211M
 Keywords:

 Computer Science - Artificial Intelligence;
 Computer Science - Machine Learning;
 Electrical Engineering and Systems Science - Systems and Control
 E-Print:
 22 pages, 10 figures