Reinforcement Learning from Partial Observation: Linear Function Approximation with Provable Sample Efficiency
Abstract
We study reinforcement learning for partially observed Markov decision processes (POMDPs) with infinite observation and state spaces, which remains less investigated theoretically. To this end, we make the first attempt at bridging partial observability and function approximation for a class of POMDPs with a linear structure. In detail, we propose a reinforcement learning algorithm (Optimistic Exploration via Adversarial Integral Equation or OPTENET) that attains an $\epsilon$optimal policy within $O(1/\epsilon^2)$ episodes. In particular, the sample complexity scales polynomially in the intrinsic dimension of the linear structure and is independent of the size of the observation and state spaces. The sample efficiency of OPTENET is enabled by a sequence of ingredients: (i) a Bellman operator with finite memory, which represents the value function in a recursive manner, (ii) the identification and estimation of such an operator via an adversarial integral equation, which features a smoothed discriminator tailored to the linear structure, and (iii) the exploration of the observation and state spaces via optimism, which is based on quantifying the uncertainty in the adversarial integral equation.
 Publication:

arXiv eprints
 Pub Date:
 April 2022
 arXiv:
 arXiv:2204.09787
 Bibcode:
 2022arXiv220409787C
 Keywords:

 Computer Science  Machine Learning;
 Mathematics  Optimization and Control;
 Statistics  Machine Learning