Does DQN Learn?
Abstract
For a reinforcement learning method to be useful, the policy it estimates in the limit must be superior to the initial guess, at least on average. In this work, we show that the widely used Deep Q-Network (DQN) fails to meet even this basic criterion, even when it gets to see all possible states and actions infinitely often (a condition that ensures tabular Q-learning's convergence to the optimal Q-value). Our work's key highlights are as follows. First, we numerically show that DQN generally has a non-trivial probability of producing a policy worse than the initial one. Second, we give a theoretical explanation for this behavior in the context of linear DQN, wherein we replace the neural network with a linear function approximation but retain DQN's other key ideas, such as experience replay, target network, and $\epsilon$-greedy exploration. Our main result is that the tail behaviors of linear DQN are governed by invariant sets of a deterministic differential inclusion, a set-valued generalization of a differential equation. Notably, we show that these invariant sets need not align with locally optimal policies, thus explaining DQN's pathological behaviors, such as convergence to sub-optimal policies and policy oscillation. We also provide a scenario where the limiting policy is always the worst. Our work addresses a longstanding gap in understanding the behaviors of Q-learning with function approximation and $\epsilon$-greedy exploration.
- Publication:
-
arXiv e-prints
- Pub Date:
- May 2022
- DOI:
- 10.48550/arXiv.2205.13617
- arXiv:
- arXiv:2205.13617
- Bibcode:
- 2022arXiv220513617G
- Keywords:
-
- Computer Science - Machine Learning;
- Mathematics - Optimization and Control;
- 93E35;
- 68Q32;
- I.2.0
- E-Print:
- 20 pages, 3 figures