Bandit Principal Component Analysis
Abstract
We consider a partial-feedback variant of the well-studied online PCA problem where a learner attempts to predict a sequence of $d$-dimensional vectors in terms of a quadratic loss, while only having limited feedback about the environment's choices. We focus on a natural notion of bandit feedback where the learner only observes the loss associated with its own prediction. Based on the classical observation that this decision-making problem can be lifted to the space of density matrices, we propose an algorithm that is shown to achieve a regret of $O(d^{3/2}\sqrt{T})$ after $T$ rounds in the worst case. We also prove data-dependent bounds that improve on the basic result when the loss matrices of the environment have bounded rank or the loss of the best action is bounded. One version of our algorithm runs in $O(d)$ time per trial which massively improves over every previously known online PCA method. We complement these results by a lower bound of $\Omega(d\sqrt{T})$.
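The lifting to density matrices mentioned above can be illustrated with a small sketch. It is not the paper's algorithm. It assumes the quadratic loss of a unit vector $w$ against a symmetric loss matrix $L$ has the form $w^\top L w$, and it only checks the identity $\mathbb{E}[w^\top L w] = \mathrm{tr}(W L)$ when $w$ is drawn from the eigenvectors of a density matrix $W$ with probabilities given by its eigenvalues. All names (`quad_loss`, `density_matrix`, the 2-dimensional example values) are hypothetical and chosen for illustration.

```python
import math
import random

def quad_loss(w, L):
    """Assumed loss form: the quadratic form w^T L w."""
    d = len(w)
    return sum(w[i] * L[i][j] * w[j] for i in range(d) for j in range(d))

def trace_product(A, B):
    """Return tr(A B) for two square matrices given as nested lists."""
    d = len(A)
    return sum(A[i][j] * B[j][i] for i in range(d) for j in range(d))

def density_matrix(probs, basis):
    """Build W = sum_i p_i u_i u_i^T from an orthonormal basis and a
    probability vector, so W is symmetric, PSD, and has unit trace."""
    d = len(basis[0])
    W = [[0.0] * d for _ in range(d)]
    for p, u in zip(probs, basis):
        for i in range(d):
            for j in range(d):
                W[i][j] += p * u[i] * u[j]
    return W

# Illustrative 2-dimensional instance (values are arbitrary assumptions).
theta = 0.3
basis = [[math.cos(theta), math.sin(theta)],
         [-math.sin(theta), math.cos(theta)]]
probs = [0.7, 0.3]
L = [[1.0, 0.2],
     [0.2, 0.5]]

W = density_matrix(probs, basis)

# Expected loss of the randomized prediction vs. the lifted linear loss.
expected_loss = sum(p * quad_loss(u, L) for p, u in zip(probs, basis))
lifted_loss = trace_product(W, L)

# One round of bandit feedback: sample a prediction and reveal only its loss.
rng = random.Random(0)
w_t = rng.choices(basis, weights=probs, k=1)[0]
observed = quad_loss(w_t, L)  # the learner never sees L itself
```

In this sketch the loss is quadratic in the prediction $w$ but linear in $W$, which is what makes the lifted problem amenable to linear-bandit-style reasoning; the learner only ever receives the scalar `observed`.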
 Publication:

arXiv e-prints
 Pub Date:
 February 2019
 arXiv:
 arXiv:1902.03035
 Bibcode:
 2019arXiv190203035K
 Keywords:

 Computer Science - Machine Learning;
 Statistics - Machine Learning