Leveraging Initial Hints for Free in Stochastic Linear Bandits
Abstract
We study the setting of optimizing with bandit feedback when additional prior knowledge is provided to the learner in the form of an initial hint of the optimal action. We present a novel algorithm for stochastic linear bandits that uses this hint to improve its regret to $\tilde O(\sqrt{T})$ when the hint is accurate, while maintaining a minimax-optimal $\tilde O(d\sqrt{T})$ regret independent of the quality of the hint. Furthermore, we provide a Pareto frontier of tight tradeoffs between best-case and worst-case regret, with matching lower bounds. Perhaps surprisingly, our work shows that leveraging a hint yields provable gains without sacrificing worst-case performance, implying that our algorithm adapts to the quality of the hint for free. We also provide an extension of our algorithm to the case of $m$ initial hints, showing that we can achieve a $\tilde O(m^{2/3}\sqrt{T})$ regret.
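To make the setting concrete, here is a toy simulation of a stochastic linear bandit with an initial hint. This is a minimal sketch, not the paper's algorithm: it uses a simple explore-then-commit baseline that compares the hinted action against a few random unit actions, merely to illustrate how an accurate hint can drive regret down while an inaccurate one is discarded. All function and variable names (`run_hint_bandit`, `theta`, etc.) are hypothetical.

```python
import math
import random

def run_hint_bandit(theta, hint, horizon, noise=0.1, seed=0):
    """Toy stochastic linear bandit: reward of action a is <theta, a> + Gaussian noise.

    Explore-then-commit illustration (NOT the paper's algorithm): estimate the
    mean reward of the hinted action and of a few random unit actions, then
    commit to the empirically best one for the remaining rounds.
    Returns the (pseudo-)regret against the best unit action theta/||theta||.
    """
    rng = random.Random(seed)
    d = len(theta)

    def reward(a):
        # Noisy linear reward.
        return sum(t * x for t, x in zip(theta, a)) + rng.gauss(0.0, noise)

    def random_unit():
        # Uniform direction on the unit sphere in R^d.
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    # Exploration: sample the hint and a handful of random candidates.
    n_explore = max(1, horizon // 10)
    hint_mean = sum(reward(hint) for _ in range(n_explore)) / n_explore
    best_cand, best_mean = None, -float("inf")
    for a in (random_unit() for _ in range(5)):
        m = sum(reward(a) for _ in range(n_explore)) / n_explore
        if m > best_mean:
            best_cand, best_mean = a, m

    # Commit: play the empirically better action for the whole horizon.
    play = hint if hint_mean >= best_mean else best_cand
    opt_value = math.sqrt(sum(t * t for t in theta))  # value of theta/||theta||
    total = sum(reward(play) for _ in range(horizon))
    return opt_value * horizon - total
```

With an accurate hint the committed action is (near-)optimal, so regret stays small; with a misleading hint the baseline falls back to its own exploration, at the cost of a larger but bounded regret, mirroring the best-case/worst-case tradeoff the abstract describes.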
 Publication:

arXiv e-prints
 Pub Date:
 March 2022
 arXiv:
 arXiv:2203.04274
 Bibcode:
 2022arXiv220304274C
 Keywords:

 Computer Science - Machine Learning;
 Computer Science - Data Structures and Algorithms
 E-Print:
 ALT 2022