Why Random Reshuffling Beats Stochastic Gradient Descent
Abstract
We analyze the convergence rate of the random reshuffling (RR) method, a randomized first-order incremental algorithm for minimizing a finite sum of convex component functions. RR proceeds in cycles, picking a uniformly random order (permutation) and processing the component functions one at a time according to this order; i.e., at each cycle, each component function is sampled without replacement from the collection. Although RR has been numerically observed to outperform its with-replacement counterpart, stochastic gradient descent (SGD), characterizing its convergence rate has been a long-standing open question. In this paper, we answer this question by showing that when the component functions are quadratics or smooth and the sum function is strongly convex, RR with iterate averaging and a diminishing stepsize $\alpha_k = \Theta(1/k^s)$ for $s \in (1/2, 1)$ converges at rate $\Theta(1/k^{2s})$ with probability one in the suboptimality of the objective value, thus improving upon the $\Omega(1/k)$ rate of SGD. Our analysis draws on the theory of Polyak-Ruppert averaging and relies on decoupling the dependent cycle gradient error into a term that is independent over cycles and another term dominated by $\alpha_k^2$. This allows us to apply the law of large numbers to an appropriately weighted version of the cycle gradient errors, where the weights depend on the stepsize. We also provide high-probability convergence rate estimates that show the decay rates of the different terms and allow us to propose a modification of RR with convergence rate $\mathcal{O}(1/k^2)$.
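As a concrete illustration (not part of the paper itself), the following Python sketch implements the RR scheme described in the abstract: each cycle draws a fresh uniform permutation, takes one gradient step per component sampled without replacement, and maintains a Polyak-Ruppert running average of the cycle-end iterates under a diminishing stepsize $\alpha_k = c/k^s$ with $s \in (1/2, 1)$. The quadratic toy problem, the constant c, and the choice s = 0.75 are illustrative assumptions, not values from the paper.

```python
import numpy as np

def random_reshuffling(grads, x0, n_cycles, c=0.02, s=0.75, rng=None):
    """Random reshuffling with Polyak-Ruppert iterate averaging (a sketch).

    grads    : list of per-component gradient functions g_i(x)
    x0       : initial iterate (numpy array)
    n_cycles : number of passes (cycles) over the components
    c, s     : stepsize schedule alpha_k = c / k**s, with s in (1/2, 1)
    Returns the averaged iterate x_bar.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(grads)
    x = x0.copy()
    x_bar = x0.copy()
    for k in range(1, n_cycles + 1):
        alpha_k = c / k**s            # one diminishing stepsize per cycle
        order = rng.permutation(n)    # sample components WITHOUT replacement
        for i in order:
            x = x - alpha_k * grads[i](x)
        # Polyak-Ruppert running average of the cycle-end iterates
        x_bar += (x - x_bar) / k
    return x_bar

# Toy strongly convex finite sum: f(x) = (1/n) * sum_i 0.5 * ||A_i x - b_i||^2,
# so each component gradient is A_i^T (A_i x - b_i). All values are illustrative.
rng = np.random.default_rng(0)
n, d = 20, 5
A = rng.standard_normal((n, d, d))
b = rng.standard_normal((n, d))
grads = [lambda x, A=A[i], b=b[i]: A.T @ (A @ x - b) for i in range(n)]
x_rr = random_reshuffling(grads, np.zeros(d), n_cycles=2000, rng=rng)
```

Under the paper's assumptions, one would expect the suboptimality of the averaged iterate, $f(\bar{x}_k) - f(x^\ast)$, plotted against the cycle count $k$, to decay roughly like $1/k^{2s}$, versus the $1/k$ rate of with-replacement SGD on the same problem.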
Publication: arXiv e-prints
Pub Date: October 2015
arXiv: arXiv:1510.08560
Bibcode: 2015arXiv151008560G
Keywords: Mathematics - Optimization and Control
E-Print: Mathematical Programming, 2019