A Convergence Theory for Deep Learning via Over-Parameterization
Abstract
Deep neural networks (DNNs) have demonstrated dominating performance in many fields; since AlexNet, networks used in practice have grown wider and deeper. On the theoretical side, a long line of works has focused on training neural networks with one hidden layer. The theory of multilayer networks remains largely unsettled. In this work, we prove why stochastic gradient descent (SGD) can find $\textit{global minima}$ on the training objective of DNNs in $\textit{polynomial time}$. We only make two assumptions: the inputs are non-degenerate and the network is over-parameterized. The latter means the network width is sufficiently large: $\textit{polynomial}$ in $L$, the number of layers, and in $n$, the number of samples. Our key technique is to derive that, in a sufficiently large neighborhood of the random initialization, the optimization landscape is almost convex and semi-smooth even with ReLU activations. This implies an equivalence between over-parameterized neural networks and the neural tangent kernel (NTK) in the finite (and polynomial) width setting. As concrete examples, starting from randomly initialized weights, we prove that SGD can attain 100% training accuracy in classification tasks, or minimize regression loss at a linear convergence rate, with running time polynomial in $n$ and $L$. Our theory applies to the widely-used but non-smooth ReLU activation, and to any smooth and possibly non-convex loss function. In terms of network architectures, our theory at least applies to fully-connected neural networks, convolutional neural networks (CNNs), and residual neural networks (ResNets).
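The regression result can be illustrated with a minimal NumPy sketch, not the paper's construction: gradient descent from random initialization on a one-hidden-layer ReLU network whose width $m$ is large relative to $n$. For simplicity only the hidden layer is trained (the paper's analysis covers all layers and depth $L$), and the sizes `n, d, m` and the step size `lr` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance of the setting: n unit-norm (non-degenerate) inputs and a
# one-hidden-layer ReLU network whose width m is large relative to n.
# All hyperparameters here are illustrative, not from the paper.
n, d, m = 10, 10, 1000
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # normalize inputs
y = rng.standard_normal(n)                      # arbitrary regression targets

# Random initialization in NTK-style scaling; only the hidden layer W is
# trained here, a simplification of the paper's setting.
W = rng.standard_normal((m, d)) / np.sqrt(d)
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)  # fixed output weights

lr, steps = 0.1, 1000
losses = []
for _ in range(steps):
    pre = X @ W.T                        # (n, m) pre-activations
    out = np.maximum(pre, 0.0) @ a       # network predictions
    r = out - y                          # residuals
    losses.append(0.5 * float(r @ r))    # squared-error training loss
    # Gradient of the loss w.r.t. W; the ReLU derivative is the 0/1 indicator
    grad = ((pre > 0) * (r[:, None] * a[None, :])).T @ X
    W -= lr * grad

# With sufficient width, the training loss decreases geometrically toward zero,
# consistent with the linear convergence rate claimed for the regression case.
```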
 Publication:

arXiv e-prints
 Pub Date:
 November 2018
 arXiv:
 arXiv:1811.03962
 Bibcode:
 2018arXiv181103962A
 Keywords:

 Computer Science - Machine Learning;
 Computer Science - Data Structures and Algorithms;
 Computer Science - Neural and Evolutionary Computing;
 Mathematics - Optimization and Control;
 Statistics - Machine Learning
 E-Print:
 V2 adds citation and V3/V4/V5 polish writing