The loss surface of deep and wide neural networks
Abstract
While the optimization problem behind deep neural networks is highly nonconvex, it is frequently observed in practice that deep networks can be trained without getting stuck in suboptimal points. It has been argued that this is because all local minima are close to being globally optimal. We show that this is (almost) true: in fact, almost all local minima are globally optimal for a fully connected network with squared loss and analytic activation function, provided that the number of hidden units of one layer of the network is larger than the number of training points and the network structure from this layer on is pyramidal.
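The sufficient condition in the abstract is purely architectural, so it can be illustrated with a small sketch. The helper below is hypothetical (not from the paper): given the layer widths of a fully connected network and the number of training points, it checks whether some hidden layer is wider than the training set and whether the widths are non-increasing (pyramidal) from that layer to the output.

```python
def satisfies_wide_pyramidal_condition(layer_widths, num_training_points):
    """Sketch of the paper's sufficient condition: some hidden layer has
    more units than training points, and layer widths are non-increasing
    (pyramidal) from that layer to the output.

    layer_widths: [input_dim, hidden_1, ..., hidden_L, output_dim]
    """
    hidden = layer_widths[1:-1]
    for k, width in enumerate(hidden, start=1):
        if width > num_training_points:
            # Widths from this wide layer to the output must be non-increasing.
            tail = layer_widths[k:]
            if all(a >= b for a, b in zip(tail, tail[1:])):
                return True
    return False

# Example: 1000 training points; a 1024-unit hidden layer followed by a
# pyramidal tail satisfies the condition, a narrower network does not.
print(satisfies_wide_pyramidal_condition([784, 1024, 512, 10], 1000))  # True
print(satisfies_wide_pyramidal_condition([784, 256, 128, 10], 1000))   # False
```

Under the theorem, networks passing this check have (almost) no suboptimal local minima for the squared loss with an analytic activation; the check says nothing about networks that fail it.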
Publication:
arXiv e-prints
 Pub Date:
 April 2017
 arXiv:
 arXiv:1704.08045
 Bibcode:
 2017arXiv170408045N
Keywords:
Computer Science - Machine Learning;
Computer Science - Artificial Intelligence;
Computer Science - Computer Vision and Pattern Recognition;
Computer Science - Neural and Evolutionary Computing;
Statistics - Machine Learning
E-Print:
ICML 2017. Main results now hold for larger classes of loss functions.