Learning Over-Parametrized Two-Layer ReLU Neural Networks beyond NTK
Abstract
We consider the dynamics of gradient descent for learning a two-layer neural network. We assume the input $x\in\mathbb{R}^d$ is drawn from a Gaussian distribution and the label of $x$ satisfies $f^{\star}(x) = a^{\top}|W^{\star}x|$, where $a\in\mathbb{R}^d$ is a nonnegative vector, $W^{\star} \in\mathbb{R}^{d\times d}$ is an orthonormal matrix, and $|\cdot|$ is applied element-wise. We show that an over-parametrized two-layer neural network with ReLU activation, trained by gradient descent from random initialization, can provably learn the ground truth network with population loss at most $o(1/d)$ in polynomial time with polynomially many samples. On the other hand, we prove that any kernel method, including the Neural Tangent Kernel, with a polynomial number of samples in $d$, has population loss at least $\Omega(1 / d)$.
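As a concrete illustration of the setup described in the abstract, the sketch below generates Gaussian inputs labeled by a ground-truth network $f^{\star}(x) = a^{\top}|W^{\star}x|$ (with $a$ nonnegative and $W^{\star}$ orthonormal), then trains an over-parametrized two-layer ReLU student by plain full-batch gradient descent from random initialization. All specific choices here (dimension `d`, width `m`, sample size `n`, step size, step count, and training only the first layer with fixed random signs in the second) are illustrative assumptions, not the paper's analyzed algorithm or parameter regime.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 10      # input dimension (the paper's guarantees are asymptotic in d)
m = 100     # over-parametrized hidden width, m >> d (illustrative)
n = 2000    # number of training samples (illustrative)
lr = 0.05   # gradient-descent step size (illustrative)

# Ground truth: f*(x) = a^T |W* x|, a >= 0 entrywise, W* orthonormal.
a_star = np.abs(rng.standard_normal(d))
W_star = np.linalg.qr(rng.standard_normal((d, d)))[0]  # orthonormal via QR

# Gaussian inputs and their labels under the ground-truth network.
X = rng.standard_normal((n, d))
y = np.abs(X @ W_star.T) @ a_star

# Over-parametrized two-layer ReLU student at random initialization;
# second-layer signs are fixed and only the first layer is trained.
W = rng.standard_normal((m, d)) / np.sqrt(d)
v = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)

def relu(z):
    return np.maximum(z, 0.0)

def predict(X):
    return relu(X @ W.T) @ v

mse0 = np.mean((predict(X) - y) ** 2)  # empirical loss at initialization

# Full-batch gradient descent on the squared loss, first layer only.
for _ in range(500):
    pre = X @ W.T                # (n, m) pre-activations
    err = relu(pre) @ v - y      # (n,) residuals
    # dL/dW: ReLU gate (pre > 0), second-layer weight v, input x, averaged.
    grad = ((err[:, None] * (pre > 0)) * v).T @ X / n
    W -= lr * grad

mse = np.mean((predict(X) - y) ** 2)  # empirical loss after training
```

This only demonstrates that gradient descent on the student decreases the empirical loss in this data model; the paper's actual result (population loss $o(1/d)$, and the $\Omega(1/d)$ kernel lower bound) rests on an analysis well beyond this toy run.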
Publication:
arXiv e-prints
Pub Date:
July 2020
arXiv:
arXiv:2007.04596
Bibcode:
2020arXiv200704596L
Keywords:
Computer Science - Machine Learning;
Mathematics - Optimization and Control;
Statistics - Machine Learning
E-Print:
Conference on Learning Theory (COLT) 2020