Disentangling feature and lazy training in deep neural networks
Abstract
Two distinct limits for deep learning have been derived as the network width h → ∞, depending on how the weights of the last layer scale with h. In the neural tangent kernel (NTK) limit, the dynamics becomes linear in the weights and is described by a frozen kernel Θ (the NTK). By contrast, in the mean-field limit, the dynamics can be expressed in terms of the distribution of the parameters associated with a neuron, which follows a partial differential equation. In this work we consider deep networks where the weights in the last layer scale as $\alpha h^{-1/2}$ at initialization. By varying α and h, we probe the crossover between the two limits. We observe the two previously identified regimes of 'lazy training' and 'feature training'. In the lazy-training regime, the dynamics is almost linear and the NTK barely changes after initialization. The feature-training regime includes the mean-field formulation as a limiting case and is characterized by a kernel that evolves in time, and thus learns some features. We perform numerical experiments on MNIST, Fashion-MNIST, EMNIST and CIFAR10 and consider various architectures. We find that: (i) the two regimes are separated by an α* that scales as $\frac{1}{\sqrt{h}}$. (ii) Network architecture and data structure play an important role in determining which regime is better: in our tests, fully-connected networks generally perform better in the lazy-training regime, unlike convolutional networks. (iii) In both regimes, the fluctuations δF induced on the learned function by the initial conditions decay as $\delta F \sim 1/\sqrt{h}$, leading to a performance that increases with h. The same improvement can also be obtained at an intermediate width by ensemble-averaging several networks that are trained independently. (iv) In the feature-training regime we identify a time scale ${t}_{1}\sim \sqrt{h}\alpha$, such that for t ≪ t_{1} the dynamics is linear.
At t ∼ t_{1}, the output has grown by a magnitude $\sqrt{h}$ and the changes of the tangent kernel $\Vert {\Delta}{\Theta}\Vert$ become significant. Ultimately, it follows $\Vert {\Delta}{\Theta}\Vert \sim {\left(\sqrt{h}\alpha \right)}^{-a}$ for ReLU and Softplus activation functions, with a < 2 and a → 2 as depth grows. We provide scaling arguments supporting these findings.
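The α-interpolation described in the abstract can be sketched in a toy numerical experiment (an illustration written for this page, not the paper's code): a one-hidden-layer ReLU network whose output carries the α/√h prefactor, its empirical tangent kernel Θ = JJᵀ, and plain gradient descent with the learning rate rescaled by 1/α² (an assumption made here so the output-space dynamics stays comparable across α, in the spirit of the lazy-training scaling). Consistent with the abstract, the relative change of Θ after training is small at large α (lazy regime) and larger at small α (feature regime).

```python
import numpy as np

rng = np.random.default_rng(0)

def init(h, d):
    # Hidden weights W ~ N(0, 1/d), last-layer weights w ~ N(0, 1);
    # the alpha / sqrt(h) factor is applied in the forward pass.
    return rng.normal(0, 1 / np.sqrt(d), (h, d)), rng.normal(0, 1, h)

def forward(params, X, alpha, h):
    W, w = params
    act = np.maximum(X @ W.T, 0.0)        # ReLU features, shape (n, h)
    return alpha / np.sqrt(h) * act @ w

def ntk(params, X, alpha, h):
    # Empirical tangent kernel Theta = J J^T, J the Jacobian of the
    # output w.r.t. all parameters, assembled one sample at a time.
    W, w = params
    n = X.shape[0]
    J = np.zeros((n, W.size + w.size))
    for i in range(n):
        pre = W @ X[i]
        act = np.maximum(pre, 0.0)
        dact = (pre > 0).astype(float)
        gW = alpha / np.sqrt(h) * np.outer(w * dact, X[i])  # d out / d W
        gw = alpha / np.sqrt(h) * act                       # d out / d w
        J[i] = np.concatenate([gW.ravel(), gw])
    return J @ J.T

def train(params, X, y, alpha, h, lr, steps):
    # Full-batch gradient descent on the squared loss; lr / alpha**2 is
    # the alpha-dependent rescaling assumed for this illustration.
    W, w = (p.copy() for p in params)
    for _ in range(steps):
        pre = X @ W.T
        act = np.maximum(pre, 0.0)
        err = alpha / np.sqrt(h) * act @ w - y
        gw = alpha / np.sqrt(h) * act.T @ err
        gW = alpha / np.sqrt(h) * ((err[:, None] * (pre > 0)) * w).T @ X
        W -= lr / alpha**2 * gW
        w -= lr / alpha**2 * gw
    return W, w

# Tiny experiment: relative kernel change after training, large vs small alpha.
X = rng.normal(size=(10, 5))
y = rng.normal(size=10)
h = 200
rel_change = {}
for alpha in (10.0, 0.1):
    p0 = init(h, 5)
    T0 = ntk(p0, X, alpha, h)
    p1 = train(p0, X, y, alpha, h, lr=0.01, steps=300)
    rel_change[alpha] = np.linalg.norm(ntk(p1, X, alpha, h) - T0) / np.linalg.norm(T0)
```

Here `rel_change[10.0]` comes out much smaller than `rel_change[0.1]`: at large α the parameters barely move and the kernel is effectively frozen, while at small α the kernel evolves, the qualitative signature of the two regimes.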
 Publication:

Journal of Statistical Mechanics: Theory and Experiment
 Pub Date:
 November 2020
 DOI:
 10.1088/1742-5468/abc4de
 arXiv:
 arXiv:1906.08034
 Bibcode:
 2020JSMTE2020k3301G
 Keywords:

 deep learning;
 machine learning;
 Computer Science - Machine Learning;
 Statistics - Machine Learning
 E-Print:
 minor revisions