Training Faster by Separating Modes of Variation in Batch-normalized Models

doi:10.48550/arXiv.1806.02892

Batch Normalization (BN) is essential to effectively train state-of-the-art deep Convolutional Neural Networks (CNNs). It normalizes the inputs to each layer during training using the statistics of each mini-batch. In this work, we study BN from the viewpoint of Fisher kernels. We show that, assuming the samples within a mini-batch are drawn from the same probability density function, BN is identical to the Fisher vector of a Gaussian distribution. This means BN can be explained in terms of kernels that naturally emerge from the probability density function of the underlying data distribution. However, given the rectifying non-linearities employed in CNN architectures, the distribution of inputs to the layers is heavy-tailed and asymmetric. Therefore, we propose approximating the underlying data distribution not with one Gaussian density but with a mixture of Gaussian densities. Deriving the Fisher vector for a Gaussian Mixture Model (GMM) reveals that BN can be improved by independently normalizing with respect to the statistics of disentangled sub-populations. We refer to our proposed soft piecewise version of BN as Mixture Normalization (MN). Through an extensive set of experiments on CIFAR-10 and CIFAR-100, we show that MN not only effectively accelerates the training of image classification and Generative Adversarial Networks, but also yields higher-quality models.
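To make the contrast between BN and a soft piecewise normalization concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the component responsibilities `resp` (for example, GMM posteriors) are already available, and it leaves out any learnable scale and shift as well as any component-dependent scaling the paper may apply. The names `batch_norm`, `mixture_norm`, and `resp` are hypothetical and chosen for illustration only.

```python
import numpy as np


def batch_norm(x, eps=1e-5):
    """Normalize each feature with the mean and variance of the mini-batch.

    x has shape (N, D): N samples, D features.
    """
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)


def mixture_norm(x, resp, eps=1e-5):
    """Soft piecewise normalization (illustrative assumption, see lead-in).

    x    : (N, D) mini-batch of features.
    resp : (N, K) responsibilities of K components for each sample;
           each row is assumed to sum to 1 (e.g. GMM posteriors).
    Each component normalizes the batch with its own weighted statistics,
    and the results are blended using the responsibilities.
    """
    out = np.zeros_like(x, dtype=float)
    for k in range(resp.shape[1]):
        w = resp[:, k:k + 1]                       # (N, 1) weights of component k
        total = max(w.sum(), eps)                  # guard against empty components
        mean_k = (w * x).sum(axis=0) / total       # weighted mean, shape (D,)
        var_k = (w * (x - mean_k) ** 2).sum(axis=0) / total
        out += w * (x - mean_k) / np.sqrt(var_k + eps)
    return out
```

With a single component whose responsibilities are all 1, this sketch reduces to plain batch normalization, which mirrors the abstract's statement that BN corresponds to the single-Gaussian case.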

Publication: arXiv e-prints

Pub Date: June 2018

DOI: 10.48550/arXiv.1806.02892

arXiv: arXiv:1806.02892

Bibcode: 2018arXiv180602892K

Keywords: Computer Science - Machine Learning; Computer Science - Computer Vision and Pattern Recognition; Statistics - Machine Learning
