Scalable MCMC for Large Data Problems using Data Subsampling and the Difference Estimator
Abstract
We propose a generic Markov chain Monte Carlo (MCMC) algorithm to speed up computations for datasets with many observations. A key feature of our approach is the use of the highly efficient difference estimator from the survey sampling literature to estimate the log-likelihood accurately using only a small fraction of the data. Our algorithm improves on the $O(n)$ complexity of regular MCMC, where $n$ is the number of observations, by operating over local data clusters instead of the full sample when computing the likelihood. The likelihood estimate is used in a pseudo-marginal framework to sample from a perturbed posterior which is within $O(m^{-1/2})$ of the true posterior, where $m$ is the subsample size. The method is applied to a logistic regression model to predict firm bankruptcy for a large dataset. We document a significant speedup compared with standard MCMC on the full dataset.
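As a rough illustration of the difference estimator mentioned in the abstract, the sketch below estimates a logistic regression log-likelihood from a subsample of size $m$ drawn with replacement. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the simulated data, and the choice of proxy (each term evaluated at a fixed reference value `theta_ref`) are all hypothetical. In particular, the proxies are summed directly here, which is still $O(n)$; the local data clusters that the abstract uses to avoid this cost are not modeled.

```python
import math
import random


def log_lik_term(theta, x, y):
    """Exact logistic log-likelihood contribution l_k(theta) of one observation."""
    eta = sum(t * xi for t, xi in zip(theta, x))
    # y*eta - log(1 + exp(eta)), written in a numerically stable form
    return y * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))


def difference_estimate(ell, q, m, rng):
    """Difference estimator of sum_k l_k from m indices drawn with replacement.

    ell: callable k -> exact term l_k, evaluated only on the subsample
    q:   list of cheap proxies q_k, available for every observation
    """
    n = len(q)
    idx = [rng.randrange(n) for _ in range(m)]
    # Scale the sampled differences l_k - q_k up to the full population.
    correction = (n / m) * sum(ell(k) - q[k] for k in idx)
    return sum(q) + correction


# Hypothetical simulated data: intercept plus one covariate.
rng = random.Random(0)
n, m = 1000, 50
xs = [(1.0, rng.gauss(0.0, 1.0)) for _ in range(n)]
ys = [rng.randrange(2) for _ in range(n)]

theta_ref = (0.0, 0.0)   # assumed reference point for the proxies
theta = (0.1, -0.1)      # parameter value at which the likelihood is needed

# Proxies are computed once at theta_ref (an illustrative choice only).
q = [log_lik_term(theta_ref, x, y) for x, y in zip(xs, ys)]

estimate = difference_estimate(
    lambda k: log_lik_term(theta, xs[k], ys[k]), q, m, rng
)
exact = sum(log_lik_term(theta, x, y) for x, y in zip(xs, ys))
print("exact:", exact, "estimate:", estimate)
```

The estimator is unbiased for the full-data sum, and its variance depends only on how much the differences $l_k - q_k$ vary, so proxies that track the exact terms closely allow a small $m$. If the proxies were exact, the correction term would vanish and the estimate would equal the full-data log-likelihood.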
- Publication: arXiv e-prints
- Pub Date: July 2015
- DOI: 10.48550/arXiv.1507.02971
- arXiv: arXiv:1507.02971
- Bibcode: 2015arXiv150702971Q
- Keywords: Statistics - Methodology; Statistics - Computation; Statistics - Machine Learning
- E-Print: The content in this paper is now in arXiv:1404.4178, as a result of a major revision of that paper