Model-based clustering of large networks
Abstract
We describe a network clustering framework, based on finite mixture models, that can be applied to discrete-valued networks with hundreds of thousands of nodes and billions of edge variables. Relative to other recent model-based clustering work for networks, we introduce a more flexible modeling framework, improve the variational-approximation estimation algorithm, discuss and implement standard error estimation via a parametric bootstrap approach, and apply these methods to much larger data sets than those seen elsewhere in the literature. The more flexible framework is achieved through introducing novel parameterizations of the model, giving varying degrees of parsimony, using exponential family models whose structure may be exploited in various theoretical and algorithmic ways. The algorithms are based on variational generalized EM algorithms, where the E-steps are augmented by a minorization-maximization (MM) idea. The bootstrapped standard error estimates are based on an efficient Monte Carlo network simulation idea. Last, we demonstrate the usefulness of the model-based clustering framework by applying it to a discrete-valued network with more than 131,000 nodes and 17 billion edge variables.
 Publication:

arXiv e-prints
 Pub Date:
 July 2012
 DOI:
 10.48550/arXiv.1207.0188
 arXiv:
 arXiv:1207.0188
 Bibcode:
 2012arXiv1207.0188V
 Keywords:

 Statistics - Computation;
 Computer Science - Social and Information Networks;
 Physics - Physics and Society;
 Statistics - Applications
 E-Print:
 Published at http://dx.doi.org/10.1214/12-AOAS617 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)