Efficient Principal Subspace Projection of Streaming Data Through Fast Similarity Matching

doi:10.48550/arXiv.1808.02083

Big data problems frequently require processing datasets in a streaming fashion, either because the data, though all available at once, are collectively larger than available memory, or because the data intrinsically arrive one data point at a time and must be processed online. Here, we introduce a computationally efficient version of similarity matching, a framework for online dimensionality reduction that incrementally estimates the top K-dimensional principal subspace of streamed data while keeping in memory only the last sample and the current iterate. To assess the performance of our approach, we construct and make public a test suite containing both a synthetic data generator and the infrastructure to test online dimensionality reduction algorithms on real datasets, as well as performant implementations of our algorithm and competing algorithms with similar aims. Among the algorithms considered, we find our approach to be competitive, performing among the best on both synthetic and real data.
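To make the streaming setting concrete, the sketch below estimates a K-dimensional principal subspace one sample at a time, holding only the current sample and the current basis in memory. It is not the fast similarity matching algorithm from the paper: it substitutes the classical Oja subspace rule with a Gram-Schmidt re-orthonormalization step, chosen only because it is short and well known. All names (`orthonormalize`, `oja_step`, `sample_stream`, `run_demo`), the step size `eta`, and the synthetic data scales are assumptions made for illustration, and the data are assumed to be zero-mean.

```python
import math
import random


def orthonormalize(basis):
    """Gram-Schmidt: return orthonormal vectors spanning the same space as `basis`."""
    ortho = []
    for v in basis:
        w = list(v)
        for q in ortho:
            dot = sum(a * b for a, b in zip(w, q))
            w = [a - dot * b for a, b in zip(w, q)]
        norm = math.sqrt(sum(a * a for a in w))
        if norm == 0.0:
            raise ValueError("basis vectors are linearly dependent")
        ortho.append([a / norm for a in w])
    return ortho


def init_basis(d, k, rng):
    """Random orthonormal starting basis: k vectors of dimension d."""
    raw = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]
    return orthonormalize(raw)


def oja_step(basis, x, eta):
    """One online update from a single sample x (Oja's rule, then re-orthonormalize).

    Only `basis` (the current iterate) and `x` (the latest sample) are needed.
    """
    updated = []
    for u in basis:
        y = sum(a * b for a, b in zip(u, x))  # projection of x onto u
        updated.append([a + eta * y * b for a, b in zip(u, x)])
    return orthonormalize(updated)


def sample_stream(n, scales, rng):
    """Hypothetical zero-mean synthetic stream: coordinate i has std scales[i]."""
    for _ in range(n):
        yield [s * rng.gauss(0.0, 1.0) for s in scales]


def run_demo(seed=0):
    """Stream 2000 samples whose variance is concentrated in the first two axes."""
    rng = random.Random(seed)
    scales = [3.0, 2.0, 0.1, 0.1, 0.1]
    basis = init_basis(len(scales), 2, rng)
    for x in sample_stream(2000, scales, rng):
        basis = oja_step(basis, x, eta=0.01)
    return basis


if __name__ == "__main__":
    for u in run_demo():
        print([round(c, 3) for c in u])
```

In this toy stream the dominant variance lies along the first two coordinates, so the returned basis vectors should have most of their weight there. A fixed step size is used for simplicity; a decaying schedule is a common alternative.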

Publication: arXiv e-prints
Pub Date: August 2018
DOI: 10.48550/arXiv.1808.02083
arXiv: arXiv:1808.02083
Bibcode: 2018arXiv180802083G
Keywords: Statistics - Computation; Computer Science - Machine Learning
E-Print: 9 pages, 4 figures

NASA/ADS
