Compressive Mining: Fast and Optimal Data Mining in the Compressed Domain

doi:10.48550/arXiv.1405.5873

Real-world data typically contain repeated and periodic patterns. This suggests that they can be effectively represented and compressed using only a few coefficients of an appropriate basis (e.g., Fourier or wavelets). However, distance estimation when the data are represented using different sets of coefficients is still a largely unexplored area. This work studies the optimization problems related to obtaining the tightest lower/upper bound on Euclidean distances when each data object is potentially compressed using a different set of orthonormal coefficients. Our technique leads to tighter distance estimates, which translate into more accurate search, learning and mining operations directly in the compressed domain. We formulate the problem of estimating lower/upper distance bounds as an optimization problem. We establish the properties of optimal solutions, and leverage the theoretical analysis to develop a fast algorithm to obtain an exact solution to the problem. The suggested solution provides the tightest estimation of the $L_2$-norm or the correlation. We show that typical data-analysis operations, such as k-NN search or k-Means clustering, can operate more accurately using the proposed compression and distance reconstruction technique. We compare it with many other prevalent compression and reconstruction techniques, including random projections and PCA-based techniques. We highlight a surprising result, namely that when the data are highly sparse in some basis, our technique may even outperform PCA-based compression. The contributions of this work are generic, as our methodology is applicable to any sequential or high-dimensional data as well as to any orthogonal data transformation used for the underlying data compression scheme.
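To make the idea of distance bounds in the compressed domain concrete, here is a minimal Python sketch. It is not the optimal algorithm the abstract describes: it is a simple baseline that uses only the triangle inequality, and it assumes the query is stored uncompressed while the other object keeps its k largest-magnitude coefficients plus the total energy of the discarded ones. The function names (`ortho_dft`, `compress`, `distance_bounds`) and the sample signals are hypothetical and chosen for illustration only.

```python
import cmath
import math


def ortho_dft(x):
    """Orthonormal DFT (scaled by 1/sqrt(n)), so Euclidean distances are preserved."""
    n = len(x)
    scale = 1.0 / math.sqrt(n)
    return [
        scale * sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def compress(coeffs, k):
    """Keep the k largest-magnitude coefficients plus the energy of the rest."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = {i: coeffs[i] for i in order[:k]}
    residual_energy = sum(abs(coeffs[i]) ** 2 for i in order[k:])
    return kept, residual_energy


def distance_bounds(query_coeffs, kept, residual_energy):
    """Baseline lower/upper bounds on the Euclidean distance to a compressed object."""
    # Exact contribution from positions where the compressed object is known.
    known = sum(abs(query_coeffs[i] - c) ** 2 for i, c in kept.items())
    # Norm of the query restricted to the positions the object discarded.
    q_rest = math.sqrt(
        sum(abs(q) ** 2 for i, q in enumerate(query_coeffs) if i not in kept)
    )
    # Norm of the object's discarded part, known from the stored energy.
    x_rest = math.sqrt(residual_energy)
    # Triangle inequality on the unknown part: |q_rest - x_rest| <= dist <= q_rest + x_rest.
    lower = math.sqrt(known + (q_rest - x_rest) ** 2)
    upper = math.sqrt(known + (q_rest + x_rest) ** 2)
    return lower, upper


# Illustrative signals (assumed, not taken from the paper).
x = [math.sin(2 * math.pi * t / 16) + 0.1 * (t % 3) for t in range(32)]
q = [math.sin(2 * math.pi * t / 16 + 0.4) for t in range(32)]

X, Q = ortho_dft(x), ortho_dft(q)
kept, energy = compress(X, 4)
lower, upper = distance_bounds(Q, kept, energy)
true_distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, q)))
print(lower, true_distance, upper)
```

The true distance always falls between the two printed bounds. The bounds studied in the paper are tighter than this baseline because they are obtained by solving an optimization problem rather than by a single triangle-inequality step, and they also cover the case where both objects are compressed with different coefficient sets.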

Publication: arXiv e-prints

Pub Date: May 2014

DOI: 10.48550/arXiv.1405.5873

arXiv: arXiv:1405.5873

Bibcode: 2014arXiv1405.5873V

Keywords: Statistics - Machine Learning; Computer Science - Data Structures and Algorithms; Computer Science - Information Theory

E-Print: 25 pages, 20 figures, accepted in VLDB
