Data stream fusion for accurate quantile tracking and analysis
Abstract
UDDSKETCH is a recent algorithm for accurate tracking of quantiles in data streams, derived from the DDSKETCH algorithm. UDDSKETCH provides accuracy guarantees covering the full range of quantiles independently of the input distribution and greatly improves the accuracy with regard to DDSKETCH. In this paper we show how to compress and fuse data streams (or datasets) by using UDDSKETCH data summaries that are fused into a new summary related to the union of the streams (or datasets) processed by the input summaries whilst preserving both the error and size guarantees provided by UDDSKETCH. This property of sketches, known as mergeability, enables parallel and distributed processing. We prove that UDDSKETCH is fully mergeable and introduce a parallel version of UDDSKETCH suitable for message-passing based architectures. We formally prove its correctness and compare it to a parallel version of DDSKETCH, showing through extensive experimental results that our parallel algorithm almost always outperforms the parallel DDSKETCH algorithm with regard to the overall accuracy in determining the quantiles.
- Publication:
-
arXiv e-prints
- Pub Date:
- January 2021
- DOI:
- 10.48550/arXiv.2101.06758
- arXiv:
- arXiv:2101.06758
- Bibcode:
- 2021arXiv210106758C
- Keywords:
-
- Computer Science - Data Structures and Algorithms;
- Computer Science - Databases;
- Computer Science - Distributed;
- Parallel;
- and Cluster Computing