Efficient estimation of the cardinality of large data sets
Abstract
F. Giroire has recently proposed an algorithm which returns the approximate number of distinct elements in a large sequence of words, under the strong constraints that arise in the analysis of large databases. His estimation is based on statistical properties of uniform random variables in $[0,1]$. In this note we propose an optimal estimation, using Kullback information and estimation theory.
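To illustrate the general idea of estimating cardinality from uniform random variables in $[0,1]$, here is a minimal sketch of a generic k-minimum-values estimator. It is not the algorithm of Giroire nor the optimal estimator proposed in this note. It assumes that hashing each word with SHA-256 yields values that behave like independent uniform draws, and the names `hash_to_unit`, `estimate_distinct` and the parameter `k` are hypothetical choices made for this example.

```python
import hashlib
import heapq


def hash_to_unit(word: str) -> float:
    """Map a word to a pseudo-uniform value in [0, 1] (assumption: SHA-256 acts as a uniform hash)."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def estimate_distinct(words, k: int = 256) -> float:
    """Estimate the number of distinct words from the k smallest distinct hash values."""
    heap = []        # max-heap (values negated) holding the k smallest hashes seen so far
    members = set()  # the same hashes, for fast duplicate detection
    for word in words:
        u = hash_to_unit(word)
        if u in members:
            continue  # repeated word: its hash is already recorded
        if len(heap) < k:
            heapq.heappush(heap, -u)
            members.add(u)
        elif u < -heap[0]:
            # u is smaller than the current k-th minimum: swap it in
            evicted = -heapq.heappushpop(heap, -u)
            members.discard(evicted)
            members.add(u)
    if len(heap) < k:
        # Fewer than k distinct hashes were seen, so the count is exact
        return float(len(heap))
    # For n uniform values, the k-th smallest is close to k / n,
    # so (k - 1) / (k-th minimum) serves as an estimate of n
    return (k - 1) / -heap[0]


if __name__ == "__main__":
    stream = (f"w{i % 10000}" for i in range(50000))
    print(round(estimate_distinct(stream)))
```

The memory used depends only on `k`, not on the length of the sequence, which is the kind of constraint the abstract refers to. A larger `k` reduces the variance of the estimate at the cost of more storage.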
- Publication: arXiv Mathematics e-prints
- Pub Date: January 2007
- arXiv: arXiv:math/0701347
- Bibcode: 2007math......1347C
- Keywords: Mathematics - Statistics Theory; Mathematics - Probability
- E-Print: Extended and improved version of the published paper