Near-Optimal Bounds for Testing Histogram Distributions
Abstract
We investigate the problem of testing whether a discrete probability distribution over an ordered domain is a histogram on a specified number of bins. One of the most common tools for the succinct approximation of data, $k$-histograms over $[n]$, are probability distributions that are piecewise constant over a set of $k$ intervals. The histogram testing problem is the following: Given samples from an unknown distribution $\mathbf{p}$ on $[n]$, we want to distinguish between the cases that $\mathbf{p}$ is a $k$-histogram versus $\varepsilon$-far from any $k$-histogram, in total variation distance. Our main result is a sample near-optimal and computationally efficient algorithm for this testing problem, and a nearly-matching (within logarithmic factors) sample complexity lower bound. Specifically, we show that the histogram testing problem has sample complexity $\widetilde \Theta (\sqrt{nk} / \varepsilon + k / \varepsilon^2 + \sqrt{n} / \varepsilon^2)$.
- Publication:
-
arXiv e-prints
- Pub Date:
- July 2022
- DOI:
- 10.48550/arXiv.2207.06596
- arXiv:
- arXiv:2207.06596
- Bibcode:
- 2022arXiv220706596C
- Keywords:
-
- Computer Science - Data Structures and Algorithms;
- Computer Science - Machine Learning;
- Mathematics - Statistics Theory