Data efficiency, dimensionality reduction, and the generalized symmetric information bottleneck
Abstract
The Symmetric Information Bottleneck (SIB), an extension of the more familiar Information Bottleneck, is a dimensionality reduction technique that simultaneously compresses two random variables while preserving the information shared between their compressed versions. We introduce the Generalized Symmetric Information Bottleneck (GSIB), which explores different functional forms of the cost of such simultaneous reduction. We then explore the dataset size requirements of such simultaneous compression by deriving bounds and root-mean-squared estimates of the statistical fluctuations of the loss functions involved. We show that, in typical situations, simultaneous GSIB compression requires qualitatively less data to achieve the same error than compressing the variables one at a time. We suggest that this is an example of a more general principle: simultaneous compression is more data efficient than independent compression of each of the input variables.
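As a rough illustration of what "simultaneous compression" means here, the sketch below evaluates one commonly written form of the SIB objective, L = I(X; Z_X) + I(Y; Z_Y) - β I(Z_X; Z_Y), for discrete variables. The abstract does not state the loss, so this functional form, the function names `mutual_information` and `sib_loss`, and the encoder matrices `enc_x` and `enc_y` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def mutual_information(p_joint):
    """Mutual information (in bits) of a discrete joint distribution."""
    p = np.asarray(p_joint, dtype=float)
    p = p / p.sum()
    # Product of the marginals, same shape as the joint table.
    p_indep = p.sum(axis=1, keepdims=True) @ p.sum(axis=0, keepdims=True)
    mask = p > 0  # zero-probability cells contribute nothing
    return float(np.sum(p[mask] * np.log2(p[mask] / p_indep[mask])))

def sib_loss(p_xy, enc_x, enc_y, beta):
    """Assumed SIB objective: I(X;Z_X) + I(Y;Z_Y) - beta * I(Z_X;Z_Y).

    enc_x[i, a] = p(z_x = a | x = i), enc_y[j, b] = p(z_y = b | y = j)
    (hypothetical stochastic encoders; each row sums to 1).
    """
    p_xy = np.asarray(p_xy, dtype=float)
    p_xy = p_xy / p_xy.sum()
    enc_x = np.asarray(enc_x, dtype=float)
    enc_y = np.asarray(enc_y, dtype=float)
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    p_x_zx = p_x[:, None] * enc_x       # joint of (X, Z_X)
    p_y_zy = p_y[:, None] * enc_y       # joint of (Y, Z_Y)
    p_zx_zy = enc_x.T @ p_xy @ enc_y    # joint of (Z_X, Z_Y) via Z_X - X - Y - Z_Y
    compression = mutual_information(p_x_zx) + mutual_information(p_y_zy)
    return compression - beta * mutual_information(p_zx_zy)

# Toy case: X and Y are perfectly correlated binary variables.
p_xy = np.eye(2) / 2
identity = np.eye(2)                       # keep everything
collapse = np.array([[1.0, 0.0], [1.0, 0.0]])  # map everything to one cluster
loss_keep = sib_loss(p_xy, identity, identity, beta=3.0)
loss_drop = sib_loss(p_xy, collapse, collapse, beta=3.0)
```

In this toy case, keeping both variables uncompressed costs two bits of compression but retains one bit of shared information, so with β = 3 it scores lower (better) than collapsing both variables to a single cluster. The data-efficiency question in the abstract concerns how such a loss fluctuates when `p_xy` is estimated from a finite sample rather than known exactly.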
- Publication: arXiv e-prints
- Pub Date: September 2023
- DOI: 10.48550/arXiv.2309.05649
- arXiv: arXiv:2309.05649
- Bibcode: 2023arXiv230905649M
- Keywords: Computer Science - Information Theory; Condensed Matter - Statistical Mechanics; Computer Science - Machine Learning; Physics - Data Analysis, Statistics and Probability