Solving Large-Scale Sparse PCA to Certifiable (Near) Optimality
Abstract
Sparse principal component analysis (PCA) is a popular dimensionality reduction technique for obtaining principal components which are linear combinations of a small subset of the original features. Existing approaches cannot supply certifiably optimal principal components with more than $p=100$s of variables. By reformulating sparse PCA as a convex mixed-integer semidefinite optimization problem, we design a cutting-plane method which solves the problem to certifiable optimality at the scale of selecting $k=5$ covariates from $p=300$ variables, and provides small bound gaps at a larger scale. We also propose a convex relaxation and greedy rounding scheme that provides bound gaps of $1-2\%$ in practice within minutes for $p=100$s or hours for $p=1,000$s and is therefore a viable alternative to the exact method at scale. Using real-world financial and medical datasets, we illustrate our approach's ability to derive interpretable principal components tractably at scale.
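To make the notions of "certifiable optimality" and "bound gap" concrete, the following is a minimal sketch on a toy problem. It is not the paper's cutting-plane method or its convex relaxation: it assumes the standard sparse PCA objective (maximize $x^\top \Sigma x$ subject to $\|x\|_2 = 1$ and at most $k$ nonzero entries), certifies the optimum by brute-force enumeration of supports, and compares it against a simple forward-greedy heuristic. The function names `exact_sparse_pca` and `greedy_sparse_pca` and the synthetic data are hypothetical, chosen only for illustration.

```python
import itertools
import numpy as np


def exact_sparse_pca(sigma, k):
    """Brute force: the optimal value is the largest leading eigenvalue
    over all k x k principal submatrices of sigma (toy sizes only)."""
    p = sigma.shape[0]
    best_val, best_support = -np.inf, None
    for support in itertools.combinations(range(p), k):
        idx = list(support)
        # eigvalsh returns eigenvalues in ascending order; take the largest.
        val = np.linalg.eigvalsh(sigma[np.ix_(idx, idx)])[-1]
        if val > best_val:
            best_val, best_support = val, support
    return best_val, best_support


def greedy_sparse_pca(sigma, k):
    """Illustrative heuristic (an assumption, not the paper's rounding scheme):
    repeatedly add the index that most increases the leading eigenvalue."""
    p = sigma.shape[0]
    support = []
    for _ in range(k):
        candidates = [j for j in range(p) if j not in support]
        scores = [
            np.linalg.eigvalsh(sigma[np.ix_(support + [j], support + [j])])[-1]
            for j in candidates
        ]
        support.append(candidates[int(np.argmax(scores))])
    idx = sorted(support)
    val = np.linalg.eigvalsh(sigma[np.ix_(idx, idx)])[-1]
    return val, tuple(idx)


# Synthetic covariance matrix (hypothetical data, p = 10 features).
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
sigma = A.T @ A / 50

exact_val, exact_support = exact_sparse_pca(sigma, k=3)
greedy_val, greedy_support = greedy_sparse_pca(sigma, k=3)

# Relative gap between the certified optimum and the heuristic solution.
gap = (exact_val - greedy_val) / exact_val
print(f"exact support {exact_support}, greedy support {greedy_support}, gap {gap:.2%}")
```

Enumeration visits every one of the $\binom{p}{k}$ supports, so it is only feasible for very small $p$ and $k$; at the abstract's scale of $k=5$ from $p=300$ that is already about $2 \times 10^{10}$ subsets, which is the motivation for bounding techniques that certify optimality without checking each one.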
- Publication: arXiv e-prints
- Pub Date: May 2020
- DOI: 10.48550/arXiv.2005.05195
- arXiv: arXiv:2005.05195
- Bibcode: 2020arXiv200505195B
- Keywords: Mathematics - Optimization and Control; Computer Science - Machine Learning; Mathematics - Statistics Theory; Statistics - Computation
- E-Print: Revision submitted to JMLR