An Improved Approximation Algorithm for the Column Subset Selection Problem
Abstract
We consider the problem of selecting the best subset of exactly $k$ columns from an $m \times n$ matrix $A$. We present and analyze a novel two-stage algorithm that runs in $O(\min\{mn^2,m^2n\})$ time and returns as output an $m \times k$ matrix $C$ consisting of exactly $k$ columns of $A$. In the first (randomized) stage, the algorithm randomly selects $\Theta(k \log k)$ columns according to a judiciously-chosen probability distribution that depends on information in the top-$k$ right singular subspace of $A$. In the second (deterministic) stage, the algorithm applies a deterministic column-selection procedure to select and return exactly $k$ columns from the set of columns selected in the first stage. Let $C$ be the $m \times k$ matrix containing those $k$ columns, let $P_C$ denote the projection matrix onto the span of those columns, and let $A_k$ denote the best rank-$k$ approximation to the matrix $A$. Then, we prove that, with probability at least 0.8, $$ \FNorm{A - P_CA} \leq \Theta(k \log^{1/2} k) \FNorm{A-A_k}. $$ This Frobenius norm bound is only a factor of $\sqrt{k \log k}$ worse than the best previously existing existential result and is roughly $O(\sqrt{k!})$ better than the best previous algorithmic result for the Frobenius norm version of this Column Subset Selection Problem (CSSP). We also prove that, with probability at least 0.8, $$ \TNorm{A - P_CA} \leq \Theta(k \log^{1/2} k)\TNorm{A-A_k} + \Theta(k^{3/4}\log^{1/4}k)\FNorm{A-A_k}. $$ This spectral norm bound is not directly comparable to the best previously existing bounds for the spectral norm version of this CSSP. Our bound depends on $\FNorm{A-A_k}$, whereas previous results depend on $\sqrt{n-k}\TNorm{A-A_k}$; if these two quantities are comparable, then our bound is asymptotically worse by a $(k \log k)^{1/4}$ factor.
- Publication:
-
arXiv e-prints
- Pub Date:
- December 2008
- DOI:
- 10.48550/arXiv.0812.4293
- arXiv:
- arXiv:0812.4293
- Bibcode:
- 2008arXiv0812.4293B
- Keywords:
-
- Computer Science - Data Structures and Algorithms
- E-Print:
- 17 pages