Dimensionality Reduction of Vector Space Model for Information Retrieval using Simple Principal Component Analysis
Abstract
In this paper, we propose using Simple Principal Component Analysis (SPCA) for dimensionality reduction of the vector space model for information retrieval. SPCA is a fast, data-oriented method that does not require computing the variance-covariance matrix. Because SPCA estimates the principal components iteratively, we also propose a criterion for determining convergence; with this criterion, the optimum number of iterations for each principal component can be determined. Experimentally, we show that the SPCA-based method improves on the conventional SVD-based method despite requiring little computation. This advantage can be attributed to SPCA's iterative procedure, which resembles clustering methods such as k-means clustering. Moreover, the proposed method, which orthogonalizes the basis vectors, achieved much higher accuracy than the conventional random projection method based on k-means clustering.
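The iterative estimation described in the abstract can be illustrated with a minimal sketch. It assumes the sign-based update commonly given for Simple PCA (each sample votes for the current direction or its opposite) followed by deflation, which keeps the basis vectors orthogonal. The abstract does not state the paper's convergence criterion, so the stopping rule below (change in direction under a tolerance) is only a stand-in, and the names `simple_pca`, `max_iter`, and `tol` are hypothetical.

```python
import numpy as np

def simple_pca(X, n_components, max_iter=100, tol=1e-6, seed=0):
    """Estimate orthonormal basis vectors iteratively, without a covariance matrix.

    X is an (n_samples, n_features) array, e.g. document vectors.
    The stopping rule is an assumed placeholder, not the paper's criterion.
    """
    rng = np.random.default_rng(seed)
    R = X - X.mean(axis=0)               # centre the data; R is the working residual
    basis = []
    for _ in range(n_components):
        a = rng.standard_normal(R.shape[1])
        a /= np.linalg.norm(a)            # random unit-length starting direction
        for _ in range(max_iter):
            # Each sample contributes +x or -x depending on which side of a it lies.
            signs = np.where(R @ a >= 0, 1.0, -1.0)
            new_a = signs @ R
            norm = np.linalg.norm(new_a)
            if norm == 0:                 # residual is empty; nothing left to estimate
                break
            new_a /= norm
            converged = 1.0 - abs(new_a @ a) < tol   # assumed stopping rule
            a = new_a
            if converged:
                break
        basis.append(a)
        R = R - np.outer(R @ a, a)        # deflation: remove the found component
    return np.array(basis)
```

Under these assumptions, a reduced representation of the centred data would be obtained as `(X - X.mean(axis=0)) @ simple_pca(X, k).T` for a chosen number of components `k`. Because every new direction is a signed sum of deflated samples, it is orthogonal to the directions already found.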
- Publication: IEEJ Transactions on Electronics, Information and Systems
- Pub Date: 2005
- DOI:
- Bibcode: 2005ITEIS.125.1773K
- Keywords: Information retrieval; Vector Space Model; Dimensionality Reduction; Simple PCA; clustering