Pruned Wasserstein Index Generation Model and wigpy Package
Abstract
The recently proposed Wasserstein Index Generation model (WIG) offers a new direction for automatically generating indices. However, it is challenging to fit large datasets in practice, for two reasons. First, the Sinkhorn distance is notoriously expensive to compute and suffers severely from dimensionality. Second, the model requires a full $N\times N$ matrix to be fit into memory, where $N$ is the size of the vocabulary; when the dimensionality is too large, the computation becomes infeasible altogether. I hereby propose a Lasso-based shrinkage method that reduces the dimensionality of the vocabulary as a preprocessing step prior to fitting the WIG model. After obtaining word embeddings from a Word2Vec model, we cluster these high-dimensional vectors by $k$-means clustering and pick the most frequent tokens within each cluster to form the "base vocabulary". Non-base tokens are then regressed on the vectors of the base tokens to obtain transformation weights, so that the whole vocabulary can be represented by the base tokens alone. This variant, called pruned WIG (pWIG), enables us to shrink the vocabulary dimension at will while still achieving high accuracy. I also provide a \textit{wigpy} module in Python to carry out computation in both flavors. An application to the Economic Policy Uncertainty (EPU) index is showcased as a comparison with existing methods of generating time-series sentiment indices.
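The pruning step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the wigpy implementation: the random embeddings and token frequencies stand in for real Word2Vec output, and the cluster count and Lasso penalty are arbitrary placeholder values.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, d, k = 50, 16, 5                    # vocabulary size, embedding dim, clusters (illustrative)
vectors = rng.normal(size=(N, d))      # stand-in for Word2Vec embeddings
freqs = rng.integers(1, 100, size=N)   # stand-in token frequencies

# Cluster the high-dimensional word vectors with k-means.
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)

# The most frequent token within each cluster forms the "base vocabulary".
base_idx = [max(np.flatnonzero(labels == c), key=lambda i: freqs[i])
            for c in range(k)]
base = vectors[base_idx]               # shape (k, d)

# Regress each non-base token's vector on the base vectors (Lasso shrinkage),
# so every token is represented as a sparse combination of base tokens.
weights = np.zeros((N, k))
for i in range(N):
    if i in base_idx:
        weights[i, base_idx.index(i)] = 1.0
    else:
        weights[i] = Lasso(alpha=0.1).fit(base.T, vectors[i]).coef_

print(weights.shape)                   # whole vocabulary expressed via k base tokens
```

The resulting $N\times k$ weight matrix replaces the full vocabulary, so the downstream Sinkhorn computation only needs the $k$ base tokens.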
Publication: arXiv e-prints
Pub Date: March 2020
DOI: 10.48550/arXiv.2004.00999
arXiv: arXiv:2004.00999
Bibcode: 2020arXiv200400999X
Keywords: Computer Science - Machine Learning; Computer Science - Computation and Language; Economics - General Economics
E-Print: fix typos and errors