Finding Structure in Text, Genome and Other Symbolic Sequences
Abstract
The statistical methods derived and described in this thesis provide new ways to elucidate the structural properties of text and other symbolic sequences. Generically, these methods support three tasks: detecting a difference in the frequency of a single feature, detecting a difference between the frequencies of an ensemble of features, and attributing the source of a text. These three abstract tasks suffice to solve problems in a wide variety of settings. Furthermore, the techniques described in this thesis can be extended to provide a wide range of additional tests beyond the ones described here. A variety of applications for these methods are examined in detail. These applications are drawn from the areas of text analysis and genetic sequence analysis. The textually oriented tasks include finding interesting collocations and co-occurring phrases, language identification, and information retrieval. The biologically oriented tasks include species identification and the discovery of previously unreported long-range structure in genes. In the applications reported here where direct comparison is possible, the performance of these new methods substantially exceeds the state of the art. Overall, the methods described here provide new and effective ways to analyse text and other symbolic sequences. Their particular strength is that they deal well with situations where relatively little data are available. Since these methods are abstract in nature, they can be applied in novel situations with relative ease.
- Publication: arXiv e-prints
- Pub Date: July 2012
- DOI: 10.48550/arXiv.1207.1847
- arXiv: arXiv:1207.1847
- Bibcode: 2012arXiv1207.1847D
- Keywords: Computer Science - Computation and Language; Computer Science - Information Retrieval
- E-Print: ~ 176 pages, many figures