Statistical Inference of a canonical dictionary of protein substructural fragments
Abstract
Proteins are biomolecules of life. They fold into a great variety of three-dimensional (3D) shapes. Underlying these folding patterns are many recurrent structural fragments or building blocks (analogous to "LEGO bricks"). This paper reports an innovative statistical inference approach to discover a comprehensive dictionary of protein structural building blocks from a large corpus of experimentally determined protein structures. Our approach is built on the Bayesian and information-theoretic criterion of minimum message length. To the best of our knowledge, this work is the first systematic and rigorous treatment of a very important data mining problem that arises in the cross-disciplinary area of structural bioinformatics. The quality of the dictionary we find is demonstrated by its explanatory power: any protein within the corpus of known 3D structures can be dissected into successive regions assigned to fragments from this dictionary. This induces a novel one-dimensional representation of three-dimensional protein folding patterns, suitable for application of the rich repertoire of character-string processing algorithms, for rapid identification of folding patterns of newly-determined structures. This paper presents the details of the methodology used to infer the dictionary of building blocks, and is supported by illustrative examples to demonstrate its effectiveness and utility.
- Publication: arXiv e-prints
- Pub Date: October 2013
- DOI: 10.48550/arXiv.1310.1462
- arXiv: arXiv:1310.1462
- Bibcode: 2013arXiv1310.1462K
- Keywords: Quantitative Biology - Quantitative Methods; Quantitative Biology - Biomolecules
- E-Print: 17 pages, 3 Figures (Accepted for publication as a short paper in the 'The thirteenth International Conference on Data Mining (ICDM '13