Indexing Schemes for Similarity Search In Datasets of Short Protein Fragments

doi:10.48550/arXiv.cs/0309005

We propose a family of very efficient hierarchical indexing schemes for ungapped, score matrix-based similarity search in large datasets of short (4-12 amino acid) protein fragments. This type of similarity search is important both as a building block for more complex algorithms and for possible direct use in biological investigations, where datasets are of the order of 60 million objects. Our scheme is based on the internal geometry of the amino acid alphabet and performs exceptionally well: for example, it outputs the 100 nearest neighbours of any possible fragment of length 10 after scanning, on average, less than one per cent of the entire dataset.
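To make the search problem concrete, the following is a minimal sketch of ungapped, score matrix-based similarity between equal-length fragments, together with the brute-force linear scan that an index is meant to avoid. It is not the indexing scheme proposed in the paper. The score table `TOY_SCORES` and the function names are hypothetical and chosen only for illustration; a real application would presumably use a standard substitution matrix over all 20 amino acids.

```python
import heapq

# Hypothetical toy scores for four residues only (assumption for illustration);
# this is NOT a real substitution matrix.
TOY_SCORES = {
    ("A", "A"): 4, ("L", "L"): 4, ("I", "I"): 4, ("K", "K"): 5,
    ("L", "I"): 2, ("A", "L"): -1, ("A", "I"): -1,
    ("A", "K"): -1, ("L", "K"): -2, ("I", "K"): -3,
}


def pair_score(a, b):
    """Look up the score of a residue pair, treating the table as symmetric."""
    if (a, b) in TOY_SCORES:
        return TOY_SCORES[(a, b)]
    return TOY_SCORES[(b, a)]


def ungapped_score(query, fragment):
    """Sum position-wise scores; ungapped means no insertions or deletions."""
    if len(query) != len(fragment):
        raise ValueError("ungapped comparison needs equal-length fragments")
    return sum(pair_score(a, b) for a, b in zip(query, fragment))


def top_k_linear_scan(query, dataset, k):
    """Baseline: score every fragment and return the k highest-scoring ones."""
    return heapq.nlargest(k, dataset, key=lambda f: ungapped_score(query, f))


fragments = ["KKKK", "AILK", "ALIK", "AAAA"]
best = top_k_linear_scan("ALIK", fragments, 2)
```

The linear scan touches every fragment for every query, which is the cost an indexing scheme aims to reduce when the dataset holds tens of millions of fragments.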

Publication:

arXiv e-prints

Pub Date:

September 2003

DOI:

10.48550/arXiv.cs/0309005

arXiv:

arXiv:cs/0309005

Bibcode:

2003cs........9005S

Keywords:

Computer Science - Data Structures and Algorithms;
Quantitative Biology - Biomolecules;
H.3.1;
J.3

E-Print:

34 pages, 12 figures, 4 tables - Timings for experiments added upon referees' request, and a number of less substantial modifications made
