Fast Searching in Packed Strings

doi:10.48550/arXiv.0907.3135

Fast Searching in Packed Strings

Bille, Philip

Given strings $P$ and $Q$ the (exact) string matching problem is to find all positions of substrings in $Q$ matching $P$. The classical Knuth-Morris-Pratt algorithm [SIAM J. Comput., 1977] solves the string matching problem in linear time which is optimal if we can only read one character at the time. However, most strings are stored in a computer in a packed representation with several characters in a single word, giving us the opportunity to read multiple characters simultaneously. In this paper we study the worst-case complexity of string matching on strings given in packed representation. Let $m \leq n$ be the lengths $P$ and $Q$, respectively, and let $\sigma$ denote the size of the alphabet. On a standard unit-cost word-RAM with logarithmic word size we present an algorithm using time $$ O\left(\frac{n}{\log_\sigma n} + m + \occ\right). $$ Here $\occ$ is the number of occurrences of $P$ in $Q$. For $m = o(n)$ this improves the $O(n)$ bound of the Knuth-Morris-Pratt algorithm. Furthermore, if $m = O(n/\log_\sigma n)$ our algorithm is optimal since any algorithm must spend at least $\Omega(\frac{(n+m)\log \sigma}{\log n} + \occ) = \Omega(\frac{n}{\log_\sigma n} + \occ)$ time to read the input and report all occurrences. The result is obtained by a novel automaton construction based on the Knuth-Morris-Pratt algorithm combined with a new compact representation of subautomata allowing an optimal tabulation-based simulation.

Publication:

arXiv e-prints

Pub Date:

July 2009

DOI:

10.48550/arXiv.0907.3135

arXiv:

arXiv:0907.3135

Bibcode:

2009arXiv0907.3135B

Keywords:

Computer Science - Data Structures and Algorithms

E-Print:

To appear in Journal of Discrete Algorithms. Special Issue on CPM 2009

NASA/ADS

Fast Searching in Packed Strings

Abstract