The k-mismatch problem revisited
Abstract
We revisit the complexity of one of the most basic problems in pattern matching. In the k-mismatch problem we must compute the Hamming distance between a pattern of length m and every m-length substring of a text of length n, as long as that Hamming distance is at most k. Where the Hamming distance is greater than k at some alignment of the pattern and text, we simply output "No". We study this problem in both the standard offline setting and also as a streaming problem. In the streaming k-mismatch problem the text arrives one symbol at a time and we must give an output before processing any future symbols. Our main results are as follows: 1) Our first result is a deterministic $O(n k^2\log{k} / m+n \text{polylog} m)$ time offline algorithm for k-mismatch on a text of length n. This is a factor of k improvement over the fastest previous result of this form from SODA 2000 by Amihood Amir et al. 2) We then give a randomised and online algorithm which runs in the same time complexity but requires only $O(k^2\text{polylog} {m})$ space in total. 3) Next we give a randomised $(1+\epsilon)$-approximation algorithm for the streaming k-mismatch problem which uses $O(k^2\text{polylog} m / \epsilon^2)$ space and runs in $O(\text{polylog} m / \epsilon^2)$ worst-case time per arriving symbol. 4) Finally we combine our new results to derive a randomised $O(k^2\text{polylog} {m})$ space algorithm for the streaming k-mismatch problem which runs in $O(\sqrt{k}\log{k} + \text{polylog} {m})$ worst-case time per arriving symbol. This improves the best previous space complexity for streaming k-mismatch from FOCS 2009 by Benny Porat and Ely Porat by a factor of k. We also improve the time complexity of this previous result by an even greater factor to match the fastest known offline algorithm (up to logarithmic factors).
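To make the problem statement concrete, the following is a minimal brute-force sketch of the output the k-mismatch problem asks for. It is not any of the algorithms announced in the abstract: it takes O(nm) time offline, and its streaming variant stores the last m symbols rather than the $O(k^2\text{polylog} {m})$ space claimed above. The function names `hamming_capped`, `k_mismatch_naive` and `k_mismatch_stream` are hypothetical and chosen only for illustration.

```python
from collections import deque


def hamming_capped(window, pattern, k):
    """Return the Hamming distance if it is at most k, otherwise "No"."""
    mismatches = 0
    for a, b in zip(window, pattern):
        if a != b:
            mismatches += 1
            if mismatches > k:
                # Distance exceeds k: the exact value is not required.
                return "No"
    return mismatches


def k_mismatch_naive(text, pattern, k):
    """Offline baseline: one output per alignment of pattern against text."""
    m = len(pattern)
    return [
        hamming_capped(text[i:i + m], pattern, k)
        for i in range(len(text) - m + 1)
    ]


def k_mismatch_stream(symbols, pattern, k):
    """Streaming baseline: emit an output as each symbol arrives.

    Assumption for illustration only: the last m symbols are kept in a
    buffer, so this uses O(m) space and O(m) time per arriving symbol.
    """
    m = len(pattern)
    window = deque(maxlen=m)
    for s in symbols:
        window.append(s)
        if len(window) == m:
            # Output is produced before any future symbol is read.
            yield hamming_capped(window, pattern, k)
```

For example, `k_mismatch_naive("abcabd", "abd", 1)` reports one mismatch at the first alignment, "No" at the next two, and an exact match at the last; `k_mismatch_stream` yields the same sequence one value at a time.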
Publication: arXiv e-prints
Pub Date: August 2015
arXiv: 1508.00731
Bibcode: 2015arXiv150800731C
Keywords: Computer Science - Data Structures and Algorithms