Lyndon Words and Short Superstrings
Abstract
In the Shortest-Superstring problem, we are given a set of strings S and want to find a string of minimum length that contains every string in S as a substring. This is a classical problem in approximation algorithms, and the best known approximation factor is 2 1/2, given by Sweedyk in 1999. No improvement has been made since then; however, two other approaches yielding 2 1/2-approximation algorithms have been proposed, by Kaplan et al. and recently by Paluch et al., both based on a reduction to maximum asymmetric TSP path (Max-ATSP-Path) and on structural results of Breslauer et al. In this paper we give an algorithm that achieves an approximation ratio of 2 11/23, breaking through the longstanding bound of 2 1/2. We use the standard reduction of Shortest-Superstring to Max-ATSP-Path. The new, somewhat surprising, algorithmic idea is to take the better of the two solutions obtained by using (a) the currently best 2/3-approximation algorithm for Max-ATSP-Path and (b) a naive cycle-cover-based 1/2-approximation algorithm. To prove that this indeed results in an improvement, we further develop a theory of string overlaps, extending the results of Breslauer et al. This theory is based on a novel use of Lyndon words as a substitute for the generic unbordered rotations and critical factorizations used by Breslauer et al.
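As an illustration only, the following minimal Python sketch shows the two basic ingredients named in the abstract: the overlap weights behind the standard reduction to a maximum-weight path problem, and the lexicographically smallest rotation of a word, which for a primitive word (one that is not a power of a shorter word) is its Lyndon word. The function names are hypothetical and are not taken from the paper, the input is assumed to contain no string that is a substring of another, and none of the paper's approximation algorithms is implemented here.

```python
from typing import Dict, List, Tuple


def overlap(s: str, t: str) -> int:
    """Length of the longest proper suffix of s that is a proper prefix of t."""
    for k in range(min(len(s), len(t)) - 1, 0, -1):
        if s.endswith(t[:k]):
            return k
    return 0


def overlap_graph(strings: List[str]) -> Dict[Tuple[int, int], int]:
    """Edge weights of the directed overlap graph (hypothetical helper).

    A path visiting every vertex with total weight W corresponds to a
    superstring of length sum(len(s)) - W, so maximizing the path weight
    minimizes the superstring length.
    """
    return {
        (i, j): overlap(s, t)
        for i, s in enumerate(strings)
        for j, t in enumerate(strings)
        if i != j
    }


def merge_along_path(strings: List[str], order: List[int]) -> str:
    """Build the superstring obtained by overlapping strings in the given order."""
    result = strings[order[0]]
    for prev, cur in zip(order, order[1:]):
        k = overlap(strings[prev], strings[cur])
        result += strings[cur][k:]
    return result


def smallest_rotation(w: str) -> str:
    """Lexicographically smallest rotation of w (quadratic time, for clarity)."""
    return min(w[i:] + w[:i] for i in range(len(w)))


if __name__ == "__main__":
    S = ["abc", "bcd", "cde"]
    weights = overlap_graph(S)
    path = [0, 1, 2]
    print(merge_along_path(S, path))
    print(sum(weights[(a, b)] for a, b in zip(path, path[1:])))
    print(smallest_rotation("bca"))
```

The brute-force rotation search is chosen for readability; linear-time methods for finding the smallest rotation exist but are not needed to see the idea.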
 Publication:

arXiv e-prints
 Pub Date:
 May 2012
 arXiv:
 arXiv:1205.6787
 Bibcode:
 2012arXiv1205.6787M
 Keywords:

 Computer Science - Data Structures and Algorithms