Representing the suffix tree with the CDAWG
Abstract
Given a string $T$, it is known that its suffix tree can be represented using the compact directed acyclic word graph (CDAWG) with $e_T$ arcs, taking overall $O(e_T+e_{\overline{T}})$ words of space, where ${\overline{T}}$ is the reverse of $T$, and supporting some key operations in time between $O(1)$ and $O(\log{\log{n}})$ in the worst case. This representation is especially appealing for highly repetitive strings, like collections of similar genomes or of version-controlled documents, in which $e_T$ grows sublinearly in the length of $T$ in practice. In this paper we augment such representation, supporting a number of additional queries in worst-case time between $O(1)$ and $O(\log{n})$ in the RAM model, without increasing space complexity asymptotically. Our technique, based on a heavy path decomposition of the suffix tree, enables also a representation of the suffix array, of the inverse suffix array, and of $T$ itself, that takes $O(e_T)$ words of space, and that supports random access in $O(\log{n})$ time. Furthermore, we establish a connection between the reversed CDAWG of $T$ and a context-free grammar that produces $T$ and only $T$, which might have independent interest.
- Publication:
-
arXiv e-prints
- Pub Date:
- May 2017
- DOI:
- 10.48550/arXiv.1705.08640
- arXiv:
- arXiv:1705.08640
- Bibcode:
- 2017arXiv170508640B
- Keywords:
-
- Computer Science - Data Structures and Algorithms
- E-Print:
- 16 pages, 1 figure. Presented at the 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017)