Metric Dimension of Hamming Graphs and Applications to Computational Biology
Abstract
Genetic sequencing has become an increasingly affordable and accessible source of genomic data in computational biology. This data is often represented as $k$-mers, i.e., strings of some fixed length $k$ with symbols chosen from a reference alphabet. In contrast, some of the most effective and well-studied machine learning algorithms require numerical representations of the data. The concept of metric dimension of the so-called Hamming graphs presents a promising way to address this issue. A subset of vertices in a graph is said to be resolving when the distances to those vertices uniquely characterize every vertex in the graph. The metric dimension of a graph is the size of a smallest resolving subset of vertices. Finding the metric dimension of a general graph is a challenging problem, NP-complete in fact. Recently, an efficient algorithm for finding resolving sets in Hamming graphs has been proposed, which suffices to uniquely embed $k$-mers into a real vector space. Since the dimension of the embedding is the cardinality of the associated resolving set, determining whether or not a node can be removed from a resolving set while keeping it resolving is of great interest. This can be quite challenging for large graphs since only a brute-force approach is known for checking whether a set is a resolving set or not. In this thesis, we characterize resolvability of Hamming graphs in terms of a linear system over a finite domain: a set of nodes is resolving if and only if the linear system has only a trivial solution over said domain. We can represent the domain as the roots of a polynomial system so the apparatus of Gröbner bases comes in handy to determine, whether or not a set of nodes is resolving. As proof of concept, we study the resolvability of Hamming graphs associated with octapeptides i.e. proteins sequences of length eight.
- Publication:
-
arXiv e-prints
- Pub Date:
- July 2020
- DOI:
- 10.48550/arXiv.2007.01337
- arXiv:
- arXiv:2007.01337
- Bibcode:
- 2020arXiv200701337L
- Keywords:
-
- Mathematics - Combinatorics;
- Quantitative Biology - Quantitative Methods