Phase transition in the computational complexity of the shortest common superstring and genome assembly
Abstract
Genome assembly, the process of reconstructing a long genetic sequence by aligning and merging short fragments, or reads, is known to be NP-hard, either as a version of the shortest common superstring problem or in a Hamiltonian-cycle formulation. That is, the computing time is believed to grow exponentially with the problem size in the worst case. Despite this fact, high-throughput technologies and modern algorithms currently allow bioinformaticians to handle datasets of billions of reads. Using methods from statistical mechanics, we address this conundrum by demonstrating the existence of a phase transition in the computational complexity of the problem and showing that practical instances always fall in the "easy" phase (solvable by polynomial-time algorithms). In addition, we propose a Markov-chain Monte Carlo method that outperforms common deterministic algorithms in the hard regime.
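As a purely illustrative aside, the shortest common superstring formulation mentioned above can be sketched with the textbook greedy-merge heuristic: repeatedly join the two reads with the largest suffix-prefix overlap. This is not claimed to be the algorithm studied or proposed in the paper. The function names `overlap` and `greedy_superstring` and the toy reads are assumptions introduced here, and the heuristic is not guaranteed to return the shortest superstring in general.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(reads: list[str]) -> str:
    """Greedy heuristic for a common superstring (hypothetical helper).

    Returns a string containing every read as a substring. The result is
    not guaranteed to be the shortest possible superstring.
    """
    # Drop exact duplicates, then reads fully contained in another read.
    reads = list(dict.fromkeys(reads))
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]

    while len(reads) > 1:
        # Find the ordered pair (i, j) with the largest overlap.
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Merge the pair, keeping the overlapping part only once.
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)

    return reads[0] if reads else ""


if __name__ == "__main__":
    toy_reads = ["ATGC", "GCAT", "CATT"]  # made-up reads, not real data
    print(greedy_superstring(toy_reads))
```

Each merge step costs a quadratic scan over the remaining reads, so this sketch is only suitable for toy inputs. It is meant to make the problem statement concrete, not to reflect how production assemblers work.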
- Publication: Physical Review E
- Pub Date: January 2024
- DOI: 10.1103/PhysRevE.109.014133
- arXiv: arXiv:2210.09986
- Bibcode: 2024PhRvE.109a4133F
- Keywords: Condensed Matter - Statistical Mechanics; Computer Science - Computational Complexity; Physics - Biological Physics; Quantitative Biology - Genomics
- E-Print: 9 pages, 5 figures. Version accepted for publication in Phys. Rev. E