Motif Caller: Sequence Reconstruction for Motif-Based DNA Storage
Abstract
DNA data storage is rapidly gaining traction as a long-term data archival solution, primarily due to its exceptional durability. Retrieving stored data relies on DNA sequencing, which involves a process called basecalling -- a typically costly and slow task that uses machine learning to map raw sequencing signals back to individual DNA bases (which are then translated into digital bits to recover the data). Current models for basecalling have been optimized for reading individual bases. However, with the advent of novel DNA synthesis methods tailored for data storage, there is significant potential for optimizing the reading process. In this paper, we focus on Motif-based DNA synthesis, where sequences are constructed from motifs -- groups of bases -- rather than individual bases. To enable efficient reading of data stored in DNA using Motif-based DNA synthesis, we designed Motif Caller, a machine learning model built to detect entire motifs within a DNA sequence, rather than individual bases. Motifs can also be detected from individually identified bases using a basecaller and then searching for motifs, however, such an approach is unnecessarily complex and slow. Building a machine learning model that directly identifies motifs allows to avoid the additional step of searching for motifs. It also makes use of the greater amount of features per motif, thus enabling finding the motifs with higher accuracy. Motif Caller significantly enhances the efficiency and accuracy of data retrieval in DNA storage based on Motif-Based DNA synthesis.
- Publication:
-
arXiv e-prints
- Pub Date:
- December 2024
- DOI:
- arXiv:
- arXiv:2412.16074
- Bibcode:
- 2024arXiv241216074A
- Keywords:
-
- Computer Science - Other Computer Science;
- Quantitative Biology - Genomics