Typesafe Coordinate Systems in High-Throughput Sequencing Applications
Abstract
High-throughput sequencing file formats and tools encode coordinate intervals with respect to a reference sequence in at least four distinct, incompatible ways. Integrating data from and moving data between different formats has the potential to introduce subtle off-by-one errors. Here, we introduce the notion of typesafe coordinates: coordinate intervals are not only an integer pair, but members of a type class comprising four types: the Cartesian product of a zero or one basis, and an open or closed interval end. By leveraging the type system of statically and strongly-typed, compiled languages we can provide static guarantees that an entire class of error is eliminated. We provide a reference implementation in D as part of a larger work (dhtslib), and proofs of concept in Rust, OCaml, and Python. Exploratory implementations are available at https://github.com/blachlylab/typesafe-coordinates.
- Publication:
-
arXiv e-prints
- Pub Date:
- September 2022
- DOI:
- 10.48550/arXiv.2209.06603
- arXiv:
- arXiv:2209.06603
- Bibcode:
- 2022arXiv220906603G
- Keywords:
-
- Quantitative Biology - Genomics;
- Computer Science - Logic in Computer Science;
- Computer Science - Programming Languages;
- D.2.4;
- D.3.3;
- J.3
- E-Print:
- 14 pages, 3 figures. Code available at https://github.com/blachlylab/typesafe-coordinates