Reliable Editions from Unreliable Components: Estimating Ebooks from Print Editions Using Profile Hidden Markov Models
Abstract
A profile hidden Markov model, a popular model in biological sequence analysis, can be used to model related sequences of characters transcribed from books, magazines, and other printed materials. This paper documents one application of a profile HMM: automatically producing an ebook edition from distinct print editions. The resulting ebook has virtually all the desired properties found in a publisher-prepared ebook, including accurate transcription and an absence of print artifacts such as end-of-line hyphenation and running headers. The technique, which has particular benefits for readers and libraries that require books in an accessible format, is demonstrated using seven copies of a nineteenth-century novel.
- Publication: arXiv e-prints
- Pub Date: April 2022
- DOI: 10.48550/arXiv.2204.01638
- arXiv: arXiv:2204.01638
- Bibcode: 2022arXiv220401638R
- Keywords: Computer Science - Digital Libraries
- E-Print: In Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries (JCDL '22)