Reading yesterday's news. Layout recognition by segmentation of historical newspaper pages
Abstract
Newspapers are important sources for historians interested in past societies' cultural values, social structures, and their changes. Since the 19th century, newspapers have been widely available and spread regionally. Today, historical newspapers are digitized but unavailable in a separate metadata-enhanced form. Machine-readable metadata, however, is a prerequisite for a mass statistical analysis of this source. This paper focuses on parsing the complex layout of historic newspaper pages, which today's machines do not understand well. We argue for using neural networks, which require detailed annotated data in large numbers. Our Bonn newspaper dataset consists of 486 pages of the Kölnische Zeitung from the years 1866 and 1924. We propose solving the newspaper-understanding problem by training a U-Net on our new dataset, which delivers satisfactory performance.
- Publication:
- arXiv e-prints
- Pub Date:
- January 2024
- DOI:
- 10.48550/arXiv.2401.16845
- arXiv:
- arXiv:2401.16845
- Bibcode:
- 2024arXiv240116845S
- Keywords:
- Computer Science - Digital Libraries
- E-Print:
- Dataset available at: https://gitlab.uni-bonn.de/digital-history/newspaper-dataset . Baseline code: https://github.com/NewspaperSegmentation/NewspaperImageSegmentation/tree/master