Reading yesterday's news. Layout recognition by segmentation of historical newspaper pages

doi:10.48550/arXiv.2401.16845

Newspapers are important sources for historians interested in past societies' cultural values, social structures, and their changes. Since the 19th century, newspapers have been widely available and spread regionally. Today, historical newspapers are digitized but unavailable in a separate metadata-enhanced form. Machine-readable metadata, however, is a prerequisite for a mass statistical analysis of this source. This paper focuses on parsing the complex layout of historic newspaper pages, which today's machines do not understand well. We argue for using neural networks, which require detailed annotated data in large numbers. Our Bonn newspaper dataset consists of 486 pages of the Kölnische Zeitung from the years 1866 and 1924. We propose solving the newspaper-understanding problem by training a U-Net on our new dataset, which delivers satisfactory performance.

Publication: arXiv e-prints

Pub Date: January 2024

DOI: 10.48550/arXiv.2401.16845

arXiv: arXiv:2401.16845

Bibcode: 2024arXiv240116845S

Keywords: Computer Science - Digital Libraries

E-Print: Dataset available at: https://gitlab.uni-bonn.de/digital-history/newspaper-dataset . Baseline code: https://github.com/NewspaperSegmentation/NewspaperImageSegmentation/tree/master
