Looking at 3,000,000 References Without Growing Grey Hair
Abstract
The article service of the Astrophysics Data System (ADS, http://adswww.harvard.edu) currently holds about 500,000 pages scanned from astronomical journals and conference proceedings. This data set not only provides easy and convenient access to the majority of the astronomical literature from anywhere on the Internet, but also allows highly automated extraction of the information contained in the articles. As a first step towards processing and indexing the full text of the articles, the ADS has been extracting abstracts and references from the bitmap images of the articles since May 1999. In this poster we describe the procedures and strategies used to (a) automatically identify the regions within a paper that contain the abstract or the references, (b) spot and correct errors in the database or in the identification of the regions, (c) resolve references obtained by optical character recognition (OCR), with its inherent uncertainties, to parsed references (i.e., bibcodes), and (d) incorporate the data collected in this way into the ADS abstract service. We also give an overview of the extent of the additional bibliographical material obtained from this source. We estimate that by January 2000 these procedures will have yielded about 14,000 abstracts and 1,000,000 citation pairs (out of a total of 3,000,000 references) not previously present in the ADS.
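Step (c) can be illustrated with a minimal sketch. It is not the procedure used by the ADS, which the abstract does not specify. It assumes that a reference has already been parsed into fields, builds a candidate identifier using the standard 19-character bibcode layout (year, journal, volume, qualifier, page, first-author initial), and tolerates OCR errors by approximate string matching with Python's `difflib`. The function names `make_bibcode` and `resolve`, the 0.9 similarity cutoff, and the second entry in `known` are hypothetical choices made for the example.

```python
import difflib


def make_bibcode(year, journal, volume, page, author_initial, qualifier="."):
    """Assemble a 19-character bibcode: YYYYJJJJJVVVVMPPPPA."""
    return (
        f"{year:04d}"                         # 4-digit year
        + journal.ljust(5, ".")[:5]           # journal, left-justified, dot-padded
        + str(volume).rjust(4, ".")[-4:]      # volume, right-justified, dot-padded
        + qualifier                           # one-character qualifier
        + str(page).rjust(4, ".")[-4:]        # page, right-justified, dot-padded
        + author_initial[:1].upper()          # first author's initial
    )


def resolve(candidate, known_bibcodes, cutoff=0.9):
    """Return the closest known bibcode, or None if nothing is similar enough.

    The cutoff is an assumed tuning parameter, not a value from the source.
    """
    if candidate in known_bibcodes:
        return candidate
    matches = difflib.get_close_matches(candidate, known_bibcodes, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Illustrative lookup table: the first entry is this poster's own bibcode,
# the second is a hypothetical placeholder.
known = ["1999AAS...195.8209D", "1990ApJ...350....1X"]

clean = make_bibcode(1999, "AAS", 195, 8209, "D")
noisy = "1999AAS...195.82O9D"  # OCR read the digit 0 as the letter O

print(clean)
print(resolve(noisy, known))
```

Matching on the assembled identifier rather than on the raw reference string keeps the comparison short and fixed in length, so a single misread character changes the similarity score only slightly. A production system would more likely compare the individual fields and weight them differently.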
- Publication:
- American Astronomical Society Meeting Abstracts
- Pub Date:
- December 1999
- Bibcode:
- 1999AAS...195.8209D