6 The articles

The ADS is able to scan and provide free access to past issues of the astronomical journals because of the willing collaboration of the journal publishers. The primary reason that the journal publishers have agreed to allow the scanning of their old volumes is that the loss of individual subscriptions does not pose a threat to their livelihood. Unlike journals in many other disciplines, most astronomy journals are able to pay for their publications through page charges to the astronomers who write the articles and through library subscriptions, which are unlikely to be cancelled in spite of free access to older volumes through the ADS. The journal publishers continue to charge for access to the current volumes, which most institutional libraries pay for. This arrangement places astronomers in a fortunate position for electronic access to astronomy articles.

The original electronic publishing plans for the astronomical community called for STELAR (STudy of Electronic Literature for Astronomical Research, [van Steenberg 1992]; [van Steenberg et al. 1992]; [Warnock et al. 1992]; [Warnock et al. 1993]) to handle the scanning and dissemination of the full journal articles. However, when the STELAR project was terminated in 1993, the ADS assumed responsibility for providing scanned full journal articles to the astronomical community. The first test journal to be scanned was the ApJ Letters, which was scanned in January 1995 at 300 dots per inch (dpi). Those scans were intended to be made at 600 dpi, and we will soon rescan them at that resolution. Complications in the journal publishing format (plates at the end of some volumes and in the middle of others) were noted, and detailed instructions were provided to the scanning company so that the resulting scans would be named properly by page or plate number.

All scans since the original test batch have been made at 600 dpi using a high speed scanner that generates a 1 bit/pixel monochrome image for each page. The resulting files are then automatically processed to de-skew and center the text on each page, resize the images to a standard U.S. Letter size (8.5 $\times$ 11 inches), and add a copyright notice at the bottom of each page. For each original scanned page, two separate image files of different resolutions are generated and stored on disk. The availability of different resolutions gives users the flexibility of downloading either high or medium quality documents, depending on the speed of their internet connection. The image formats and compression used were chosen based on the available compression algorithms and browser capabilities. The high resolution files currently used are 600 dpi, 1 bit/pixel TIFF (Tagged Image File Format) files, compressed using the CCITT Group 4 facsimile encoding algorithm. The medium resolution files are 200 dpi, 1 bit/pixel TIFF files, also with CCITT Group 4 facsimile compression.
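To make the two stored resolutions concrete, the following sketch works out the pixel dimensions and the raw, uncompressed bitmap size of a U.S. Letter page at 600 and 200 dpi. The function names are illustrative only and are not part of the ADS processing software; the calculation ignores row padding and TIFF header overhead, and it says nothing about the size after CCITT Group 4 compression, which depends on page content.

```python
# Page geometry for a U.S. Letter page (8.5 x 11 inches) stored at 1 bit/pixel.
# Illustrative helper names; not taken from the ADS software.
LETTER_INCHES = (8.5, 11.0)


def page_pixels(dpi):
    """Return (width, height) in pixels for a Letter page at the given dpi."""
    width_in, height_in = LETTER_INCHES
    return round(width_in * dpi), round(height_in * dpi)


def uncompressed_bytes(dpi, bits_per_pixel=1):
    """Raw bitmap size before compression, ignoring row padding and headers."""
    width, height = page_pixels(dpi)
    return width * height * bits_per_pixel // 8


if __name__ == "__main__":
    for dpi in (600, 200):
        width, height = page_pixels(dpi)
        size = uncompressed_bytes(dpi)
        print(f"{dpi} dpi: {width} x {height} pixels, {size:,} bytes uncompressed")
```

Under these assumptions a 600 dpi page is 5100 by 6600 pixels (4,207,500 bytes uncompressed) and a 200 dpi page is 1700 by 2200 pixels (467,500 bytes uncompressed), a factor of nine fewer pixels for the medium resolution file.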

Conversion to printing formats (PDF, PCL, and PostScript) is done on demand, as requested by the user. Similarly, conversion from the TIFF files to a low resolution GIF (Graphics Interchange Format) file (75, 100, or 150 dpi, depending on user preferences) for viewing on the computer screen is done on demand, and the result is then cached so that the most frequently accessed pages do not need to be created every time. A procedure run nightly deletes the GIF files with the oldest access time stamps so that the total size of the disk cache is kept under a pre-defined limit. With the 10 GBytes of cache currently in use at the SAO Article Server, only files which have not been accessed for about a month are deleted. In addition to the full-screen GIF images, the ADS caches thumbnail images of the article pages, which give users the capability of viewing the entire article at a glance.
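The nightly cache cleanup described above amounts to evicting the least recently accessed files until the cache fits within its size limit. The following is a minimal sketch of that idea, not the ADS procedure itself: the function name `prune_cache`, its parameters, and the flat directory layout are assumptions made for illustration. It relies on the file system recording access times, which is not the case on volumes mounted with access-time updates disabled.

```python
import os


def prune_cache(cache_dir, max_bytes, suffix=".gif"):
    """Delete the least recently accessed files until the cache fits in max_bytes.

    Hypothetical helper: assumes all cached images sit directly in cache_dir.
    Returns the list of paths that were removed, oldest access time first.
    """
    entries = []
    for name in os.listdir(cache_dir):
        path = os.path.join(cache_dir, name)
        if name.endswith(suffix) and os.path.isfile(path):
            info = os.stat(path)
            entries.append((info.st_atime, info.st_size, path))

    total = sum(size for _, size, _ in entries)
    removed = []
    # Sorting the tuples orders the files by access time, oldest first.
    for _, size, path in sorted(entries):
        if total <= max_bytes:
            break
        os.remove(path)
        total -= size
        removed.append(path)
    return removed
```

A call such as `prune_cache("/path/to/cache", 10 * 10**9)` would correspond to a 10 GByte limit; the path here is a placeholder. Because files are removed strictly in order of access time, the age of the oldest surviving file depends only on how quickly new pages are requested, which is consistent with the roughly one-month retention quoted above.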

The ADS uses Optical Character Recognition (OCR) software to gain additional data from TIFF files of article scans. The OCR software is not yet adequate for accurate reproduction of the scanned pages. Greek symbols, equations, charts, and tables do not translate accurately enough to remain true to the original printed page. For this reason, we have chosen not to display to the user anything rendered by the OCR software in an unsupervised fashion. However, we are still able to take advantage of the OCR software for several purposes.

First, we are able to identify and extract the abstract paragraph(s) for use when we do not have the abstract from another source. In these cases, the OCR'd text is indexed so that it is searchable and the extracted image of the abstract paragraph is displayed in lieu of an ASCII version of the abstract. Extracting the abstract from the scanned pages is somewhat tedious, as it requires establishing different sets of parameters for each journal, as well as for different fonts used over the years by the same journal. The OCR software can be taught how to determine where the abstract ends, but it does not work for every article due to oddities such as author lists which extend beyond the first page of an article, and articles which are in a different format from others in the same volume (e.g. no keywords or multiple columns). The ADS currently contains approximately 25 000 of these abstract images and more will be added as we continue to scan the historical literature.

We are also currently using the OCR software to render electronic versions of the entire scanned articles for indexing purposes. We will not use this for display to the users, but hope to be able to index it to provide the possibility of full text searching at some future date. We estimate that the indexing of our almost one million scanned pages with our current hardware and software will take approximately two years of dedicated CPU time.

The last benefit that we gain from the OCR software is the conversion of the reference list at the end of articles. We use parsed reference lists from the scanned articles to build citation and reference lists for display through the C and R links of the available items. Since reference lists are typically in one of several standard formats, we parse each reference for author, journal, volume and page number for most journal articles, and conference name, author, and page number for many conference proceedings. This enables us to build bibliographic code lists for the references contained in an article (R links) and to invert these lists to build bibliographic code lists of the articles which cite that article (C links). We are also able to use this process to identify, and therefore add, commonly-cited articles which are currently missing from the ADS. These are usually articles published prior to 1975 or astronomy-related articles published in non-astronomy journals.
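The inversion of reference lists into citation lists, and the detection of frequently cited articles that are missing from the database, can be sketched as follows. This is an illustration under stated assumptions rather than the ADS implementation: the function names are hypothetical, and the identifiers `paper-A`, `paper-B`, and so on are placeholders standing in for bibliographic codes.

```python
from collections import defaultdict


def invert_references(references):
    """Turn {citing: [cited, ...]} (R links) into {cited: [citing, ...]} (C links)."""
    citations = defaultdict(set)
    for citing, cited_list in references.items():
        for cited in cited_list:
            citations[cited].add(citing)
    return {cited: sorted(citing) for cited, citing in citations.items()}


def missing_targets(references, known, min_citations=2):
    """List cited identifiers absent from `known` with at least min_citations citers."""
    citations = invert_references(references)
    return sorted(
        cited
        for cited, citing in citations.items()
        if cited not in known and len(citing) >= min_citations
    )


if __name__ == "__main__":
    # Placeholder identifiers, not real bibliographic codes.
    references = {
        "paper-A": ["paper-C", "paper-X"],
        "paper-B": ["paper-A", "paper-C", "paper-X"],
    }
    known = {"paper-A", "paper-B", "paper-C"}
    print(invert_references(references))
    print(missing_targets(references, known))
```

In this toy input, `paper-X` is cited by both scanned articles but is not in the set of known items, so it is the only candidate reported for addition.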

The Article Service currently contains 250 GBytes of scans, consisting of 1 128 955 article pages that make up 138 789 articles. These numbers increase on a regular basis, both as we add more articles from the older literature and as we scan new journals.
