5 Updating the database

The software that updates bibliographic records in the database consists of a series of Perl scripts, typically one per data source, each of which reads in the data, performs any special processing particular to that data source, and writes the data out to text files. The loading routines perform three fundamental tasks: 1) they add new bibliographic codes to the current master list of bibliographic codes in the system; 2) they create and organize the text files containing the reference data; and 3) they maintain the lists of bibliographic codes used to indicate what items are available for a given reference.

5.1 The master list

The master list is a table containing bibliographic codes together with their publication dates (YYYYMM) and their dates of entry into the system (YYYYMMDD). There is one master list per database, with one line per reference. The most important aspect of the master list is that it retains information about "alternative" bibliographic codes and matches them to their corresponding preferred bibliographic codes. An alternative bibliographic code is usually one attached to a reference which we receive from another source (primarily SIMBAD or NED) and which differs from the code used by the ADS. Sometimes this is due to the different rules used to build bibliographic codes for non-standard publications (see Sect. 3.1), but often it is just an incorrect year, volume, page, or author initial in one of the databases (SIMBAD, NED, or the ADS). In either case, the ADS must keep the alternative bibliographic code in the system so that it can be found when referenced by the other source (e.g. when SIMBAD sends back a list of its codes related to an object). The ADS matches the alternative bibliographic code to its own corresponding code and substitutes that code wherever the alternative is referenced by the other data source. Alternative bibliographic codes in the master list are prefixed with an identification letter (S for SIMBAD, N for NED, J for Journal) so that their origin is retained.

While we make every effort to propagate corrections back to our data sources, sometimes there is simply a valid discrepancy. For example, an alternative bibliographic code may differ from the ADS bibliographic code because of an ambiguity such as which name is the surname of a Chinese author. Since Americans tend to invert Chinese names one way (Zheng, Wei) and Europeans another (Wei, Zheng), this results in two different but equally valid codes.

Similarly, differing journal designations such as BAAS (for the published abstracts in the Bulletin of the American Astronomical Society) and AAS (for the equivalent abstract with meeting and session number, but no volume or page number) produce different codes that refer to the same paper. Russian and Chinese translation journals (Astronomicheskii Zhurnal vs. Soviet Astronomy and Acta Astronomica Sinica vs. Chinese Astronomy and Astrophysics) share the same problem. These papers appear once in the foreign journal and once in the translation journal (usually with different page numbers), but they are actually the same paper, which should be in the system only once. The ADS must therefore maintain multiple bibliographic codes for the same article, since each journal has its own abbreviation and a query using any of these codes must be recognized. The master list is the source of this correlation and enables the indexing procedures and search engine to recognize alternative bibliographic codes.
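
The role of the master list in resolving alternative codes can be illustrated with a minimal Python sketch. It is not the ADS implementation (the loading software is written in Perl), and the on-disk layout of the master list is not described here, so the in-memory structures, the class name MasterList, and the helper split_origin are all hypothetical. The two bibliographic codes in the usage example are invented for illustration, and the sketch assumes that a prefixed alternative code can be recognized by an origin letter followed by a four-digit year.

```python
from typing import Dict, Optional, Tuple

# Origin letters named in the text; everything else in this sketch is assumed.
ORIGIN_PREFIXES = {"S": "SIMBAD", "N": "NED", "J": "Journal"}


def split_origin(code: str) -> Tuple[Optional[str], str]:
    """Split an optional origin letter off a code.

    Assumption: a prefixed code is an origin letter followed by a code that
    starts with a four-digit year; an unprefixed code starts with the year.
    """
    if code[:1] in ORIGIN_PREFIXES and code[1:5].isdigit():
        return ORIGIN_PREFIXES[code[0]], code[1:]
    return None, code


class MasterList:
    """Hypothetical in-memory view of one master list."""

    def __init__(self) -> None:
        # preferred code -> (publication date YYYYMM, entry date YYYYMMDD)
        self.preferred: Dict[str, Tuple[str, str]] = {}
        # bare alternative code -> (preferred code, origin name)
        self.alternates: Dict[str, Tuple[str, Optional[str]]] = {}

    def add_preferred(self, bibcode: str, pubdate: str, entdate: str) -> None:
        self.preferred[bibcode] = (pubdate, entdate)

    def add_alternate(self, prefixed_code: str, preferred: str) -> None:
        if preferred not in self.preferred:
            raise KeyError(f"unknown preferred code: {preferred}")
        origin, bare = split_origin(prefixed_code)
        self.alternates[bare] = (preferred, origin)

    def resolve(self, code: str) -> Optional[str]:
        """Return the preferred code for any known code, else None."""
        _, bare = split_origin(code)
        if bare in self.preferred:
            return bare
        if bare in self.alternates:
            return self.alternates[bare][0]
        return None


# Usage with invented codes that differ only in the author initial.
master = MasterList()
master.add_preferred("1990ChA&A..14....1Z", "199001", "19980115")
master.add_alternate("S1990ChA&A..14....1W", "1990ChA&A..14....1Z")
resolved = master.resolve("1990ChA&A..14....1W")
```

In this sketch a query using either the preferred code or the alternative code (with or without its origin letter) resolves to the same preferred code, which is the behaviour the master list is meant to guarantee.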

5.2 The text files

Text files in the ADS are stored in a directory tree organized by bibliographic code. The top level consists of directories with four-digit names, one per publication year (characters 1 through 4 of the bibliographic code). The next level contains directories with five-character names, one per journal (characters 5 through 9), and the text files under these journal directories are named by full bibliographic code. Thus, a sample pathname is 1998/MNRAS/1998MNRAS.295...75E. Alternative bibliographic codes do not have a text file named by that code, since the translation to the equivalent preferred bibliographic code is done prior to accessing the text file.
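
The mapping from bibliographic code to pathname described above can be written as a short Python sketch. The function name text_file_path is hypothetical, and the check that a code is 19 characters long and begins with a four-digit year is an assumption of this sketch rather than something stated in this section. The code is expected to be a preferred one, since alternative codes are translated before the text file is accessed.

```python
from pathlib import PurePosixPath


def text_file_path(bibcode: str) -> PurePosixPath:
    """Build the relative path year/journal/bibcode for a preferred code."""
    # Assumed sanity check: 19 characters, starting with a four-digit year.
    if len(bibcode) != 19 or not bibcode[:4].isdigit():
        raise ValueError(f"not a well-formed bibliographic code: {bibcode!r}")
    year = bibcode[0:4]      # characters 1 through 4
    journal = bibcode[4:9]   # characters 5 through 9
    return PurePosixPath(year) / journal / bibcode


sample = text_file_path("1998MNRAS.295...75E")
print(sample)  # 1998/MNRAS/1998MNRAS.295...75E
```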

A sample text file is given in the appendices. Appendix B shows the full bibliographic entry, including all records as received from STI, MNRAS, and SIMBAD. It contains XML-tagged fields from each source, showing all instances of every field. Appendix C shows the extracted canonical version of the bibliographic entry which contains only selected information from the merged text file. This latter version is displayed to the user through the user interface (see SEARCH).

5.3 The codes files

The third basic function of the loading procedures is to modify and maintain the listings for available items. The ADS displays the availability of resources or information related to bibliographic entries as letter codes in the results list of queries and as more descriptive hyperlinks in the page displaying the full information available for a bibliographic entry. A full listing of the available item codes and their meaning is given in SEARCH.

For each letter code in the system, the loading routines maintain a list of bibliographic codes; the indexing routines convert these lists to URLs (see ARCHITECTURE). Bibliographic codes are appended to the lists either during the loading process or as post-processing work, depending on the availability of the resource. When electronic availability of the data coincides with our receipt of the data, the bibliographic codes can be appended to the lists by the loading procedures. When we receive the data prior to electronic availability, post-processing routines must be run to update the bibliographic code lists after we are notified that we may activate the links.
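
The two update paths can be sketched in Python as follows. This is an illustration under stated assumptions, not the ADS code: the class CodeLists, its method names, and the idea of holding not-yet-available items in a separate "pending" structure are hypothetical, and the letter "X" is a placeholder rather than an actual ADS item code.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, List


class CodeLists:
    """Hypothetical per-letter lists of bibliographic codes."""

    def __init__(self) -> None:
        # letter code -> bibliographic codes whose links are active
        self.active: DefaultDict[str, List[str]] = defaultdict(list)
        # letter code -> codes received before electronic availability
        # (an assumed bookkeeping structure, not described in the text)
        self.pending: DefaultDict[str, List[str]] = defaultdict(list)

    def load(self, letter: str, bibcode: str, available_now: bool) -> None:
        """Called at load time: append directly only if already available."""
        target = self.active if available_now else self.pending
        if bibcode not in target[letter]:
            target[letter].append(bibcode)

    def activate(self, letter: str, bibcodes: Iterable[str]) -> None:
        """Post-processing step, run once links may be activated."""
        for bibcode in bibcodes:
            if bibcode in self.pending[letter]:
                self.pending[letter].remove(bibcode)
            if bibcode not in self.active[letter]:
                self.active[letter].append(bibcode)


lists = CodeLists()
# Resource already online when the data arrive: appended during loading.
lists.load("X", "1998MNRAS.295...75E", available_now=True)
# Data received ahead of electronic availability: held back for now.
lists.load("X", "1998MNRAS.295...80A", available_now=False)
before = list(lists.active["X"])
# Later, after notification that the link may be activated.
lists.activate("X", ["1998MNRAS.295...80A"])
after = list(lists.active["X"])
```

Keeping the pending items separate means the lists handed to the indexing routines only ever contain codes whose links are allowed to be active.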

