One of the basic principles in the parsing and formatting of the bibliographic data incorporated into the ADS database over the years has been to preserve as much of the original information as possible and delay any syntactic or semantic interpretation of the data until a later stage. From the implementation point of view, this means that bibliographic records provided to the ADS by publishers or other data sources typically are saved as files which are tagged with their origin, entry date, and any other ancillary information relevant to their contents (e.g. if the fields in the record contain data which was transliterated or converted to ASCII).
For instance, the records provided to the ADS by the University of Chicago Press (the publisher of several major U.S. astronomical journals) are SGML documents which contain a unique manuscript identifier assigned to the paper during the electronic publishing process. This identifier is saved in the file created by the ADS system for this bibliographic entry.
Because data about a particular bibliographic entry may be provided to the ADS by different sources and at different times, we adopted a multi-step procedure in the creation and management of bibliographic records:
1) Tokenization: Parsing input data into a memory-resident data structure using procedures which are format- and source-specific;
2) Identification: Computing the unique bibliographic record identifier used by the ADS to refer to this record;
3) Instantiation: Creating a new record for each bibliography formatted according to the ADS "standard" format;
4) Extraction: Selecting the best information from the different records available for the same bibliography and merging them into a single entry, avoiding duplication of redundant information.
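The four steps above can be sketched as a simple pipeline. The following Python fragment (Python rather than the PERL actually used by the ADS, and with purely illustrative function names, record layout, and identifier scheme) shows how tokenization, identification, instantiation, and extraction might fit together:

```python
def tokenize(raw):
    """Tokenization: parse a simple 'tag: value' record into a field dict.
    (The real parsers are format- and source-specific.)"""
    fields = {}
    for line in raw.strip().splitlines():
        tag, _, value = line.partition(":")
        fields[tag.strip()] = value.strip()
    return fields

def identify(fields):
    """Identification: derive a crude unique identifier from the fields.
    (The ADS uses bibliographic codes; this scheme is illustrative only.)"""
    return f"{fields['year']}.{fields['journal']}.{fields['author'].split(',')[0]}"

def instantiate(store, ident, fields):
    """Instantiation: append the new record to the merged entry."""
    store.setdefault(ident, []).append(fields)

def extract(store, ident):
    """Extraction: fuse all merged records into one canonical entry,
    here by keeping the longest value seen for each field."""
    canonical = {}
    for record in store[ident]:
        for tag, value in record.items():
            if len(value) > len(canonical.get(tag, "")):
                canonical[tag] = value
    return canonical
```

The "longest value wins" rule is only a stand-in for the source-hierarchy logic described later in this section.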
The activity of parsing a (possibly) loosely structured bibliographic record is typically more of an art than a science, given the wide range of formats used for the representation and display of these records. The ADS uses the PERL language (Practical Extraction and Report Language, [Wall & Schwartz 1991]) to implement most of the routines associated with handling the data. PERL is an interpreted programming language optimized for scanning and processing textual data. It was chosen over other programming languages because of its speed and flexibility in handling text strings; features such as pattern matching and regular expression substitution greatly facilitate manipulating the data fields. To maximize flexibility in the parsing and formatting of the different fields, we have written a set of PERL library modules and scripts that perform a number of common tasks. Those we consider worth mentioning from a methodological point of view are listed below.
  %E   URL for Electronic Data Table
  %J   Journal Name, Volume, and Page Range
  %L   Last Page of Article
  %U   URL for Electronic Document
  %W   Database (AST, PHY, INST)
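A parser for records in this %-tagged format reduces to a small lookup over line prefixes. The sketch below is in Python rather than PERL, and maps only the tags shown in the excerpt above (the field names on the right are illustrative; the full format defines many more tags):

```python
# Map the %-tags from the table above to illustrative field names.
TAGS = {
    "%E": "data_url",
    "%J": "journal_ref",
    "%L": "last_page",
    "%U": "doc_url",
    "%W": "database",
}

def parse_tagged(text):
    """Return a dict of field name -> value for every known %-tag line."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if len(line) >= 2 and line[:2] in TAGS:
            fields[TAGS[line[:2]]] = line[2:].strip()
    return fields
```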
Since the majority of our data sources do not provide author names in our standard format (last name, first name or initial), our loading routines need to be able to invert author names accurately, handling cases such as multiple word last names (Da Costa, van der Bout, Little Marenin) and suffixes (Jr., Sr., III). Any titles in an author's name (Dr., Rev.) were previously omitted, but are now being retained in the new XML formatting of text files.
The assessment of what constitutes a multiple word last name as opposed to a middle name is non-trivial, since some names, such as Davis, can be a first name (Davis Hartman), a middle name (A.G. Davis Philip), a last name (Robert Davis), or some combination (Davis S. Davis). Similarly, the name "Van" can be a first name (Van Nguyen), a middle name (W. Van Dyke Dixon), or part of a last name (J. van Allen). Handling all of these cases correctly requires not only familiarity with naming conventions worldwide, but also an intimate familiarity with the names of astronomers who publish in the field. We are continually amassing the latter as we incorporate increasing amounts of data into the system and as we receive feedback from our users.
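One common heuristic for such name inversion, shown here as a Python sketch (not the ADS routines; the particle and suffix lists are small illustrative subsets of the tables such code needs in practice), treats a lowercase particle like "van" or "der" as part of the surname but a capitalized "Van" as a given or middle name:

```python
# Heuristic name inversion: "J. van Allen" -> "van Allen, J."
# PARTICLES and SUFFIXES are illustrative subsets, not complete tables.
PARTICLES = {"van", "der", "da", "de", "von", "den"}
SUFFIXES = {"Jr.", "Sr.", "II", "III", "IV"}

def invert_name(name):
    """Invert 'First [Middle] Last [Suffix]' to 'Last [Suffix], First'."""
    words = name.split()
    suffix = ""
    if words and words[-1] in SUFFIXES:
        suffix = " " + words.pop()
    # Walk backwards from the final word, absorbing lowercase particles
    # (a capitalized "Van" is deliberately NOT absorbed into the surname).
    i = len(words) - 1
    while i > 0 and words[i - 1].lower() in PARTICLES and words[i - 1][0].islower():
        i -= 1
    last = " ".join(words[i:])
    first = " ".join(words[:i])
    return f"{last}{suffix}, {first}" if first else last + suffix
```

On the examples from the text, this yields "van Allen, J." but "Dixon, W. Van Dyke"; it would still misfile genuinely ambiguous cases, which is why name-specific knowledge remains necessary.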
We call "identification" the activity of mapping the tokens extracted from the parsing of a bibliographic record into a unique identifier. The ADS adopted bibliographic codes as identifiers for bibliographic entries shortly after its inception, in order to facilitate communication between the ADS and SIMBAD. The advantage of using bibliographic codes as unique identifiers is that they can most often be created in a straightforward way from the information given in the reference lists published in the astronomical literature, namely the publication year, journal name, volume and page numbers, and the first author's name (see Sect. 3.1 for details).
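A bibliographic code packs these tokens into a fixed-width 19-character string of the form YYYYJJJJJVVVVMPPPPA. The following Python sketch shows the basic padding mechanics only; the real bibcode rules involve many special cases (conference series, letters, multi-section journals, page numbers longer than four digits) that are glossed over here:

```python
def make_bibcode(year, journal, volume, page, author, qualifier="."):
    """Build a 19-character bibcode: YYYYJJJJJVVVVMPPPPA.
    year (4) + journal abbreviation (5, '.'-padded on the right)
    + volume (4, right-justified with '.') + qualifier (1)
    + first page (4, right-justified with '.') + author initial (1).
    A sketch of the basic layout only, not the full ADS rules."""
    return (f"{year:04d}"
            + journal.ljust(5, ".")
            + str(volume).rjust(4, ".")
            + qualifier
            + str(page).rjust(4, ".")
            + author[0].upper())
```

For example, year 1992, journal "ApJ", volume 400, page 1, qualifier "L" (a Letter), and first author "Wolszczan" yield "1992ApJ...400L...1W".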
"Instantiation" of a bibliographic entry consists of the creation of a record for it in the ADS database. The ADS must handle receipt of the same data from multiple sources. We have created a hierarchy of data sources so that we always know the preferred data source. A reference for which we have received records from STI, the journal publisher, SIMBAD, and NED, for example, must be in the system only once with the best information from each source preserved. When we load a reference into the system, we check whether a text file already exists for that reference. If there is no text file, it is a new reference and a text file is created. If there already is a text file, we append the new information to the current text file, creating a "merged" text file. This merged text file lists every instance of every field that we have received.
By "extraction" of a bibliographic entry we mean the procedure used to create a unique representation of the bibliography from the available records. This is essentially an activity of data fusion and unification, which removes redundancies in the bibliographic records obtained by the ADS and properly labels fields by their characteristics. The extraction algorithm has been designed with our prior experience as to the quality of the data to select the best fields from each data source, to cross-correlate the fields as necessary, and to create a "canonical" text file which contains a unique instance of each field. Since the latter is created through software, only one version of the text file must be maintained; when the merged text file is appended, the canonical text file is automatically recreated.
The extraction routine selects the best pieces of information from each source and combines them into one reference which is more complete than the individual references. For example, author lists received from STI were often truncated after five or ten authors. Whenever we have a longer author list from another source, that author list is used instead. This not only recaptures missing authors, it also provides full author names instead of author initials whenever possible. In addition, our journal sources sometimes omit the last page number of the reference, but SIMBAD usually includes it, so we are able to preserve this information in our canonical text file.
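Two of the selection rules just described, preferring the longest author list and taking the last page from whichever source supplies it, might be sketched in Python as follows (field names and record layout are illustrative, not the ADS schema):

```python
def best_authors(records):
    """Prefer the longest author list among all merged records, which
    recaptures authors truncated by sources such as STI. (Sketch.)"""
    lists = [r["authors"] for r in records if r.get("authors")]
    return max(lists, key=len, default=[])

def best_last_page(records):
    """Take the last page from any source that supplies it (e.g. SIMBAD
    usually does while some journal sources omit it). (Sketch.)"""
    for r in records:
        if r.get("last_page"):
            return r["last_page"]
    return None
```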
Some fields need to be labelled by their characteristics so that they are properly indexed and displayed. The keywords, for example, need to be attributed to a specific keyword system. The system designation allows for multiple keyword sets to be displayed (e.g. NASA/STI Keywords and AAS Keywords) and will be used in the keyword synonym table currently under development ([Lee et al. 1999]).
We also attempt to cross-correlate authors with their affiliations wherever possible. This is necessary for records where the preferred author field is from one source and the affiliations are from another source. We attempt to assign the proper affiliation based on the last name and do not assume that the author order is accurate since we are aware of ordering discrepancies in some of the STI records.
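Matching affiliations to the preferred author list by last name rather than by position can be sketched as a dictionary lookup (a simplified Python illustration; real records need fuzzier matching than the exact last-name comparison used here):

```python
def match_affiliations(preferred_authors, other_authors, other_affils):
    """Assign affiliations from one source to the preferred author list
    of another source, matching on last name instead of trusting author
    order. Unmatched authors get None. (Illustrative sketch only.)"""
    def last(name):
        return name.split(",")[0].strip().lower()
    by_last = {last(a): aff for a, aff in zip(other_authors, other_affils)}
    return [by_last.get(last(a)) for a in preferred_authors]
```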
Through these four steps in the procedure of creating and managing bibliographic records, we are able to take advantage of receiving the same reference from multiple sources. We standardize the various records and present to the user a combination of the most reliable fields from each data source in one succinct text file.
Copyright The European Southern Observatory (ESO)