One of the basic principles in the parsing and formatting of the bibliographic data incorporated into the ADS database over the years has been to preserve as much of the original information as possible and delay any syntactic or semantic interpretation of the data until a later stage. From the implementation point of view, this means that bibliographic records provided to the ADS by publishers or other data sources typically are saved as files which are tagged with their origin, entry date, and any other ancillary information relevant to their contents (e.g. if the fields in the record contain data which was transliterated or converted to ASCII).
For instance, the records provided to the ADS by the University of Chicago Press (the publisher of several major U.S. astronomical journals) are SGML documents which contain a unique manuscript identifier assigned to the paper during the electronic publishing process. This identifier is saved in the file created by the ADS system for this bibliographic entry.
Because data about a particular bibliographic entry may be provided to the ADS by different sources and at different times, we adopted a multi-step procedure in the creation and management of bibliographic records:
1) Tokenization: Parsing input data into a memory-resident data structure using procedures which are format- and source-specific;
2) Identification: Computing the unique bibliographic record identifier used by the ADS to refer to this record;
3) Instantiation: Creating a new record for each bibliography formatted according to the ADS "standard" format;
4) Extraction: Selecting the best information from the different records available for the same bibliography and merging them into a single entry, avoiding duplication of redundant information.
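The four steps above can be sketched as a simple pipeline. The following Python fragment (Python rather than the PERL actually used by the ADS, and with purely illustrative function names, record layout, and identifier scheme) shows how tokenization, identification, instantiation, and extraction might fit together:

```python
def tokenize(raw):
    """Tokenization: parse a simple 'tag: value' record into a field dict.
    (The real parsers are format- and source-specific.)"""
    fields = {}
    for line in raw.strip().splitlines():
        tag, _, value = line.partition(":")
        fields[tag.strip()] = value.strip()
    return fields

def identify(fields):
    """Identification: derive a crude unique identifier from the fields.
    (The ADS uses bibliographic codes; this scheme is illustrative only.)"""
    return f"{fields['year']}.{fields['journal']}.{fields['author'].split(',')[0]}"

def instantiate(store, ident, fields):
    """Instantiation: append the new record to the merged entry."""
    store.setdefault(ident, []).append(fields)

def extract(store, ident):
    """Extraction: fuse all merged records into one canonical entry,
    here by keeping the longest value seen for each field."""
    canonical = {}
    for record in store[ident]:
        for tag, value in record.items():
            if len(value) > len(canonical.get(tag, "")):
                canonical[tag] = value
    return canonical
```

The "longest value wins" rule is only a stand-in for the source-hierarchy logic described later in this section.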
The activity of parsing a (possibly) loosely structured bibliographic record is typically more of an art than a science, given the wide range of formats used for the representation and display of these records. The ADS uses the PERL language (Practical Extraction and Report Language, [Wall & Schwartz 1991]) to implement most of the routines associated with handling the data. PERL is an interpreted programming language optimized for scanning and processing textual data. It was chosen over other programming languages because of its speed and flexibility in handling text strings; features such as pattern matching and regular expression substitution greatly facilitate manipulating the data fields. To maximize flexibility in the parsing and formatting of the different fields, we have written a set of PERL library modules and scripts that perform a number of common tasks. Those we consider worth mentioning from a methodological point of view are listed below.
  %E   URL for Electronic Data Table
  %J   Journal Name, Volume, and Page Range
  %L   Last Page of Article
  %U   URL for Electronic Document
  %W   Database (AST, PHY, INST)
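A parser for records in this %-tagged format reduces to a small lookup over line prefixes. The sketch below is in Python rather than PERL, and maps only the tags shown in the excerpt above (the field names on the right are illustrative; the full format defines many more tags):

```python
# Map the %-tags from the table above to illustrative field names.
TAGS = {
    "%E": "data_url",
    "%J": "journal_ref",
    "%L": "last_page",
    "%U": "doc_url",
    "%W": "database",
}

def parse_tagged(text):
    """Return a dict of field name -> value for every known %-tag line."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if len(line) >= 2 and line[:2] in TAGS:
            fields[TAGS[line[:2]]] = line[2:].strip()
    return fields
```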
Since the majority of our data sources do not provide author names in our standard format (last name, first name or initial), our loading routines need to be able to invert author names accurately, handling cases such as multiple word last names (Da Costa, van der Bout, Little Marenin) and suffixes (Jr., Sr., III). Any titles in an author's name (Dr., Rev.) were previously omitted, but are now being retained in the new XML formatting of text files.
The assessment of what constitutes a multiple word last name as opposed to a middle name is non-trivial, since some names, such as Davis, can be a first name (Davis Hartman), a middle name (A.G. Davis Philip), a last name (Robert Davis), or some combination (Davis S. Davis). Similarly, the name "Van" can be a first name (Van Nguyen), a middle name (W. Van Dyke Dixon), or part of a last name (J. van Allen). Handling all of these cases correctly requires not only familiarity with naming conventions worldwide, but also an intimate familiarity with the names of astronomers who publish in the field. We are continually amassing the latter as we incorporate increasing amounts of data into the system and as we receive feedback from our users.
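One common heuristic for such name inversion, shown here as a Python sketch (not the ADS routines; the particle and suffix lists are small illustrative subsets of the tables such code needs in practice), treats a lowercase particle like "van" or "der" as part of the surname but a capitalized "Van" as a given or middle name:

```python
# Heuristic name inversion: "J. van Allen" -> "van Allen, J."
# PARTICLES and SUFFIXES are illustrative subsets, not complete tables.
PARTICLES = {"van", "der", "da", "de", "von", "den"}
SUFFIXES = {"Jr.", "Sr.", "II", "III", "IV"}

def invert_name(name):
    """Invert 'First [Middle] Last [Suffix]' to 'Last [Suffix], First'."""
    words = name.split()
    suffix = ""
    if words and words[-1] in SUFFIXES:
        suffix = " " + words.pop()
    # Walk backwards from the final word, absorbing lowercase particles
    # (a capitalized "Van" is deliberately NOT absorbed into the surname).
    i = len(words) - 1
    while i > 0 and words[i - 1].lower() in PARTICLES and words[i - 1][0].islower():
        i -= 1
    last = " ".join(words[i:])
    first = " ".join(words[:i])
    return f"{last}{suffix}, {first}" if first else last + suffix
```

On the examples from the text, this yields "van Allen, J." but "Dixon, W. Van Dyke"; it would still misfile genuinely ambiguous cases, which is why name-specific knowledge remains necessary.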
We call "identification" the activity of mapping the tokens extracted from the parsing of a bibliographic record into a unique identifier. The ADS adopted bibliographic codes as identifiers for bibliographic entries shortly after its inception, in order to facilitate communication between the ADS and SIMBAD. The advantage of using bibliographic codes as unique identifiers is that they can most often be created in a straightforward way from the information given in the reference lists published in the astronomical literature, namely the publication year, journal name, volume and page numbers, and the first author's name (see Sect. 3.1 for details).
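A bibliographic code packs these tokens into a fixed-width 19-character string of the form YYYYJJJJJVVVVMPPPPA. The following Python sketch shows the basic padding mechanics only; the real bibcode rules involve many special cases (conference series, letters, multi-section journals, page numbers longer than four digits) that are glossed over here:

```python
def make_bibcode(year, journal, volume, page, author, qualifier="."):
    """Build a 19-character bibcode: YYYYJJJJJVVVVMPPPPA.
    year (4) + journal abbreviation (5, '.'-padded on the right)
    + volume (4, right-justified with '.') + qualifier (1)
    + first page (4, right-justified with '.') + author initial (1).
    A sketch of the basic layout only, not the full ADS rules."""
    return (f"{year:04d}"
            + journal.ljust(5, ".")
            + str(volume).rjust(4, ".")
            + qualifier
            + str(page).rjust(4, ".")
            + author[0].upper())
```

For example, year 1992, journal "ApJ", volume 400, page 1, qualifier "L" (a Letter), and first author "Wolszczan" yield "1992ApJ...400L...1W".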
"Instantiation" of a bibliographic entry consists of the creation of a record for it in the ADS database. The ADS must handle receipt of the same data from multiple sources. We have created a hierarchy of data sources so that we always know the preferred data source. A reference for which we have received records from STI, the journal publisher, SIMBAD, and NED, for example, must be in the system only once with the best information from each source preserved. When we load a reference into the system, we check whether a text file already exists for that reference. If there is no text file, it is a new reference and a text file is created. If there already is a text file, we append the new information to the current text file, creating a "merged" text file. This merged text file lists every instance of every field that we have received.
By "extraction" of a bibliographic entry we mean the procedure used to create a unique representation of the bibliography from the available records. This is essentially an activity of data fusion and unification, which removes redundancies in the bibliographic records obtained by the ADS and properly labels fields by their characteristics. The extraction algorithm has been designed with our prior experience as to the quality of the data to select the best fields from each data source, to cross-correlate the fields as necessary, and to create a "canonical" text file which contains a unique instance of each field. Since the latter is created through software, only one version of the text file must be maintained; when the merged text file is appended, the canonical text file is automatically recreated.
The extraction routine selects the best pieces of information from each source and combines them into one reference which is more complete than the individual references. For example, author lists received from STI were often truncated after five or ten authors. Whenever we have a longer author list from another source, that author list is used instead. This not only recaptures missing authors, it also provides full author names instead of author initials whenever possible. In addition, our journal sources sometimes omit the last page number of the reference, but SIMBAD usually includes it, so we are able to preserve this information in our canonical text file.
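Two of the selection rules just described, preferring the longest author list and taking the last page from whichever source supplies it, might be sketched in Python as follows (field names and record layout are illustrative, not the ADS schema):

```python
def best_authors(records):
    """Prefer the longest author list among all merged records, which
    recaptures authors truncated by sources such as STI. (Sketch.)"""
    lists = [r["authors"] for r in records if r.get("authors")]
    return max(lists, key=len, default=[])

def best_last_page(records):
    """Take the last page from any source that supplies it (e.g. SIMBAD
    usually does while some journal sources omit it). (Sketch.)"""
    for r in records:
        if r.get("last_page"):
            return r["last_page"]
    return None
```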
Some fields need to be labelled by their characteristics so that they are properly indexed and displayed. The keywords, for example, need to be attributed to a specific keyword system. The system designation allows for multiple keyword sets to be displayed (e.g. NASA/STI Keywords and AAS Keywords) and will be used in the keyword synonym table currently under development ([Lee et al. 1999]).
We also attempt to cross-correlate authors with their affiliations wherever possible. This is necessary for records where the preferred author field is from one source and the affiliations are from another source. We attempt to assign the proper affiliation based on the last name and do not assume that the author order is accurate since we are aware of ordering discrepancies in some of the STI records.
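Matching affiliations to the preferred author list by last name rather than by position can be sketched as a dictionary lookup (a simplified Python illustration; real records need fuzzier matching than the exact last-name comparison used here):

```python
def match_affiliations(preferred_authors, other_authors, other_affils):
    """Assign affiliations from one source to the preferred author list
    of another source, matching on last name instead of trusting author
    order. Unmatched authors get None. (Illustrative sketch only.)"""
    def last(name):
        return name.split(",")[0].strip().lower()
    by_last = {last(a): aff for a, aff in zip(other_authors, other_affils)}
    return [by_last.get(last(a)) for a in preferred_authors]
```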
Through these four steps in the procedure of creating and managing bibliographic records, we are able to take advantage of receiving the same reference from multiple sources. We standardize the various records and present to the user a combination of the most reliable fields from each data source in one succinct text file.
Copyright The European Southern Observatory (ESO)