We applied the method just described to the classification of articles
published in Astronomy and Astrophysics in the period 1994 to 1996
(3325 articles).
The descriptors were based on the bibliographic keywords. For this journal,
and some others in astronomy, there is a uniquely defined list of keywords.
A necessary preliminary phase was to homogenize them
(since they were assigned by one of the Editors but not in a completely
systematic fashion). We kept only the keywords that appear in at least
5 different articles, which limited our descriptors to the 250 most frequent keywords.
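The keyword selection can be illustrated with a short sketch. It assumes each article is given as a list of already homogenized keywords and that a document is described by a binary vector (1 if the keyword is present, 0 otherwise); the binary encoding and the name `build_descriptors` are assumptions made for illustration, not details stated in the text.

```python
from collections import Counter


def build_descriptors(article_keywords, min_articles=5):
    """Keep keywords found in at least `min_articles` articles and
    return (vocabulary, one binary vector per article)."""
    # Document frequency: each keyword counts once per article.
    doc_freq = Counter()
    for keywords in article_keywords:
        doc_freq.update(set(keywords))

    vocabulary = sorted(k for k, n in doc_freq.items() if n >= min_articles)
    index = {k: i for i, k in enumerate(vocabulary)}

    vectors = []
    for keywords in article_keywords:
        vec = [0] * len(vocabulary)
        for k in set(keywords):
            if k in index:  # keywords below the threshold are ignored
                vec[index[k]] = 1
        vectors.append(vec)
    return vocabulary, vectors


# Toy corpus (hypothetical keywords); a threshold of 2 stands in for 5.
toy = [["galaxies", "stars"], ["stars"], ["stars", "dust"], ["galaxies"]]
vocab, vecs = build_descriptors(toy, min_articles=2)
```

In this toy corpus, "dust" appears in a single article and is therefore dropped from the vocabulary.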
The documents characterized in this way constitute a set of 3325
stimuli to be applied to the network. Learning through 20 iterations
(heuristically determined)
gives good results for the primary map ( units)
requiring about 1 hour of processing on a Sparcstation 10 (dependent
on other users). For the secondary maps, the learning time is much
shorter, since fewer documents are processed (200 at most), and
the network has a smaller dimensionality (25 units).
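As an illustration of the learning phase, the following is a minimal sketch of a generic Kohonen map trained on document vectors. It is not the code used for the maps described here: the linearly decaying learning rate, the Gaussian neighbourhood, the random initialization, and the reading of "20 iterations" as 20 passes over the documents are all assumptions, and the function names are hypothetical.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return (row, col) of the unit whose weight vector is closest to x."""
    best, best_d = (0, 0), float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((wk - xk) ** 2 for wk, xk in zip(w, x))
            if d < best_d:
                best, best_d = (r, c), d
    return best


def train_som(data, rows, cols, iterations=20, seed=0):
    """Train a rows x cols map; one iteration = one pass over the data."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    for t in range(iterations):
        frac = t / iterations
        rate = 0.5 * (1.0 - frac)  # assumed linear decay
        radius = 0.5 + 0.5 * max(rows, cols) * (1.0 - frac)  # shrinking radius
        for i in rng.sample(range(len(data)), len(data)):  # shuffled pass
            x = data[i]
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-d2 / (2.0 * radius * radius))
                    w = weights[r][c]
                    for k in range(dim):
                        # Pull the unit towards the stimulus.
                        w[k] += rate * h * (x[k] - w[k])
    return weights


# Toy stimuli (hypothetical): four documents over three keywords.
docs = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
som = train_som(docs, rows=3, cols=3)
nodes = [best_matching_unit(som, d) for d in docs]
```

The cost of one pass grows with the number of documents, the number of units and the number of descriptors, which is consistent with the secondary maps training much faster than the primary one.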
When the training of all the maps, principal and secondary, is finished, the next task is to make these maps accessible to the user.
At the end of the training of a map, the number of documents assigned
to each node is known. We therefore have a table of numbers. Because
it is much easier to visualize the colours of an image than a matrix of
numbers, we transformed this table into an image. In this image, the colour scale
indicates qualitatively the number of documents per node.
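A minimal sketch of this step, assuming each document has already been assigned to a node given as a (row, column) pair. The helper names and the linear mapping of counts onto 256 intensity levels are illustrative assumptions; the actual colour table used for the maps is not specified here.

```python
def hits_per_node(assignments, rows, cols):
    """Count how many documents fall on each node of a rows x cols map."""
    table = [[0] * cols for _ in range(rows)]
    for r, c in assignments:
        table[r][c] += 1
    return table


def to_levels(table, levels=256):
    """Scale counts linearly to integer intensities in [0, levels - 1]."""
    peak = max(max(row) for row in table)
    if peak == 0:  # empty map: every pixel stays at the lowest level
        return [[0] * len(row) for row in table]
    return [[round(v * (levels - 1) / peak) for v in row] for row in table]


# Toy assignment of five documents to a 2 x 2 map.
counts = hits_per_node([(0, 0), (0, 0), (1, 1), (0, 0), (0, 1)], 2, 2)
image = to_levels(counts)
```

Each intensity can then be passed through any colour look-up table to produce the displayed image.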
The primary map is of dimension , and each of the secondary
maps is
. These images
are then scaled up by a factor of 40 (chosen for aesthetic reasons and to fit the
most common default Web browser window sizes). This transformation uses
linear interpolation, since otherwise the enlarged map would show
clear discontinuities.
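The enlargement can be sketched as a bilinear interpolation of the node image. This is one common way to realize linear interpolation in two dimensions; the pixel-centre alignment and the clamping at the borders are assumptions, and `upscale_bilinear` is a hypothetical name.

```python
def upscale_bilinear(image, factor):
    """Enlarge a 2-D list of numbers by an integer factor,
    interpolating linearly between neighbouring nodes."""
    rows, cols = len(image), len(image[0])
    out = []
    for i in range(rows * factor):
        # Map the output pixel centre back to source coordinates,
        # clamped so that border pixels repeat the edge values.
        y = min(max((i + 0.5) / factor - 0.5, 0.0), rows - 1.0)
        y0 = int(y)
        y1 = min(y0 + 1, rows - 1)
        fy = y - y0
        row = []
        for j in range(cols * factor):
            x = min(max((j + 0.5) / factor - 0.5, 0.0), cols - 1.0)
            x0 = int(x)
            x1 = min(x0 + 1, cols - 1)
            fx = x - x0
            top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
            bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


# Two neighbouring nodes with intensities 0 and 100, enlarged twice.
small = upscale_bilinear([[0, 100]], 2)
# A 2 x 2 map enlarged by the factor of 40 mentioned above.
big = upscale_bilinear([[1, 2], [3, 4]], 40)
```

Without the interpolation, each node would be rendered as a uniform block of `factor` by `factor` pixels, which produces the visible discontinuities mentioned above.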
For the maps to be interpretable, the themes associated with the document/node assignments have to be indicated. Our maps have a relatively limited number of units (by comparison, the maps proposed by the team of T. Kohonen (WEBSOM 1997) have about 8 times more neurons), yet it is still impossible to characterize all nodes without overlapping annotations. It is therefore preferable to select a limited number of nodes for characterization.
These nodes are selected on the basis of the frequent occurrence of a keyword, which is then written on the map. This was done manually, but could later be automated. The strategy is as follows:
Copyright The European Southern Observatory (ESO)