2 Motivation and method

  Prior to the launch of SOHO, a study was undertaken (Brynildsen 1994) to identify the "best" profile fitting package for the CDS and SUMER instruments. The study compared various algorithms for fitting Gaussian profiles, or combinations thereof.

The common denominator linking all of the profile-fitting algorithms studied by Brynildsen (CURVEFIT, from the Interactive Data Language (IDL) userlib; AMOEBA, a "downhill" simplex algorithm from Press et al. 1992; and others) is the need for user input regarding starting points for each parameter in the search. This potential source of user bias, together with the reduced quality (in terms of fit to the data) of the resulting parameterisation, forms the principal motivation for this paper. Indeed, we show that neither is present when a GA technique is used, beyond the absolute minimum requirement of supplying a "line list" of lines to be identified.

Using a GA for this profile-fitting problem offers many advantages not available to the user of predictive line-fitting algorithms. To consider one of the many advantages noted in Charbonneau (1995), a GA is not de-stabilised by noise in the data: it will simply continue to pursue its goal of locating the "best" profile. The GA will still attain this goal; the introduction of noise merely affects the convergence time of the algorithm.

We present a "simple" GA, called Ga-GA, which we show to be stable against reasonable noise levels and to have no source of possible user bias. The following sections discuss its performance in detail.

2.1 Overview of a simple Genetic Algorithm

Genetic Algorithms are inspired by the mechanism of natural selection and the basic genetic operators occurring naturally in biological systems, see Holland (1962). Consider a typical numerical optimisation task, where a parametric model is to be fitted to data in a manner that maximises the closeness of fit, or fitness (as measured, for example, by a $\chi^2$ statistical estimator). A genetic algorithm is an iterative scheme that operates on a population of trial solutions to the problem in the following way (a schematic sketch of this loop follows the list):
1. Construct an initial population using random values for the model parameters, and evaluate their fitness.
2. Select a subset of the fitter individuals, and breed them to produce a new population.
3. Evaluate the fitness of each individual in the new population.
4. Replace the old population with the new one.
5. Check whether the fitness has reached some pre-defined tolerance, or the number of iterations (or generations) has reached its maximum; if not, return to step 2.
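
As an illustrative sketch only (this is not the Ga-GA source; the fitness and breed routines are placeholders to be supplied by the user), the loop above may be written in Python as:

\begin{verbatim}
import random

def run_ga(fitness, breed, n_params, pop_size=100, max_gen=500, tol=1.0):
    # Schematic GA driver. `fitness` maps a parameter list to a score
    # (higher is fitter); `breed` performs selection, cross-over and mutation.
    # Step 1: random initial population (parameters normalised to [0, 1])
    pop = [[random.random() for _ in range(n_params)] for _ in range(pop_size)]
    scores = [fitness(ind) for ind in pop]

    for generation in range(max_gen):
        # Step 2: select and breed a new population from the fitter individuals
        pop = breed(pop, scores)
        # Steps 3-4: evaluate the new population and replace the old one
        scores = [fitness(ind) for ind in pop]
        # Step 5: stop once the fitness tolerance is reached
        if max(scores) >= tol:
            break

    best = max(range(len(pop)), key=lambda i: scores[i])
    return pop[best], scores[best]
\end{verbatim}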

GAs carry out a form of forward modelling by performing a heuristic search of the problem's parameter space. What distinguishes a GA from other forward modelling methods (such as Monte Carlo simulation) is primarily the way in which new trial solutions are constructed from the current population of trial solutions (cf. step 2 above).

At the most basic level a GA can be viewed as a processor of a set of strings, each encoding a particular version of the model being optimised. A subset of the fitter individuals of the current population is selected and paired, and the defining strings of each such pair are subjected to the action of two genetic operators: cross-over and mutation. The cross-over operation involves dissection of the two parent strings at a randomly chosen point along the string, followed by the interchange of the dissected components. In this way two new strings are produced from two parent strings (see Fig. 1). The second operator, mutation, involves the replacement of a few randomly selected digits in the two strings produced by the cross-over operation with randomly generated digit values. Its primary purpose is to maintain a suitable level of variation in the population, which is essential for selection to operate. The combination of stochastic genetic operators with fitness-based selection yields a powerful search algorithm that can move away from secondary extrema and locate the global extremum in parameter space (see, e.g., Goldberg 1989; Davis 1991; Charbonneau 1995).

  
Figure 1: A pictorial explanation of the main GA breeding operator, cross-over.
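
To make the two operators concrete, a minimal Python sketch for decimal digit strings follows; the function names, string lengths and mutation rate here are arbitrary illustrations rather than the encoding actually used by Ga-GA.

\begin{verbatim}
import random

def crossover(parent1, parent2):
    # One-point cross-over: cut both strings at the same random point
    # and interchange the dissected components.
    point = random.randint(1, len(parent1) - 1)
    return (parent1[:point] + parent2[point:],
            parent2[:point] + parent1[point:])

def mutate(string, rate=0.01):
    # Replace each digit with a random digit with probability `rate`.
    return "".join(random.choice("0123456789") if random.random() < rate else d
                   for d in string)

# Example: breed two 12-digit parent strings
child1, child2 = crossover("123456789012", "987654321098")
child1, child2 = mutate(child1), mutate(child2)
\end{verbatim}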

In this paper we use a GA which implements a variable mutation rate: as the population becomes more degenerate (i.e. shows little variation), the probability of a mutation taking place is correspondingly increased. The GA also makes use of elitism, whereby the best individual in the old generation is not replaced unless there is a fitter one in the new generation. The selection of individuals in the breeding operator is carried out using a roulette-wheel algorithm (see Davis 1991, Ch. 1), in which individuals with higher fitness are associated with sectors of correspondingly larger angle on the roulette wheel. When the wheel is "spun", all individuals are capable of breeding, but the fitter individuals have a slightly higher probability of being selected.
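
A minimal Python sketch of such fitness-proportionate selection is given below; it assumes non-negative fitness values and, for brevity, omits the variable mutation rate and the elitism step.

\begin{verbatim}
import random

def roulette_select(population, fitnesses):
    # Spin a "roulette wheel" whose sectors are proportional to fitness:
    # every individual can breed, but fitter ones are chosen more often.
    total = sum(fitnesses)
    spin = random.uniform(0.0, total)
    running = 0.0
    for individual, fitness in zip(population, fitnesses):
        running += fitness
        if running >= spin:
            return individual
    return population[-1]   # guard against floating-point round-off
\end{verbatim}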

In many ways our GA resembles that of Charbonneau (1995), but it also contains some features of the GA presented in Diver & Ireland (1997). Indeed, in the cases presented in Sect. 4 we have employed a variation on the algorithm PIKAIA presented in Charbonneau (1995) to maximise speed and accuracy.

2.2 Fitness evaluation

Isolation of particular features (e.g. line width and absolute intensity) in an emission line spectrum made up of N lines is a procedure used by many standard fitting algorithms, many of which use line identification as their primary "search" (cf. the user input given to the algorithms mentioned above). On acquiring the line position, they sequentially alter the amplitude or the $\frac{1}{e}$ width of the Gaussian profile(s) to achieve the "best" fit to the target. However, since observed emission line spectra can, and do, contain a large number of profiles, it is possible to adopt a method which solves for all lines simultaneously (see e.g., Diver 1995; Diver & Ireland 1997).

When Ga-GA "recognises" spectral features, i.e. one of the parameters describing a Gaussian or an entire profile, the corresponding final solution will be a better representation of the target, and the string of parameters involved will be given a higher fitness. Since Ga-GA uses the mechanics of natural selection, a genotype with a higher fitness value will be prevalent in the current and future generations until replaced by a "fitter" individual.

Ga-GA uses parameter strings of length $3\,\times\,N$, where N is the estimated number of Gaussian profiles in the line spectrum to be analysed and three is the number of parameters required to describe a general Gaussian profile. These parameters are the absolute position in wavelength, i.e. the channel (X), the amplitude (A), and the Gaussian's $\frac{1}{e}$ width (W); they are encoded as a string in the following order:

\begin{displaymath}
\left[X_1, A_1, W_1, \ldots, X_N, A_N, W_N\right].\end{displaymath}

A string of the form above defines a sequence of N Gaussian profiles which together make up a synthetic spectrum; this computed profile is an individual's phenotype. It is this phenotype profile that is retained for comparison with the observed spectrum. Phenotype profiles are calculated using the standard pointwise Gaussian formula, i.e. for a particular channel x (usually associated with a wavelength) in Gaussian i ($G_i$):
 \begin{displaymath}
G_{i}(x) = A_i\,\exp\left(\frac{-\left(\,x\,-\,X_{i}\right)^{2}}
{W_i^2}\right).\end{displaymath} (1)
The N Gaussian profiles derived from a particular genotype string are summed to form the "unique" phenotypic profile for genotype j, $P(\underline{x})_{j}$ (with $\underline{x}$ meaning for all channels x). $P(\underline{x})_{j}$ is given by:  
 \begin{displaymath}
P(\underline{x})_{j} = \sum_{i=1}^{N}\,G_{i}(x) \,\,\,\,\,\, \forall x.\end{displaymath} (2)
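As an illustration of Eqs. (1) and (2) (this is a sketch rather than the Ga-GA implementation, and the example line parameters are arbitrary), a genotype string can be decoded and summed into its phenotype profile as follows:

\begin{verbatim}
import numpy as np

def phenotype(genotype, x):
    # Sum of N Gaussians (Eqs. 1-2) from a genotype
    # [X1, A1, W1, ..., XN, AN, WN]; `x` holds the channel values.
    profile = np.zeros_like(x, dtype=float)
    for Xi, Ai, Wi in zip(genotype[0::3], genotype[1::3], genotype[2::3]):
        profile += Ai * np.exp(-((x - Xi) ** 2) / Wi ** 2)
    return profile

# Example: two lines in a 100-channel window
x = np.arange(100.0)
spectrum = phenotype([30.0, 5.0, 2.5, 70.0, 3.0, 4.0], x)
\end{verbatim}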
Only once $P(\underline{x})_{j}$ has been computed do we calculate an error measure between it and the target. The error measure of a particular genotype ($E(\underline{x})_j$) depends on several factors: the square pointwise difference between the target and the corresponding phenotype ($C(\underline{x})$ and $P(\underline{x})_{j}$), the number of parameters in the calculation ($3 \times N$), the number of points summed over ($N_{\rm data}$), and an estimate of the noise level in the data ($\sigma_{\rm data}(\underline{x})$). Thus, $E(\underline{x})_j$ (effectively a normalised $\chi^2$ measure) is given by:
 \begin{displaymath}
E(\underline{x})_{j} = \frac{1}{(N_{\rm data} - 3N)}\,\sum_{x}\left(\frac{C(\underline{x}) - P(\underline{x})_{j}}{\sigma_{\rm data}(\underline{x})}\right)^{2}\end{displaymath} (3)
with $E(\underline{x})_{j} \sim 1$ indicating a "good" fit.
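
For illustration, Eq. (3) can be transcribed directly as follows; the arguments (observed spectrum, phenotype and per-channel noise estimate) are hypothetical array placeholders.

\begin{verbatim}
import numpy as np

def error_measure(target, model, sigma, n_lines):
    # Normalised chi-square of Eq. (3): compare the phenotype `model` with the
    # observed spectrum `target`, given noise `sigma` and N = n_lines profiles.
    dof = target.size - 3 * n_lines      # three parameters (X, A, W) per line
    return np.sum(((target - model) / sigma) ** 2) / dof
\end{verbatim}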

This measure is used to evaluate the fitness of each genotype. It is the fitness value that is used to rank all the genotypes in a particular population into ascending order and to "weight" the roulette wheel of Sect. 2.1.

