Up: An image database



4 Cleaning with a neural network

4.1 Construction of a large training sample

We use a Neural Network (hereafter NN) method to perform the cleaning. This requires the construction of a large training sample. We build it by cross-identifying each object of our preliminary catalogue with known stars or galaxies. The known stars are taken from the SAO catalogue. The known galaxies are taken from the LEDA database.

The cross-identification is based on the J2000 equatorial coordinates. A match is accepted only when a single counterpart lies within a radius of 10''. This severe constraint removes interacting objects, which are not suitable for a training sample. We thus obtain 54 186 objects classified as galaxy "G'' and 90 339 classified as star "S''. A further 2105 objects are classified as defect "D'' because of their discrepant characteristics (e.g., a very elongated matrix with nli/npx > 25 or npx/nli > 25). Objects with $\log D < 1.25$ are not used for the construction of the training sample.
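The single-counterpart criterion above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the tuple representation of positions (RA, Dec in degrees) and the brute-force search are all assumptions made for clarity.

```python
import math

ARCSEC = math.pi / (180 * 3600)  # one arcsecond in radians

def angular_sep(ra1, dec1, ra2, dec2):
    """Angular separation (radians) between two J2000 positions in degrees,
    using the haversine formula (numerically stable at small separations)."""
    r1, d1, r2, d2 = (math.radians(v) for v in (ra1, dec1, ra2, dec2))
    a = (math.sin((d2 - d1) / 2) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((r2 - r1) / 2) ** 2)
    return 2 * math.asin(math.sqrt(a))

def cross_identify(obj, reference, radius_arcsec=10.0):
    """Return the single reference entry within radius_arcsec of obj,
    or None when there is no match or more than one -- the severe
    constraint that rejects interacting/blended objects."""
    ra, dec = obj
    matches = [r for r in reference
               if angular_sep(ra, dec, r[0], r[1]) < radius_arcsec * ARCSEC]
    return matches[0] if len(matches) == 1 else None
```

In practice a brute-force scan over the full SAO and LEDA catalogues would be far too slow; a spatial index (e.g. sorting by declination zones) would be needed, but the acceptance logic is the same.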

4.2 Definition of neural network input parameters

The NN has three outputs: G, S and D for galaxies, stars and defects, respectively. The choice of the input parameters of the NN is important: they must discriminate strongly between these three classes. There is no general rule for choosing such parameters. We define seven parameters, which will be tested with our training sample:

1.
P1: The dispersion of pixel optical densities. This is the parameter defined for the first analysis (Sect. 2);

2.
P2: The inverse of the surface area. This is also the parameter previously used;

3.
P3: The logarithm of the axis ratio of the object. An elongated object, if not detected as a defect, is more likely to be a galaxy than a star. It is better to use the axis ratio of the object (calculated as in Sect. 5) than the ratio of the sides of the matrix (npx/nli or nli/npx), because a very elongated object can lie along a diagonal of the matrix;

4.
P4: The square of the external perimeter divided by the matrix surface area. This parameter is simply defined as $4(npx+nli)^2 / (npx \, nli)$ and it is very sensitive to elongated features like scratches, or branches of a diffraction cross, or satellite tracks;

5.
P5: The ratio of the object surface area divided by the matrix surface area. This parameter is useful for detecting artefacts like those encountered on calibrating spots near the edge of the plate (in this case the ratio is nearly one), patchy objects or elongated features;

6.
P6: The diffraction-cross parameter dc. The matrix is divided into nine identical rectangles numbered from 1 to 9 according to Fig. 7. The diffraction cross is defined as the mean intensity of rectangles 2, 4, 6, 8 divided by the mean intensity of rectangles 1, 3, 7, 9;


\begin{figure}
\par\includegraphics[width=5cm,clip]{ds1851f7.eps}
\end{figure}
Figure 7: Decomposition of a matrix into nine rectangles. This decomposition is used to define the diffraction-cross parameter dc and the defect parameter df.

7.
P7: The defect parameter. Let na (resp. nb) be the number of pixels with an intensity above (resp. below) the sky background intensity inside the central rectangle (5). The defect parameter is defined as df=(nb-na)/(nb+na); it measures the excess of blank pixels inside the central rectangle. For an astronomical object we expect few, if any, blank pixels in the very centre, so df is close to -1 for a normal object; for an extreme defect, df=1.

In Figs. 8 to 10 we show some of the discriminant parameters for galaxies and stars.


\begin{figure}
\par\includegraphics[width=8cm]{ds1851f8.eps}
\end{figure}
Figure 8: Histogram of the square of the external perimeter divided by the matrix surface area for stars (solid line) and galaxies (dashed line).


\begin{figure}
\par\includegraphics[width=8cm]{ds1851f9.eps}
\end{figure}
Figure 9: Histogram of the diffraction-cross parameter for stars (solid line) and galaxies (dashed line).


\begin{figure}
\par\includegraphics[width=8cm,clip]{ds1851f10.eps}
\end{figure}
Figure 10: Histogram of the ratio of the object surface area divided by the matrix surface area for stars (solid line) and galaxies (dashed line).
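The two matrix-based parameters dc (P6) and df (P7) can be sketched directly from their definitions. This is an illustrative implementation, not the authors' code: the use of NumPy, the 3x3 split via `array_split`, and the treatment of pixels exactly at the sky level (counted in neither na nor nb) are assumptions.

```python
import numpy as np

def diffraction_cross(matrix):
    """dc parameter (P6): mean intensity of the edge-centre rectangles
    (2, 4, 6, 8) divided by the mean intensity of the corner rectangles
    (1, 3, 7, 9), after splitting the matrix into a 3x3 grid (Fig. 7)."""
    m = np.asarray(matrix, dtype=float)
    rows = np.array_split(m, 3, axis=0)
    blocks = [np.array_split(r, 3, axis=1) for r in rows]  # numbered row-wise 1..9
    edges = [blocks[0][1], blocks[1][0], blocks[1][2], blocks[2][1]]    # 2, 4, 6, 8
    corners = [blocks[0][0], blocks[0][2], blocks[2][0], blocks[2][2]]  # 1, 3, 7, 9
    mean_of = lambda bs: np.mean([b.mean() for b in bs])
    return mean_of(edges) / mean_of(corners)

def defect_parameter(matrix, sky):
    """df parameter (P7): (nb - na) / (nb + na) over the central
    rectangle (5), where na counts pixels above the sky background
    and nb counts pixels below it."""
    m = np.asarray(matrix, dtype=float)
    centre = np.array_split(np.array_split(m, 3, axis=0)[1], 3, axis=1)[1]
    na = int((centre > sky).sum())
    nb = int((centre < sky).sum())
    return (nb - na) / (nb + na)
```

A diffraction cross brightens rectangles 2, 4, 6 and 8 relative to the corners, driving dc above one; a blank central rectangle drives df towards +1.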

4.3 The neural network definition

After some trial and error, we adopted the NN represented in Fig. 11. Because there are only three output parameters (G, S, D), we adopted a simple NN with a single intermediate layer of 10 neurones. The input is a vector with seven components and the output is a vector with three components, so there are 7 x 10 + 10 x 3 = 100 free weights W connecting neurones of adjacent layers. The expected output vectors are (1, 0, 0), (0, 1, 0) and (0, 0, 1).

The different steps of the NN training are the following. First, the weights are randomly chosen between -1 and 1. Then the training sample is read and each individual parameter is normalized by subtracting its mean and dividing by its standard deviation, both calculated over the whole sample; each object is thus a vector of seven normalized input components Pi. For each object, the seven parameters Pi are entered and propagated down to the last (output) layer. The input X of a given neurone is the weighted sum of its input connections, while its output is calculated through a nonlinear sigmoid function:

\begin{displaymath}
s = \frac{1}{1+{\rm e}^{-X}}
\end{displaymath} (3)

The error vector E is determined by comparing the calculated output vector with the known output vector (this is possible because the training sample is used). Then E is propagated back, using the weights to share the error among the different branches and the derivative of the sigmoid function

\begin{displaymath}
\frac{{\rm d}s}{{\rm d}X} = \frac{{\rm e}^{-X}}{(1+{\rm e}^{-X})^{2}} = s\,(1-s)
\end{displaymath} (4)

for crossing back a neurone. Finally, the weights are corrected accordingly.

The process is repeated (i.e., the normalized input parameters are entered and the calculation is done again) until the system becomes stable. In practice this is done by testing different iteration numbers.
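The training steps above can be sketched as a 7-10-3 back-propagation loop. This is a minimal reconstruction under stated assumptions: the paper gives the architecture, the [-1, 1] weight initialization, the normalization and the sigmoid, but not the learning rate, the error function or any implementation details, so those (and the class/function names) are choices made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # Eq. (3)

class SimpleNN:
    """7-10-3 feed-forward network trained by plain back-propagation."""
    def __init__(self, n_in=7, n_hidden=10, n_out=3, lr=0.5):
        # weights drawn uniformly in [-1, 1], as in the paper;
        # the learning rate lr is an assumption
        self.W1 = rng.uniform(-1, 1, (n_hidden, n_in))
        self.W2 = rng.uniform(-1, 1, (n_out, n_hidden))
        self.lr = lr

    def forward(self, p):
        self.h = sigmoid(self.W1 @ p)       # intermediate layer
        self.o = sigmoid(self.W2 @ self.h)  # output layer (G, S, D)
        return self.o

    def backward(self, p, target):
        # output error, propagated back through ds/dX = s(1-s) (Eq. 4)
        d_out = (self.o - target) * self.o * (1 - self.o)
        d_hid = (self.W2.T @ d_out) * self.h * (1 - self.h)
        self.W2 -= self.lr * np.outer(d_out, self.h)
        self.W1 -= self.lr * np.outer(d_hid, p)

def train(net, inputs, targets, n_iter=100):
    """Normalize each parameter over the whole sample, then iterate."""
    X = (inputs - inputs.mean(axis=0)) / inputs.std(axis=0)
    for _ in range(n_iter):
        for p, t in zip(X, targets):
            net.forward(p)
            net.backward(p, t)
    return X  # normalized inputs, for later prediction
```

With the adopted settings (10 intermediate neurones, 100 iterations) such a loop converges quickly on well-separated classes.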


\begin{figure}
\par\includegraphics[width=8.8cm,clip]{ds1851f11.eps}
\end{figure}
Figure 11: Representation of the adopted neural network. P1 to P7 are the seven input parameters. G, S, D are the three output values for galaxy, star and defect, respectively. Each open circle is a neurone; the connection between two neurones has a weight W.

We did some preliminary tests on the whole training sample to find the best number of intermediate neurones and the best number of iterations. We tested the number of intermediate neurones between 7 and 42 and the number of iterations between 50 and 600. Finally, we decided to adopt 10 intermediate neurones and 100 iterations.

4.4 Setting and testing the NN

An efficient way to demonstrate the success of an automatic classification programme is to use "control samples'': the NN outputs (G, S, D) are determined for each control sample in the same way and compared with independent reference classifications. We built nine such control samples. The whole sample of 132 972 objects with proper object classification was divided into ten non-overlapping subsamples S0 to S9 of equal size (each about 1/30-th of the total sample). The NN was trained ten times, and we kept the configuration (obtained with S0) giving the best result for the whole sample. Then, to prove the validity of the NN configured with S0 only, we applied this configuration to the nine independent samples S1 to S9. The results for these nine control samples are given in Table 1.
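The subsample construction can be sketched as below. The subsample size of 4667 follows Table 1; the random draw, the seed and the function name are assumptions (the paper does not say how the ten subsamples were selected).

```python
import numpy as np

def make_subsamples(n_objects, n_sub=10, size=4667, seed=0):
    """Draw n_sub non-overlapping subsamples of equal size from a
    classified sample of n_objects, returned as index arrays S0..S9."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_objects)[: n_sub * size]  # disjoint by construction
    return np.split(idx, n_sub)
```

S0 would then serve as the training subsample and S1 to S9 as the independent control subsamples.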


 

 
Table 1: Application of the NN, configured with subsample S0, to nine independent subsamples S1 to S9, for which the result of the classification is known. When the NN gives the correct answer the result is counted as a success. We give the size and the percentage of successes for each subsample S1 to S9 and for the whole sample

sample          size      percentage of success
S1              4667      94%
S2              4667      94%
S3              4667      93%
S4              4667      93%
S5              4667      93%
S6              4667      92%
S7              4667      94%
S8              4667      94%
S9              4667      94%
Total sample    132972    84%


Obviously, the components of the calculated output vector (G, S, D) are not exactly 0 or 1. The component G obtained with the training sample is shown in Fig. 12. Most of its values are 0 or 1 (i.e., the NN answers either "yes'' if it is a galaxy, or "no'' if it is not a galaxy).


\begin{figure}
\par\includegraphics[width=8cm,clip]{ds1851f12.eps}
\end{figure}
Figure 12: NN output G obtained with the training sample. Most of the values are close to zero or one. The S and D components have exactly the same bimodal distribution.

In our control we considered a result as good when the largest output component corresponds to the expected one. For instance, if we obtained G=0.7, S=0.6, D=0.1 for an object known to be a galaxy (G=1, S=0, D=0), we concluded that the NN gave the right answer, because the largest component is G.
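This winner-take-all criterion is simple enough to state in a few lines; the function name and the tuple ordering (G, S, D) are illustrative choices.

```python
def is_success(output, expected):
    """Control criterion of Sect. 4.4: a classification is counted as a
    success when the largest component of the NN output vector (G, S, D)
    matches the largest component of the expected vector."""
    classes = ("G", "S", "D")
    predicted = classes[max(range(3), key=lambda i: output[i])]
    truth = classes[max(range(3), key=lambda i: expected[i])]
    return predicted == truth
```

For the example above, `is_success((0.7, 0.6, 0.1), (1, 0, 0))` counts as a success even though S=0.6 is close to G=0.7.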

For the final application of the NN we imposed more severe constraints in order to reduce contamination of the galaxy catalogue by stars or defects. The adopted conditions are those given in Table 2. Further, some objects are considered a priori as defects when the parameter P5 (ratio of the object surface area to the matrix surface area) is larger than 0.95 (the case of a matrix almost without sky background pixels), or when the axis ratio is larger than 100.


 

 
Table 2: Additional constraints

Conditions                                   Classification     code
$G \ge 0.9$ and $S < 0.5$ and $D < 0.5$      Galaxy             G
$G \ge 0.8$ and $S < 0.2$ and $D < 0.2$      Probable Gal.      g
$S \ge 0.8$ and $D < 0.5$ and $G < 0.5$      Star               S
$D > S$ and $D > G$                          Defect             D
otherwise                                    Possible Gal.      -
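The constraints of Table 2 translate directly into a decision function. Two assumptions are made here: the table's rows are applied in order of appearance (so the stricter galaxy cut takes precedence), and the star condition is read in terms of the S output, consistently with the three NN outputs G, S, D defined in Sect. 4.2.

```python
def classify(G, S, D):
    """Apply the additional constraints of Table 2 to the three NN
    outputs and return the classification code.  The a-priori defect
    cuts on P5 > 0.95 and axis ratio > 100 are applied separately."""
    if G >= 0.9 and S < 0.5 and D < 0.5:
        return "G"   # galaxy
    if G >= 0.8 and S < 0.2 and D < 0.2:
        return "g"   # probable galaxy
    if S >= 0.8 and D < 0.5 and G < 0.5:
        return "S"   # star
    if D > S and D > G:
        return "D"   # defect
    return "-"       # possible galaxy
```

Note that the "otherwise'' class still collects objects such as (G=0.7, S=0.6, D=0.1), which the winner-take-all control would have accepted as a galaxy; this is how the stricter cuts reduce contamination.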


Using this NN cleaning we classified 1 147 332 objects as galaxies (G), 134 509 as probable galaxies (g) and 1 940 573 as "possible galaxies'' (-). We also classified 179 842 objects as stars (S), in addition to the catalogue of 47 million stars previously extracted, and 946 884 as defects (D).



Copyright The European Southern Observatory (ESO)