5 Statistical signal analysis

A realistic estimation of the compression efficiency must be based on a quantitative analysis of the signal statistics, which includes: statistics of the binary representation (Sect. 5.1), entropy (Sect. 5.2) and normality tests (Sect. 5.3).

5.1 Binary statistics

Figure 3 represents the frequency distribution of symbols when the full data stream of 60 scan circles is divided into 8-bit words. Since for most of the samples the range spans about 32 levels (5 bits), only the bytes corresponding to the MSB words assume a limited range of values, producing the narrow spike in the figure.

The belt shaped distribution at the edges is due to the set of LSB words. The distributions are quite sensitive to the quantization step, but do not change too much with the signal composition, the largest differences coming from the cosmological dipole contribution.

From the distribution in Fig. 3 one may wonder whether a more effective compression could be obtained by splitting the data stream into two substreams, the MSB substream and the LSB substream, each with its own compression efficiency. Since the two components are so different in their statistics, with the MSB substream having a higher level of redundancy than the original data stream, it would be reasonable to expect the final compression rate to be greater than the one obtained by compressing the original data stream directly. We tested this procedure with some of the compressors considered for the final test. These tests show that the MSB substream is indeed highly compressible, but since most of the redundancy of the original data stream is contained in the MSB substream, the LSB substream cannot be compressed effectively, and the split gives no gain over direct compression. The best way to perform an efficient compression is therefore to apply the compressor to the full stream without performing the MSB/LSB separation. Apart from these theoretical considerations, we performed some tests with our simulated data stream that confirm these results.
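The MSB/LSB comparison can be reproduced in miniature. The sketch below is only an illustration under stated assumptions: `zlib` stands in for the compressors actually tested, the signal is a hypothetical Gaussian noise of 10 adu rms around an arbitrary offset, and the helper names are invented for this example.

```python
import random
import struct
import zlib

def split_msb_lsb(samples):
    """Split 16-bit samples into the stream of high bytes and the stream of low bytes."""
    msb = bytes((s >> 8) & 0xFF for s in samples)
    lsb = bytes(s & 0xFF for s in samples)
    return msb, lsb

def compression_rate(raw):
    """Uncompressed size divided by zlib-compressed size (zlib is a stand-in compressor)."""
    return len(raw) / len(zlib.compress(raw, 9))

# Hypothetical test signal: one circle of Gaussian noise, rms 10 adu, offset 1000 adu.
rng = random.Random(0)
samples = [min(65535, max(0, round(rng.gauss(1000.0, 10.0)))) for _ in range(8640)]

# Full stream: 16-bit big-endian words, MSB byte first.
full = struct.pack(">%dH" % len(samples), *samples)
msb, lsb = split_msb_lsb(samples)

rate_full = compression_rate(full)
rate_msb = compression_rate(msb)
rate_lsb = compression_rate(lsb)

# Rate of the split scheme: total input bytes over total compressed bytes.
rate_split = len(full) / (len(zlib.compress(msb, 9)) + len(zlib.compress(lsb, 9)))

print(f"full {rate_full:.2f}  MSB {rate_msb:.2f}  LSB {rate_lsb:.2f}  split {rate_split:.2f}")
```

The MSB substream compresses far better than the LSB one because its bytes are almost constant. Whether `rate_split` ends up above or below `rate_full` depends on the compressor and on the data, so it should be measured rather than assumed.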

5.2 Entropy analysis

Equation (9) is valid in the limit of a continuous distribution of quantization levels. Since in our case the quantization step is about one tenth of the signal rms, this limit no longer holds. To properly estimate the maximum compression rate attainable from these data, we evaluate the entropy of the discretized signal for different values of the quantization parameters.

Our entropy evaluation code takes the input data stream, determines the relative frequency *f*_{s} of each symbol *s* in the quantized data stream, and computes the entropy as

$$H = -\sum_{s} f_{s} \log_{2} f_{s},$$

where *s* is the symbol index. In our simulation we take both 8 and 16 bits symbols (*s* spanning over 0, ..., 255 and 0, ..., 65535). Since in our scheme the ADC output is 16 bits, we considered the 8 bits symbol entropy both for the LSB and MSB 8 bits words, and the 8 bits entropy after merging the significant bits of the LSB and MSB words.
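The entropy evaluation described above can be sketched in a few lines. This is a minimal illustration rather than the code used for the paper; the function names are hypothetical, and the entropy is expressed in bits (base-2 logarithm) to match the units of Table 2.

```python
import math
from collections import Counter

def entropy_bits(symbols):
    """Shannon entropy, in bits per symbol, of a sequence of hashable symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    # f_s = c / total is the relative frequency of symbol s.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def word_entropies(samples):
    """Entropy of 16-bit samples taken as whole words, as MSB bytes and as LSB bytes."""
    h16 = entropy_bits(samples)
    h_msb = entropy_bits([(s >> 8) & 0xFF for s in samples])
    h_lsb = entropy_bits([s & 0xFF for s in samples])
    return h16, h_msb, h_lsb
```

For instance, `word_entropies([0x0100, 0x0101, 0x0102, 0x0103])` gives 2 bits for the 16-bit words, 0 bits for the constant MSB bytes and 2 bits for the LSB bytes.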

As expected, since the offset merely shifts the quantized signal distribution, the entropy does not depend on it. For this reason we take an offset of 0 V, i.e., no shift.

Table 2 reports the 16 bits entropy as a function of chunk length, signal composition and frequency.

30 GHz, White Noise

| Chunk length (samples) | Entropy (bits), Total | Entropy (bits), Mean | Entropy (bits), RMS | Compression rate, Total | Compression rate, Mean | Compression rate, RMS |
|---|---|---|---|---|---|---|
| 16 | 5.1618 | 3.5596 | 0.1989 | 3.10 | 4.49 | 0.251 |
| 32 | 5.1618 | 4.1815 | 0.1658 | 3.10 | 3.83 | 0.152 |
| 64 | 5.1618 | 4.6108 | 0.1262 | 3.10 | 3.47 | 0.095 |
| 135 | 5.1618 | 4.8791 | 0.0890 | 3.10 | 3.28 | 0.060 |
| 8640 | 5.1618 | 5.1561 | 0.0114 | 3.10 | 3.10 | 0.007 |
| 17280 | 5.1618 | 5.1589 | 0.0061 | 3.10 | 3.10 | 0.004 |

30 GHz, Full Signal

| Chunk length (samples) | Entropy (bits), Total | Entropy (bits), Mean | Entropy (bits), RMS | Compression rate, Total | Compression rate, Mean | Compression rate, RMS |
|---|---|---|---|---|---|---|
| 16 | 5.5213 | 3.5602 | 0.1982 | 2.90 | 4.49 | 0.250 |
| 32 | 5.5213 | 4.1849 | 0.1664 | 2.90 | 3.82 | 0.152 |
| 64 | 5.5213 | 4.6162 | 0.1278 | 2.90 | 3.47 | 0.096 |
| 135 | 5.5213 | 4.8885 | 0.0893 | 2.90 | 3.27 | 0.060 |
| 8640 | 5.5213 | 5.5119 | 0.0176 | 2.90 | 2.90 | 0.009 |
| 17280 | 5.5213 | 5.5157 | 0.0118 | 2.90 | 2.90 | 0.006 |

100 GHz, White Noise

| Chunk length (samples) | Entropy (bits), Total | Entropy (bits), Mean | Entropy (bits), RMS | Compression rate, Total | Compression rate, Mean | Compression rate, RMS |
|---|---|---|---|---|---|---|
| 16 | 5.7436 | 3.6962 | 0.1740 | 2.79 | 4.33 | 0.204 |
| 32 | 5.7436 | 4.4174 | 0.1521 | 2.79 | 3.62 | 0.125 |
| 64 | 5.7436 | 4.9627 | 0.1230 | 2.79 | 3.22 | 0.080 |
| 135 | 5.7436 | 5.3354 | 0.0875 | 2.79 | 3.00 | 0.049 |
| 8640 | 5.7436 | 5.7352 | 0.0115 | 2.79 | 2.79 | 0.006 |
| 17280 | 5.7436 | 5.7394 | 0.0063 | 2.79 | 2.79 | 0.003 |

100 GHz, Full Signal

| Chunk length (samples) | Entropy (bits), Total | Entropy (bits), Mean | Entropy (bits), RMS | Compression rate, Total | Compression rate, Mean | Compression rate, RMS |
|---|---|---|---|---|---|---|
| 16 | 5.8737 | 3.6970 | 0.1734 | 2.72 | 4.33 | 0.203 |
| 32 | 5.8737 | 4.4186 | 0.1526 | 2.72 | 3.62 | 0.125 |
| 64 | 5.8737 | 4.9655 | 0.1224 | 2.72 | 3.22 | 0.079 |
| 135 | 5.8737 | 5.3419 | 0.0887 | 2.72 | 3.00 | 0.050 |
| 8640 | 5.8737 | 5.8604 | 0.0180 | 2.72 | 2.73 | 0.008 |
| 17280 | 5.8737 | 5.8655 | 0.0127 | 2.72 | 2.73 | 0.006 |
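The last three columns of each block of Table 2 are numerically consistent with the ratio between the 16-bit sample length and the entropy given in the first three columns. The check below assumes that the columns were obtained in this way, which is inferred from the numbers rather than stated in the text; the symbols $H$ (entropy) and $C_{\mathrm{r}}$ (compression rate) are notation chosen for this check.

```latex
C_{\mathrm{r}} = \frac{16\ \mathrm{bits}}{H}, \qquad
\frac{16}{5.1618} \simeq 3.10, \quad
\frac{16}{5.5213} \simeq 2.90, \quad
\frac{16}{5.7436} \simeq 2.79, \quad
\frac{16}{5.8737} \simeq 2.72 .
% First-order propagation of the entropy rms to the compression rate rms,
% checked on the 30 GHz white noise row with chunks of 16 samples:
\sigma_{C_{\mathrm{r}}} \simeq C_{\mathrm{r}}\,\frac{\sigma_{H}}{H}, \qquad
4.49 \times \frac{0.1989}{3.5596} \simeq 0.251 .
```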

As expected, the entropy, i.e. the information content, increases with increasing quantization resolution.

Since data will be packed in chunks of finite length, it is important to study not only the entropy distribution for the entire data stream, which indicates the overall compressibility of the data stream as a whole, but also the entropy distribution for short packets of fixed length. Each data stream was therefore split into an integer number of chunks of fixed length. For each chunk the entropy was measured, and the corresponding distribution of entropies for the given chunk length was obtained, together with its mean and rms. We take chunk lengths of 16, 32, 64, 135, 8640 and 17280 16-bit samples, so each simulated data stream is split into 32400, 16200, 8100, 3840, 60 and 30 chunks, respectively. Small chunk sizes are introduced to study the entropy distribution as seen by most real compressors, which do not compress one circle (8640 samples) at a time. Long chunk distributions are useful to understand the entropy distribution for the overall data stream.
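The chunked analysis can be sketched as follows. This is a minimal illustration with hypothetical function names, not the code used for the paper; it assumes the rms is the population standard deviation of the per-chunk entropies.

```python
import math
import statistics
from collections import Counter

def entropy_bits(symbols):
    """Shannon entropy, in bits per symbol, of a sequence of symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def chunk_entropies(samples, chunk_len):
    """Entropy of each complete chunk of `chunk_len` consecutive samples."""
    n_chunks = len(samples) // chunk_len
    return [entropy_bits(samples[i * chunk_len:(i + 1) * chunk_len])
            for i in range(n_chunks)]

def chunk_entropy_stats(samples, chunk_len):
    """Mean and rms of the per-chunk entropy distribution."""
    h = chunk_entropies(samples, chunk_len)
    return statistics.fmean(h), statistics.pstdev(h)

# Number of chunks for 60 circles of 8640 samples each.
N_SAMPLES = 60 * 8640
CHUNK_LENGTHS = (16, 32, 64, 135, 8640, 17280)
N_CHUNKS = [N_SAMPLES // length for length in CHUNK_LENGTHS]
```

A chunk of *L* samples contains at most *L* distinct symbols, so its entropy cannot exceed log2(*L*) bits. For chunks of 16 samples this bound is 4 bits, which is consistent with the mean entropies below 4 bits reported for that chunk length in Table 2.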

The entropy distribution per chunk is approximately described by a normal distribution (see Fig. 4), so the mean entropy and its rms are enough to characterize the results.

Note that the corresponding distribution of compression rates is not exactly normal; for the sake of this analysis, however, we will assume that it is.

The mean entropy measured over one scan circle (8640 samples) coincides with the entropy measured for the full set of 60 scan circles, the entropy rms being of the order of 10^{-2} bits. Consequently the expected rms of the compression rate when compressing one or more circles at a time will be correspondingly small (below 10^{-2} in Table 2).

The mean entropy and its rms are not independent quantities. The averaged entropy decreases as the chunk length decreases, but correspondingly the entropy rms increases. As a consequence, reducing the chunk length changes the averaged compression rate, but it also increases the fraction of chunks on which the compressor performs significantly worse than average. The overall compression rate, i.e. the rate referred to the full mission, is affected by these chunks.

5.3 Normality tests

Since a normal distribution of the signal is assumed in Sect. 4, it is of interest to establish how much the digitized signal distribution deviates from normality. It is also important to characterize the influence of the 1/*f* noise and of the other signal components, especially the cosmic dipole, in the genesis of such deviations. To obtain an efficient compression, the samples should be as far as possible statistically uncorrelated and normally distributed. In addition one should make sure that the detection chain does not cause any systematic effect that introduces spurious, non-normally distributed components. This is relevant not only for the compression problem itself, which among the data processing operations is the least sensitive to small deviations from the normal distribution, but also in view of the future data reduction, calibration and analysis, for which the hypothesis of normality in the signal distribution is very important in order to allow a good separation of the foreground components. Last but not least, the hypothesis that normality is conserved along the detection chain is important for the scientific interpretation of the results, since the accuracy expected from the PLANCK-LFI experiment should make it possible to verify whether the distribution of the CMB fluctuations is really normal, as predicted by the standard inflationary models, or not, as seems to be suggested by recent 4-year COBE/DMR results (Bromley & Tegmark 1999; Ferreira et al. 1999).

For this reason a set of normality tests was applied to the different components of the simulated signal before and after digitization, in order to characterize the signal statistics and their variation along the detection process. Of course this work may be regarded only as a first step in this direction: a true calibration of the signal statistics will be possible only when the front end electronics simulator becomes available. These tests are also valuable as a preparation for the study of the true signal.

Normality tests were applied to the same data streams used for data compression. Given the on board memory limits, it is unlikely that more than a few circles at a time can be stored before compression, so the statistical tests were performed regarding each data stream for a given pointing as a collection of 60 independent realizations of the same process. Of course this is only approximately true. The 1/*f* noise correlates subsequent scan circles, but since its rms amplitude per sample is typically about one-tenth of the white noise rms or less, these correlations can be neglected in this analysis.

Starting from the folded data streams, a given normality test was applied to each set of 60 realizations for each one of the 8640 samples, transforming the *stream of samples* into a *stream of test results* for the given test. The cumulative frequency distribution was then computed over the 8640 test results. Since 60 samples do not represent a large statistical sample, significant deviations from the theoretically evaluated confidence levels are expected, resulting in excessive rejection or acceptance rates. For this reason each test was calibrated by applying it to the undigitized white noise data stream. Moreover, in order to analyze how normality evolves with increasing signal complexity, the tests were repeated while increasing the information content of the generated data stream.

To simplify the discussion we considered as a reference test the usual Kolmogorov-Smirnov *D* test from Press et al. (1986), and we fix a 95% acceptance level. The test was "calibrated" using the Monte-Carlo white noise generator of our mission simulator, in order to fix the threshold level as the *D* value below which 95% of our samples fall. From Table 3 the quantization effect is evident: at twice the nominal quantization step, in 30% of the samples (i.e. 2592 samples) the distribution of realizations deviates from a normal distribution.

| Quantization step (mK/adu) | 1.220 | 0.610 | 0.406 |
|---|---|---|---|
| | 0.28 | 0.70 | 0.84 |
| | 0.27 | 0.71 | 0.86 |
| | 0.2449 | 0.1851 | 0.1678 |
| | 0.95 | 0.95 | 0.95 |
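The per-sample testing and calibration procedure can be sketched as follows. This is only an illustration under stated assumptions: the *D* statistic is computed against a normal law whose mean and standard deviation are estimated from the 60 realizations (the text does not say how the reference distribution was parametrized), only 500 sample positions are used instead of 8640 to keep the run short, the noise has unit rms, and the quantization step of one rms is deliberately coarse. All function names are hypothetical.

```python
import random
from statistics import NormalDist, fmean, pstdev

def ks_d_normal(values):
    """KS D statistic of `values` against a normal law fitted to the same values."""
    n = len(values)
    sigma = pstdev(values)
    if sigma == 0.0:
        return 1.0  # all realizations identical: maximally non-normal
    dist = NormalDist(fmean(values), sigma)
    d = 0.0
    for i, x in enumerate(sorted(values)):
        cdf = dist.cdf(x)
        # Distance to the empirical step function just after and just before x.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d

def d_per_sample(circles):
    """Apply the test across realizations, once per sample position."""
    return [ks_d_normal(column) for column in zip(*circles)]

def calibrate_threshold(d_values, level=0.95):
    """D value below which a fraction `level` of the calibration results fall."""
    ordered = sorted(d_values)
    return ordered[min(len(ordered) - 1, int(level * len(ordered)))]

def acceptance_fraction(d_values, d_threshold):
    """Fraction of sample positions with D below the threshold."""
    return sum(d < d_threshold for d in d_values) / len(d_values)

def quantize(x, step):
    """Round x to the nearest multiple of the quantization step."""
    return step * round(x / step)

rng = random.Random(1)
n_circles, n_positions = 60, 500
white = [[rng.gauss(0.0, 1.0) for _ in range(n_positions)] for _ in range(n_circles)]

d_white = d_per_sample(white)
d_th = calibrate_threshold(d_white)           # calibration on undigitized white noise
coarse = [[quantize(x, 1.0) for x in circle] for circle in white]

frac_white = acceptance_fraction(d_white, d_th)
frac_coarse = acceptance_fraction(d_per_sample(coarse), d_th)
print(f"threshold {d_th:.4f}  accepted: white {frac_white:.2f}  quantized {frac_coarse:.2f}")
```

By construction the undigitized white noise is accepted at about the chosen level, while the coarsely quantized stream is accepted far less often, which is the qualitative behaviour shown in Table 3.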

Since the theoretical compression rate from Eq. (9) holds for a continuous distribution of levels, a smaller compression rate should be expected. Since the deviation from the normal distribution is a systematic effect, for the sake of cosmological data analysis one may tune the

As for the entropy distribution and the binary statistics, in this case too most of the differences between the results obtained for a pure white noise signal and for the full signal are explained by the presence of the cosmological dipole. These simulations are not accurate enough to draw any quantitative conclusion about the distortion in the sampling statistics induced by digitization, but they suggest that approximating the instrumental signal as a quantized white noise plus a cosinusoidal term associated with the cosmic dipole is more than adequate to understand the optimal loss-less compression rate achievable in the case of the PLANCK-LFI mission.

Copyright The European Southern Observatory (ESO)