

5 Statistical signal analysis

A realistic estimation of the compression efficiency must be based on a quantitative analysis of the signal statistics, which includes: statistics of the binary representation (Sect. 5.1), entropy (Sect. 5.2) and normality tests (Sect. 5.3).

5.1 Binary statistics

Most of the off-the-shelf compressors considered here handle 8-bit words rather than 16-bit words. The 16-bit samples produced by the ADC unit are therefore split into two consecutive 8-bit (1-byte) words, labeled the most significant bits (MSB) word and the least significant bits (LSB) word. To understand the limits on compression efficiency, it is important to understand the statistical distribution of the 8-bit words composing the quantized signal from LFI.

Figure 3 represents the frequency distribution of symbols when the full data stream of 60 scan circles is divided into 8-bit words. Since for most of the samples the range spans $\approx 64$ levels (5 bits), only the bytes corresponding to the MSB words assume a limited range of values, producing the narrow spike in the figure.

Figure 3: Statistical distribution of 8- and 16-bit words for LFI simulated signals. Upper row: 30 GHz; lower row: 100 GHz. Left column: distributions of 16-bit words from the quantized signals; right column: 8-bit words from the quantized signal. Full line: pure white noise; dashed line: full signal.

The belt-shaped distribution at the edges is due to the set of LSB words. The distributions are quite sensitive to the quantization step but change little with the signal composition, the largest differences coming from the cosmological dipole contribution.

From the distribution in Fig. 3 one may wonder whether a more effective compression could be obtained by splitting the data stream into two substreams: the MSB substream (with compression efficiency $C_{{\rm r}}^{{\rm MSB}}$) and the LSB substream (with compression efficiency $C_{{\rm r}}^{{\rm LSB}}$). Since the two components differ so much in their statistics, with the MSB substream having a higher level of redundancy than the original data stream, it would be reasonable to expect the final compression rate $2/(1/C_{{\rm r}}^{{\rm MSB}} + 1/C_{{\rm r}}^{{\rm LSB}})$ to be greater than the compression rate obtained by compressing the original data stream directly. We tested this procedure with some of the compressors considered for the final test, using our simulated data stream. These tests show that $C_{{\rm r}}^{{\rm MSB}} \gg C_{{\rm r}}$, but since most of the redundancy of the original data stream is contained in the MSB substream, the LSB substream cannot be compressed effectively. As a result, $C_{{\rm r}}^{{\rm LSB}} < C_{{\rm r}}$ and $2/(1/C_{{\rm r}}^{{\rm MSB}} + 1/C_{{\rm r}}^{{\rm LSB}}) \lesssim C_{{\rm r}}$. The most efficient approach is therefore to apply the compressor to the full stream, without the MSB/LSB separation.
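The comparison between split and direct compression can be sketched as follows. This is only an illustration under stated assumptions: `zlib` stands in for the compressors actually tested, the signal is replaced by hypothetical white noise with an rms of ten quantization steps around mid-range, and all function names are made up for this example.

```python
import random
import struct
import zlib

def compression_rate(raw: bytes) -> float:
    """Ratio of uncompressed to compressed size for one byte stream."""
    return len(raw) / len(zlib.compress(raw, 9))

def combined_rate(cr_msb: float, cr_lsb: float) -> float:
    """Overall rate when two equal-length substreams are compressed separately."""
    return 2.0 / (1.0 / cr_msb + 1.0 / cr_lsb)

def simulate_samples(n: int, sigma_adu: float = 10.0, offset: int = 32768, seed: int = 1):
    """Hypothetical stand-in for the quantized signal: white noise, in adu."""
    rng = random.Random(seed)
    return [min(65535, max(0, offset + round(rng.gauss(0.0, sigma_adu))))
            for _ in range(n)]

samples = simulate_samples(8640 * 4)
# Big-endian packing puts the MSB word first, followed by the LSB word.
full = struct.pack(f">{len(samples)}H", *samples)
msb, lsb = full[0::2], full[1::2]

cr_full = compression_rate(full)
cr_msb = compression_rate(msb)
cr_lsb = compression_rate(lsb)
cr_split = combined_rate(cr_msb, cr_lsb)
print(f"full {cr_full:.2f}  MSB {cr_msb:.2f}  LSB {cr_lsb:.2f}  split {cr_split:.2f}")
```

The harmonic-mean form of `combined_rate` is simply the total input size divided by the total compressed size of the two substreams, which is why a poorly compressible LSB substream dominates the result.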

5.2 Entropy analysis

Equation (9) is valid in the limit of a continuous distribution of quantization levels. Since in our case the quantization step is about one tenth of the signal rms, this limit no longer applies. To properly estimate the maximum compression rate attainable from these data we evaluate the entropy of the discretized signal using different values of the ${{\rm VOT}}$.

Our entropy evaluation code takes the input data stream, determines the frequency $f_s$ of each symbol $s$ in the quantized data stream, and computes the entropy as $- \sum_s f_s \log_2 f_s$, where $s$ is the symbol index. In our simulation we take both 8- and 16-bit symbols ($s$ spanning 0, $\dots $, 255 and 0, $\dots $, $65\, 535$, respectively). Since in our scheme the ADC output is 16 bits, we considered the 8-bit symbol entropy separately for the LSB and MSB 8-bit words, as well as the 8-bit entropy after merging the LSB and MSB word sets.
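A minimal sketch of such an entropy evaluation is given below. It is not the code used for the paper: the function names and the toy sample values are hypothetical, and "merging" is read here as computing the entropy over the combined stream of MSB and LSB bytes. The conversion $C_{{\rm r}} = 16/H$ is an assumption that is consistent with the values listed in Table 2.

```python
import math
from collections import Counter

def entropy_bits(symbols) -> float:
    """Shannon entropy -sum_s f_s log2 f_s of a sequence of hashable symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def split_bytes(samples):
    """Split 16-bit samples into MSB words, LSB words and the merged byte stream."""
    msb = [(s >> 8) & 0xFF for s in samples]
    lsb = [s & 0xFF for s in samples]
    merged = [b for pair in zip(msb, lsb) for b in pair]
    return msb, lsb, merged

# Toy 16-bit samples in adu (illustrative values only).
samples = [32768, 32769, 32768, 32770, 32767, 32768, 32769, 32768]
msb, lsb, merged = split_bytes(samples)

h16 = entropy_bits(samples)
print("16-bit entropy:", h16)
print("8-bit entropy (MSB, LSB, merged):",
      entropy_bits(msb), entropy_bits(lsb), entropy_bits(merged))
print("C_r implied by the 16-bit entropy:", 16.0 / h16)
```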

As expected, since ${\rm AFO}$ merely shifts the quantized signal distribution, the entropy does not depend on ${\rm AFO}$. For this reason we take ${\rm AFO} = 0$ V, i.e., no shift.

Table 2 reports the 16-bit entropy as a function of ${{\rm VOT}}$, composition and frequency.


Table 2: Entropy for 16-bit samples at 30 and 100 GHz, for white noise only and for the full signal, as a function of $L_{{\rm chunk}}$. Total Entropy refers to the entropy computed over the full set of samples ($8640 \times 60$); Mean and RMS Entropy are the mean and rms over different realizations of chunks of samples of length $L_{{\rm chunk}}$. The same holds for the $C_{{\rm r}}$ columns, whose values are derived from the corresponding values of the entropy. The quantization step is $\Delta = 0.305$ mK/adu

30 GHz, White Noise
                     Entropy (bits)               $C_{{\rm r}}$
$L_{{\rm chunk}}$    Total    Mean     RMS        Total   Mean   RMS
16                   5.1618   3.5596   0.1989     3.10    4.49   0.251
32                   5.1618   4.1815   0.1658     3.10    3.83   0.152
64                   5.1618   4.6108   0.1262     3.10    3.47   0.095
135                  5.1618   4.8791   0.0890     3.10    3.28   0.060
8640                 5.1618   5.1561   0.0114     3.10    3.10   0.007
17280                5.1618   5.1589   0.0061     3.10    3.10   0.004

30 GHz, Full Signal
                     Entropy (bits)               $C_{{\rm r}}$
$L_{{\rm chunk}}$    Total    Mean     RMS        Total   Mean   RMS
16                   5.5213   3.5602   0.1982     2.90    4.49   0.250
32                   5.5213   4.1849   0.1664     2.90    3.82   0.152
64                   5.5213   4.6162   0.1278     2.90    3.47   0.096
135                  5.5213   4.8885   0.0893     2.90    3.27   0.060
8640                 5.5213   5.5119   0.0176     2.90    2.90   0.009
17280                5.5213   5.5157   0.0118     2.90    2.90   0.006

100 GHz, White Noise
                     Entropy (bits)               $C_{{\rm r}}$
$L_{{\rm chunk}}$    Total    Mean     RMS        Total   Mean   RMS
16                   5.7436   3.6962   0.1740     2.79    4.33   0.204
32                   5.7436   4.4174   0.1521     2.79    3.62   0.125
64                   5.7436   4.9627   0.1230     2.79    3.22   0.080
135                  5.7436   5.3354   0.0875     2.79    3.00   0.049
8640                 5.7436   5.7352   0.0115     2.79    2.79   0.006
17280                5.7436   5.7394   0.0063     2.79    2.79   0.003

100 GHz, Full Signal
                     Entropy (bits)               $C_{{\rm r}}$
$L_{{\rm chunk}}$    Total    Mean     RMS        Total   Mean   RMS
16                   5.8737   3.6970   0.1734     2.72    4.33   0.203
32                   5.8737   4.4186   0.1526     2.72    3.62   0.125
64                   5.8737   4.9655   0.1224     2.72    3.22   0.079
135                  5.8737   5.3419   0.0887     2.72    3.00   0.050
8640                 5.8737   5.8604   0.0180     2.72    2.73   0.008
17280                5.8737   5.8655   0.0127     2.72    2.73   0.006

As expected, the entropy, i.e. the information content, increases with increasing ${{\rm VOT}}$, i.e. with the quantization resolution. The distribution of the entropy $H$ allows us to evaluate the rms of $C_{{\rm r}}$ expected from different realizations of the data stream:

\begin{displaymath}
{\rm RMS}(C_{{\rm r}}) \approx C_{{\rm r}} \frac{{\rm RMS}(H)}{H}.
\end{displaymath} (10)
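Equation (10) can be read as first-order error propagation. The short derivation below assumes $C_{{\rm r}} = N_{{\rm bits}}/H$ with $N_{{\rm bits}} = 16$; this relation is not stated explicitly here, but it is consistent with the values in Table 2. The last line checks the formula against the first row of Table 2.

```latex
\begin{align*}
C_{\rm r} &= \frac{N_{\rm bits}}{H}, \\
\delta C_{\rm r} &\approx \left|\frac{\partial C_{\rm r}}{\partial H}\right| \delta H
  = \frac{N_{\rm bits}}{H^{2}}\,\delta H
  = C_{\rm r}\,\frac{\delta H}{H}, \\
{\rm RMS}(C_{\rm r}) &\approx 4.49 \times \frac{0.1989}{3.5596} \approx 0.251
  \quad (\mbox{30 GHz white noise, } L_{\rm chunk} = 16).
\end{align*}
```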

Since data will be packed in chunks of finite length, it is important to study not only the entropy of the entire data stream, which indicates its overall compressibility, but also the entropy distribution for short packets of fixed length. Each data stream was therefore split into an integer number of chunks of fixed length $L_{{\rm chunk}}$. The entropy was measured for each chunk, and the resulting distribution of entropies for the given $L_{{\rm chunk}}$ was characterized by its mean and rms. We take $L_{{\rm chunk}} = 16$, 32, 64, 135, 8640 and 17280 16-bit samples, so each simulated $8640 \times 60$ data stream is split into $32\, 400$, $16\, 200$, 8100, 3840, 60 and 30 chunks, respectively. Small chunk sizes are introduced to study the entropy distribution as seen by most real compressors, which do not compress one circle (8640 samples) at a time. The distributions for long chunks are useful for understanding the entropy distribution of the overall data stream.
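The per-chunk analysis can be sketched as below. This is a hedged illustration, not the original code: the data stream is replaced by hypothetical white noise with an rms of ten quantization steps, "rms" is taken to mean the standard deviation over chunks, $C_{{\rm r}} = 16/H$ is assumed, and the function names are invented for this example. The numbers it prints are not expected to reproduce Table 2.

```python
import math
import random
import statistics
from collections import Counter

def entropy_bits(symbols) -> float:
    """Shannon entropy -sum_s f_s log2 f_s, in bits per symbol."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def chunk_entropy_stats(samples, chunk_len: int):
    """Number of whole chunks, mean and standard deviation of the per-chunk entropy."""
    n_chunks = len(samples) // chunk_len  # any trailing partial chunk is dropped
    h = [entropy_bits(samples[i * chunk_len:(i + 1) * chunk_len])
         for i in range(n_chunks)]
    return n_chunks, statistics.fmean(h), statistics.pstdev(h)

rng = random.Random(0)
sigma_adu = 10.0  # assumed noise rms, in quantization steps
stream = [round(rng.gauss(0.0, sigma_adu)) for _ in range(8640 * 60)]

print(" L_chunk chunks  mean H   rms H   C_r  rms C_r")
for chunk_len in (16, 32, 64, 135, 8640, 17280):
    n, mean_h, rms_h = chunk_entropy_stats(stream, chunk_len)
    cr = 16.0 / mean_h                 # assumed conversion from entropy to C_r
    rms_cr = cr * rms_h / mean_h       # first-order propagation, as in Eq. (10)
    print(f"{chunk_len:8d} {n:6d} {mean_h:7.4f} {rms_h:7.4f} {cr:5.2f} {rms_cr:7.3f}")
```

Because a chunk of $L$ samples contains at most $L$ distinct symbols, its entropy cannot exceed $\log_2 L$, which is one reason the mean entropy drops for short chunks.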

The entropy distribution per chunk is approximately described by a normal distribution (see Fig. 4), so the mean entropy and its rms are enough to characterize the results.

Figure 4: The entropy distribution per chunk for $L_{{\rm chunk}} = 64$ samples, for the full signal at 30 GHz

Note that the corresponding distribution of compression rates is not exactly normal; for the sake of this analysis, however, we assume that the $C_{{\rm r}}$ distribution is normal as well.

The mean entropy measured over one scan circle ($L_{{\rm chunk}} = 8640$ samples) coincides with the entropy measured for the full set of 60 scan circles, the entropy rms being of the order of $10^{-2}$ bits. Consequently, the expected rms of $C_{{\rm r}}$ when compressing one or more circles at a time will be less than $1\%$.

The mean entropy and its rms are not independent quantities. The averaged entropy decreases as $L_{{\rm chunk}}$ decreases, while the entropy rms correspondingly increases. As a consequence, the averaged $C_{{\rm r}}$ increases as $L_{{\rm chunk}}$ decreases (see Table 2), but the fraction of chunks in which the compressor performs significantly worse than average also increases, and this affects the overall compression rate, i.e. the $C_{{\rm r}}$ referred to the full mission.

5.3 Normality tests

Since a normal distribution of the signal is assumed in Sect. 4, it is of interest to establish how much the digitized signal distribution deviates from normality. It is also important to characterize the influence of the 1/f noise and of the other signal components, especially the cosmic dipole, on the origin of such deviations. To obtain an efficient compression, the samples should be as statistically uncorrelated and as close to normally distributed as possible. In addition, one should make sure that the detection chain does not cause any systematic effect that introduces spurious non-normally distributed components. This is relevant not only for the compression problem itself, which among the data processing operations is the least sensitive to small deviations from the normal distribution, but also in view of the future data reduction, calibration and analysis, for which the hypothesis of normality of the signal distribution is very important in order to allow a good separation of the foreground components. Last but not least, the hypothesis that normality is conserved along the detection chain is important for the scientific interpretation of the results, since the accuracy expected from the PLANCK-LFI experiment should make it possible to verify whether the distribution of the CMB fluctuations at $l \gtrsim 14$ is really normal, as predicted by the standard inflationary models, or not, as recent 4-year COBE/DMR results seem to suggest (Bromley & Tegmark 1999; Ferreira et al. 1999).

For this reason a set of normality tests was applied to the different components of the simulated signal, before and after digitization, in order to characterize the signal statistics and their variation along the detection process. Of course this work may be regarded only as a first step in this direction: a true calibration of the signal statistics will be possible only when the front-end electronics simulator becomes available. These tests also serve as preparation for the study of the true signal.

Normality tests were applied to the same data streams used for data compression. Given the on-board memory limits, it is unlikely that more than a few circles at a time can be stored before compression, so the statistical tests were performed treating each data stream for a given pointing as a collection of 60 independent realizations of the same process. Of course this is only approximately true: the 1/f noise correlates subsequent scan circles, but since its rms amplitude per sample is typically about one tenth of the white noise rms or less, these correlations can be neglected in this analysis.

Starting from the folded data streams, a given normality test was applied to the set of 60 realizations of each of the 8640 samples, transforming the stream of samples into a stream of test results for the given test. The cumulative frequency distribution was then computed over the 8640 test results. Since 60 samples do not constitute a large statistical sample, significant deviations from the theoretically evaluated confidence levels are expected, resulting in excessive rejection or acceptance rates. For this reason each test was calibrated by applying it to the undigitized white noise data stream. Moreover, in order to analyze how normality evolves with increasing signal complexity, the tests were repeated while increasing the information content of the generated data stream.

To simplify the discussion we take as a reference test the usual Kolmogorov-Smirnov D test from Press et al. (1986), and we fix a $95\%$ acceptance level. The test was "calibrated" using the Monte Carlo white noise generator of our mission simulator, in order to fix the threshold level $D_{{\rm th}}$ as the D value for which at least $95\%$ of our samples show $D \leq D_{{\rm th}}$. The quantization effect is evident from Table 3: at twice the nominal quantization step (${{\rm VOT}} = 2$ V/K), in $30\%$ of the samples (i.e. 2592 samples) the distribution of realizations deviates from a normal distribution ($D > D_{{\rm th}}$).


Table 3: Quantization effect on the Kolmogorov-Smirnov D test applied to simulated data; $\Delta $ is the quantization step

$\Delta $ (mK/adu)                                                   1.220    0.610    0.406
${\mathcal{F}}(D < 0.1475, \; {\rm White\;Noise})$                   0.28     0.70     0.84
${\mathcal{F}}(D < 0.1475, \; {\rm Signal})$                         0.27     0.71     0.86
$D_{{\rm 95}}^{{\rm Q}} $                                            0.2449   0.1851   0.1678
${\mathcal{F}}(D < D_{{\rm 95}}^{{\rm Q}}, \; {\rm Signal})$         0.95     0.95     0.95

Since the theoretical compression rate from Eq. (9) holds for a continuous distribution of levels ($\sigma \gg \Delta$), a smaller $C_{{\rm r}}$ should be expected. Since the deviation from the normal distribution is a systematic effect, for the sake of cosmological data analysis one may tune the D test to take the quantization into account. As an example, the third line in Table 3 reports the threshold for the quantized signal, $D_{{\rm 95}}^{{\rm Q}}$, for which $95\%$ of the quantized white noise samples are accepted as normally distributed. The line below gives the success rate for the full quantized signal. After the recalibration the test recognizes that in $95\%$ of the cases the signal is drawn from a normal distribution, but at the cost of an increase in the threshold D, which is now a function of the quantization step $\Delta $.
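The calibration and recalibration of the D test described above can be sketched as follows. This is a sketch under stated assumptions, not the procedure of Press et al. (1986) as used in the paper: the D statistic is computed against a normal distribution with the sample mean and standard deviation, the white noise has unit rms, the quantization steps are expressed in units of that rms, and only 2000 samples (instead of 8640) are used to keep the run short. All function names are hypothetical.

```python
import math
import random

def ks_d_normal(values) -> float:
    """KS D statistic against a normal with the sample mean and std (assumed choice)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    if std == 0.0:
        return 1.0  # degenerate set: treated as maximally non-normal
    d = 0.0
    for i, v in enumerate(sorted(values)):
        cdf = 0.5 * (1.0 + math.erf((v - mean) / (std * math.sqrt(2.0))))
        d = max(d, cdf - i / n, (i + 1) / n - cdf)
    return d

def threshold_95(d_values) -> float:
    """Smallest D value such that at least 95% of the entries satisfy D <= threshold."""
    d_sorted = sorted(d_values)
    return d_sorted[(95 * len(d_sorted) + 99) // 100 - 1]

def fraction_accepted(d_values, d_th) -> float:
    return sum(d <= d_th for d in d_values) / len(d_values)

def quantize(values, step):
    return [step * round(v / step) for v in values]

rng = random.Random(42)
n_samples, n_real = 2000, 60
# One set of 60 realizations per sample position along the scan circle.
columns = [[rng.gauss(0.0, 1.0) for _ in range(n_real)] for _ in range(n_samples)]

d_white = [ks_d_normal(col) for col in columns]
d_th = threshold_95(d_white)  # calibration on undigitized white noise
print("D_th from undigitized white noise:", round(d_th, 4))

for step in (0.1, 0.2, 0.4):  # assumed quantization steps, in units of the noise rms
    d_q = [ks_d_normal(quantize(col, step)) for col in columns]
    print(step, round(fraction_accepted(d_q, d_th), 3), round(threshold_95(d_q), 4))
```

For each step the loop prints the fraction of samples accepted with the undigitized threshold, followed by the recalibrated threshold that restores a $95\%$ acceptance rate for the quantized noise.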

As for the entropy distribution and the binary statistics, in this case too most of the differences between the results obtained for a pure white noise signal and for the full signal are explained by the presence of the cosmological dipole. These simulations are not accurate enough to draw quantitative conclusions about the distortion of the sampling statistics induced by digitization, but they suggest that approximating the instrumental signal as quantized white noise plus a cosinusoidal term associated with the cosmic dipole is more than adequate for understanding the optimal lossless compression rate achievable for the PLANCK-LFI mission.


Copyright The European Southern Observatory (ESO)