A realistic estimate of the compression efficiency must be based on a quantitative analysis of the signal statistics, which includes the statistics of the binary representation (Sect. 5.1), the entropy (Sect. 5.2) and normality tests (Sect. 5.3).
Figure 3 shows the frequency distribution of symbols when the full data stream of 60 scan circles is divided into 8-bit words. Since for most of the samples the range spans only about 32 levels (5 bits), the bytes corresponding to the MSB words assume a limited range of values, producing the narrow spike in the figure.
From the distribution in Fig. 3 one may wonder whether a more effective compression could be obtained by splitting the data stream into two substreams: the MSB substream (with compression rate $C_{\rm r,MSB}$) and the LSB substream (with compression rate $C_{\rm r,LSB}$). Since the two components are so different in their statistics, with the MSB substream having a higher level of redundancy than the original data stream, it would be reasonable to expect the final compression rate to be greater than the rate obtained by compressing the original data stream directly. We tested this procedure with some of the compressors considered for the final test. From these tests it is clear that $C_{\rm r,MSB} > C_{\rm r}$, but since most of the redundancy of the original data stream is contained in the MSB substream, the LSB substream cannot be compressed effectively, so that $C_{\rm r,LSB} < C_{\rm r}$ and the overall compression rate of the split streams does not exceed $C_{\rm r}$. Hence the best way to perform an efficient compression is to apply the compressor to the full stream without performing the MSB/LSB separation. Beyond these theoretical considerations, the tests performed on our simulated data stream confirm this result.
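As a rough illustration of this point, the sketch below (our own example: zlib is used as a stand-in compressor, and the noise level and ADC offset are arbitrary assumptions, not the compressors or signal parameters actually used in the paper) splits a simulated 16-bit stream into MSB and LSB byte substreams and compares the compression of each against the compression of the full stream.

```python
# Illustrative only: compare compressing the full 16-bit stream against
# compressing the MSB and LSB byte substreams separately, with zlib as a
# stand-in compressor.
import zlib
import numpy as np

rng = np.random.default_rng(0)
# Simulated quantized signal: white noise with rms ~10 quantization steps,
# stored as unsigned 16-bit ADC-like words (offset keeps the values positive).
samples = rng.normal(loc=2**11, scale=10.0, size=8640 * 60).astype(np.uint16)

full = samples.tobytes()
msb = (samples >> 8).astype(np.uint8).tobytes()    # most significant bytes
lsb = (samples & 0xFF).astype(np.uint8).tobytes()  # least significant bytes

def c_r(raw: bytes) -> float:
    """Compression rate = uncompressed size / compressed size."""
    return len(raw) / len(zlib.compress(raw, 9))

print("full stream  :", round(c_r(full), 2))
print("MSB substream:", round(c_r(msb), 2))
print("LSB substream:", round(c_r(lsb), 2))
# Typically the MSB substream compresses very well, the LSB substream hardly
# at all, and the combined MSB+LSB output is no smaller than what is obtained
# by compressing the full interleaved stream directly.
```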
Equation (9) is valid in the limit of a continuous distribution of quantization levels. Since in our case the quantization step is about one tenth of the signal rms, this is no longer true. To properly estimate the maximum compression rate attainable from these data we evaluate the entropy of the discretized signal using different values of the chunk length.
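For reference, the fine-quantization limit for a Gaussian signal of rms $\sigma$ sampled with quantization step $\Delta V$ is the standard expression below; we assume this is the limit that Eq. (9) refers to, since the equation itself is not reproduced here.

```latex
% Fine-quantization (continuous) limit of the entropy of a Gaussian signal
% of rms sigma quantized with step Delta V (assumed form of the limit
% underlying Eq. 9):
H \;\simeq\; \log_2\!\left(\sqrt{2\pi e}\,\frac{\sigma}{\Delta V}\right)
\quad \mathrm{bits\ per\ sample},
\qquad \text{valid only for } \Delta V \ll \sigma .
```

With $\Delta V \simeq \sigma/10$ this gives about 5.4 bits per sample, of the same order as the total entropies listed in Table 2.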
Our entropy evaluation code takes the input data stream, determines the frequency $f_s$ of each symbol $s$ in the quantized data stream, and computes the entropy as $H = -\sum_s f_s \log_2 f_s$, where $s$ is the symbol index. In our simulation we take both 8-bit and 16-bit symbols ($s$ spanning 0, ..., 255 and 0, ..., 65535, respectively). Since in our scheme the ADC output is 16 bits, we considered the 8-bit symbol entropy both for the LSB and the MSB 8-bit words, and the 8-bit entropy after merging the LSB and MSB significant bit sets.
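A minimal sketch of this entropy estimate (our own illustration in Python with synthetic data, not the authors' code) could look as follows.

```python
# Empirical entropy of a quantized symbol stream: H = -sum_s f_s * log2(f_s).
import numpy as np

def entropy_bits(symbols: np.ndarray) -> float:
    """Entropy (bits/symbol) from the symbol frequencies of an integer stream."""
    _, counts = np.unique(symbols, return_counts=True)
    f = counts / counts.sum()
    return float(-(f * np.log2(f)).sum())

# Synthetic 16-bit ADC words (arbitrary offset and rms) and derived 8-bit streams.
adc = np.random.default_rng(1).normal(2**11, 10.0, 8640 * 60).astype(np.uint16)
h16      = entropy_bits(adc)            # 16-bit symbols, s = 0..65535
h8_msb   = entropy_bits(adc >> 8)       # 8-bit symbols from the MSB words
h8_lsb   = entropy_bits(adc & 0xFF)     # 8-bit symbols from the LSB words
h8_bytes = entropy_bits(adc.view(np.uint8))  # interleaved byte stream (cf. Fig. 3)
print(h16, h8_msb, h8_lsb, h8_bytes)
```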
As expected, since the offset merely shifts the quantized signal distribution, the entropy does not depend on it. For this reason we take the offset equal to 0 V, i.e., no shift.
Table 2 reports the 16-bit entropy as a function of the chunk length, signal composition and frequency, together with the corresponding compression rate $C_{\rm r} = 16\,{\rm bits}/H$.
30 GHz, White Noise
Chunk length (samples) | Entropy Total (bits) | Entropy Mean (bits) | Entropy RMS (bits) | $C_{\rm r}$ Total | $C_{\rm r}$ Mean | $C_{\rm r}$ RMS
16 | 5.1618 | 3.5596 | 0.1989 | 3.10 | 4.49 | 0.251
32 | 5.1618 | 4.1815 | 0.1658 | 3.10 | 3.83 | 0.152
64 | 5.1618 | 4.6108 | 0.1262 | 3.10 | 3.47 | 0.095
135 | 5.1618 | 4.8791 | 0.0890 | 3.10 | 3.28 | 0.060
8640 | 5.1618 | 5.1561 | 0.0114 | 3.10 | 3.10 | 0.007
17280 | 5.1618 | 5.1589 | 0.0061 | 3.10 | 3.10 | 0.004

30 GHz, Full Signal
Chunk length (samples) | Entropy Total (bits) | Entropy Mean (bits) | Entropy RMS (bits) | $C_{\rm r}$ Total | $C_{\rm r}$ Mean | $C_{\rm r}$ RMS
16 | 5.5213 | 3.5602 | 0.1982 | 2.90 | 4.49 | 0.250
32 | 5.5213 | 4.1849 | 0.1664 | 2.90 | 3.82 | 0.152
64 | 5.5213 | 4.6162 | 0.1278 | 2.90 | 3.47 | 0.096
135 | 5.5213 | 4.8885 | 0.0893 | 2.90 | 3.27 | 0.060
8640 | 5.5213 | 5.5119 | 0.0176 | 2.90 | 2.90 | 0.009
17280 | 5.5213 | 5.5157 | 0.0118 | 2.90 | 2.90 | 0.006

100 GHz, White Noise
Chunk length (samples) | Entropy Total (bits) | Entropy Mean (bits) | Entropy RMS (bits) | $C_{\rm r}$ Total | $C_{\rm r}$ Mean | $C_{\rm r}$ RMS
16 | 5.7436 | 3.6962 | 0.1740 | 2.79 | 4.33 | 0.204
32 | 5.7436 | 4.4174 | 0.1521 | 2.79 | 3.62 | 0.125
64 | 5.7436 | 4.9627 | 0.1230 | 2.79 | 3.22 | 0.080
135 | 5.7436 | 5.3354 | 0.0875 | 2.79 | 3.00 | 0.049
8640 | 5.7436 | 5.7352 | 0.0115 | 2.79 | 2.79 | 0.006
17280 | 5.7436 | 5.7394 | 0.0063 | 2.79 | 2.79 | 0.003

100 GHz, Full Signal
Chunk length (samples) | Entropy Total (bits) | Entropy Mean (bits) | Entropy RMS (bits) | $C_{\rm r}$ Total | $C_{\rm r}$ Mean | $C_{\rm r}$ RMS
16 | 5.8737 | 3.6970 | 0.1734 | 2.72 | 4.33 | 0.203
32 | 5.8737 | 4.4186 | 0.1526 | 2.72 | 3.62 | 0.125
64 | 5.8737 | 4.9655 | 0.1224 | 2.72 | 3.22 | 0.079
135 | 5.8737 | 5.3419 | 0.0887 | 2.72 | 3.00 | 0.050
8640 | 5.8737 | 5.8604 | 0.0180 | 2.72 | 2.73 | 0.008
17280 | 5.8737 | 5.8655 | 0.0127 | 2.72 | 2.73 | 0.006
The entropy distribution per chunk is approximately described by a normal distribution (see Fig. 4), so the mean entropy and its rms are enough to characterize the results.
The mean entropy measured over one scan circle (8640 samples) coincides with the entropy measured for the full set of 60 scan circles, the entropy rms being of the order of $10^{-2}$ bits. Consequently the rms expected when compressing one or more circles at a time will be correspondingly small.
The mean entropy and its rms are not independent quantities: the average entropy decreases as the chunk length decreases, but correspondingly the entropy rms increases. As a consequence, the compression rate expected for an average chunk improves for shorter chunks, while the fraction of chunks in which the compressor performs significantly worse than average increases; the overall compression rate, i.e. the one referred to the full mission, is affected by both effects.
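The chunking effect can be reproduced with a short sketch (again our own illustration with synthetic noise; the chunk lengths are those of Table 2).

```python
# Illustrative procedure: split the quantized stream into chunks of a given
# length, evaluate the entropy of each chunk, and report the mean and rms
# over chunks, as in Table 2.
import numpy as np

def entropy_bits(symbols):
    _, counts = np.unique(symbols, return_counts=True)
    f = counts / counts.sum()
    return -(f * np.log2(f)).sum()

stream = np.random.default_rng(2).normal(2**11, 10.0, 8640 * 60).astype(np.uint16)

for n_chunk in (16, 32, 64, 135, 8640, 17280):
    n_chunks = stream.size // n_chunk
    chunks = np.split(stream[: n_chunks * n_chunk], n_chunks)
    h = [entropy_bits(c) for c in chunks]
    print(n_chunk, round(np.mean(h), 4), round(np.std(h), 4))
# Short chunks cannot populate many distinct symbols, so their mean entropy
# drops (predicting a higher compression rate), while the scatter from chunk
# to chunk grows.
```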
Since a normal distribution of the signal is assumed in Sect. 4, it is interesting to assess how much the digitized signal distribution deviates from normality. It is also important to characterize the influence of the 1/f noise and of the other signal components, especially the cosmic dipole, in the genesis of such deviations. To obtain an efficient compression the samples should be as statistically uncorrelated and as close to normally distributed as possible. In addition, one should make sure that the detection chain does not introduce any systematic effect producing spurious, non-normally distributed components. This is relevant not only for the compression problem itself, which is among the data processing operations the least sensitive to small deviations from a normal distribution, but also in view of the future data reduction, calibration and analysis, for which the hypothesis of normality of the signal distribution is very important in order to allow a good separation of the foreground components. Last but not least, the hypothesis that normality is preserved along the detection chain is important for the scientific interpretation of the results, since the accuracy expected from the PLANCK-LFI experiment should allow one to verify whether the distribution of the CMB fluctuations is really normal, as predicted by the standard inflationary models, or not, as seems to be suggested by recent 4-year COBE/DMR results (Bromley & Tegmark 1999; Ferreira et al. 1999).
For this reason a set of normality tests was applied to the different components of the simulated signal, before and after digitization, in order to characterize the signal statistics and its variation along the detection process. Of course this work may be regarded only as a first step in this direction: a true calibration of the signal statistics will be possible only when the front-end electronics simulator becomes available. These tests also serve as a preparation for the study of the real signal.
Normality tests were applied to the same data streams used for data compression. Given the on-board memory limits, it is unlikely that more than a few circles at a time can be stored before compression, so the statistical tests were performed regarding each data stream for a given pointing as a collection of 60 independent realizations of the same process. Of course this is only approximately true: the 1/f noise correlates subsequent scan circles, but since its rms amplitude per sample is typically about one tenth of the white noise rms or less, these correlations can be neglected in this analysis.
Starting from the folded data streams, a given normality test was applied to the set of 60 realizations of each of the 8640 samples, transforming the stream of samples into a stream of test results for the given test. The cumulative frequency distribution was then computed over the 8640 test results. Since 60 realizations do not represent a large statistical sample, significant deviations from theoretically evaluated confidence levels are expected, resulting in excessive rejection or acceptance rates. For this reason each test was calibrated by applying it to the undigitized white noise data stream. Moreover, in order to analyze how the normality evolves as the signal complexity increases, the tests were repeated while increasing the information content of the generated data stream.
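A sketch of this per-sample testing scheme is given below; it uses scipy's Kolmogorov-Smirnov routine as a stand-in for the Press et al. (1986) D test, and synthetic Gaussian data in place of the simulated mission stream, so it should be read as an illustration of the procedure rather than the actual analysis code.

```python
# Per-sample normality testing: regard each of the 8640 sky pointings as a
# set of 60 realizations (one per scan circle) and compute a K-S D value.
import numpy as np
from scipy import stats

n_circles, n_samples = 60, 8640
rng = np.random.default_rng(3)
# Folded data stream: one row per scan circle, one column per sky pointing.
folded = rng.normal(0.0, 1.0, (n_circles, n_samples))

d_values = np.empty(n_samples)
for j in range(n_samples):
    x = folded[:, j]
    # Compare against a normal distribution with the column's own mean and rms.
    d_values[j], _ = stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1)))

# Cumulative frequency distribution of the 8640 D values; because the
# parameters are estimated from only 60 realizations, the acceptance
# threshold is calibrated by Monte Carlo rather than taken from theory.
d_sorted = np.sort(d_values)
cumulative = np.arange(1, n_samples + 1) / n_samples
```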
To simplify the discussion we considered as a reference test the usual Kolmogorov-Smirnov D test from Press et al. (1986), and we fixed a 95% acceptance level. The test was "calibrated" using the Monte-Carlo white noise generator of our mission simulator, fixing the threshold level $D_{\rm th}$ as the D value for which more than 95% of our samples show $D < D_{\rm th}$.
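The calibration step could be sketched as follows (illustrative only: the use of scipy, the synthetic noise and the function name are our assumptions, not the authors' implementation).

```python
# Calibration of the D threshold: generate pure white noise, compute the
# per-sample D values as above, and take D_th as the value below which
# 95% of the samples fall.
import numpy as np
from scipy import stats

def d_per_sample(folded: np.ndarray) -> np.ndarray:
    """K-S D statistic over the realizations (rows) of each sample (column)."""
    out = np.empty(folded.shape[1])
    for j in range(folded.shape[1]):
        x = folded[:, j]
        out[j] = stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1))).statistic
    return out

white = np.random.default_rng(4).normal(0.0, 1.0, (60, 8640))
d_white = d_per_sample(white)
d_th = np.quantile(d_white, 0.95)   # 95% of white-noise samples have D < d_th

# A digitized stream is then flagged as deviating from normality wherever
# its D value exceeds this calibrated threshold.
```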
From Table 3 the quantization effect is evident: at twice the nominal quantization step, in about 30% of the samples (i.e. 2592 samples) the distribution of realizations deviates from a normal distribution ($D > D_{\rm th}$).
Quantization step (mK/adu) | 1.220 | 0.610 | 0.406
Fraction of samples with $D < D_{\rm th}$ (white noise) | 0.28 | 0.70 | 0.84
Fraction of samples with $D < D_{\rm th}$ (full signal) | 0.27 | 0.71 | 0.86
$D_{\rm th}$ | 0.2449 | 0.1851 | 0.1678
Acceptance level | 0.95 | 0.95 | 0.95
As for the entropy distribution and the binary statistics, also in this case most of the differences between the results obtained for a pure white noise signal and for the full signal are explained by the presence of the cosmological dipole. These simulations are not accurate enough to draw quantitative conclusions about the distortion of the sampling statistics induced by digitization, but they suggest that approximating the instrumental signal as quantized white noise plus a cosinusoidal term associated with the cosmic dipole is more than adequate for understanding the optimal lossless compression rate achievable in the case of the PLANCK-LFI mission.