9 ICA by Maximum Likelihood Estimation
A very popular approach for estimating the independent component analysis (ICA)
model is maximum likelihood (ML) estimation. Maximum likelihood estimation is
a fundamental method of statistical estimation; a short introduction was provided in
Section 4.5. One interpretation of ML estimation is that we take those parameter
values as estimates that give the highest probability for the observations. In this
chapter, we show how to apply ML estimation to ICA. We also show its close
connection to the neural network principle of maximization of information flow
(infomax).
9.1 THE LIKELIHOOD OF THE ICA MODEL
9.1.1 Deriving the likelihood
It is not difficult to derive the likelihood in the noise-free ICA model. This is based
on using the well-known result on the density of a linear transform, given in (2.82).
According to this result, the density of the mixture vector
$\mathbf{x} = \mathbf{A}\mathbf{s}$ (9.1)
can be formulated as
$p_{\mathbf{x}}(\mathbf{x}) = |\det \mathbf{B}|\, p_{\mathbf{s}}(\mathbf{s}) = |\det \mathbf{B}| \prod_i p_i(s_i)$ (9.2)
where $\mathbf{B} = \mathbf{A}^{-1}$, and the $p_i$ denote the densities of the independent components.
This can be expressed as a function of $\mathbf{B} = (\mathbf{b}_1, \ldots, \mathbf{b}_n)^T$ and $\mathbf{x}$, giving
$p_{\mathbf{x}}(\mathbf{x}) = |\det \mathbf{B}| \prod_i p_i(\mathbf{b}_i^T \mathbf{x})$ (9.3)
Assume that we have $T$ observations of $\mathbf{x}$, denoted by $\mathbf{x}(1), \mathbf{x}(2), \ldots, \mathbf{x}(T)$. Then the
likelihood can be obtained (see Section 4.5) as the product of this density evaluated
at the $T$ points. This is denoted by $L$ and considered as a function of $\mathbf{B}$:
$L(\mathbf{B}) = \prod_{t=1}^{T} \prod_{i=1}^{n} p_i(\mathbf{b}_i^T \mathbf{x}(t))\, |\det \mathbf{B}|$ (9.4)
Very often it is more practical to use the logarithm of the likelihood, since it is
algebraically simpler. This does not make any difference here since the maximum of
the logarithm is obtained at the same point as the maximum of the likelihood. The
log-likelihood is given by
$\log L(\mathbf{B}) = \sum_{t=1}^{T} \sum_{i=1}^{n} \log p_i(\mathbf{b}_i^T \mathbf{x}(t)) + T \log |\det \mathbf{B}|$ (9.5)
The base of the logarithm makes no difference, though in the following the natural
logarithm is used.
To simplify notation and to make it consistent with what was used in the previous
chapter, we can denote the sum over the sample index by an expectation operator,
and divide the likelihood by $T$ to obtain
$\frac{1}{T} \log L(\mathbf{B}) = E\left\{ \sum_{i=1}^{n} \log p_i(\mathbf{b}_i^T \mathbf{x}) \right\} + \log |\det \mathbf{B}|$ (9.6)
The expectation here is not the theoretical expectation, but an average computed from
the observed sample. Of course, in the algorithms the expectations are eventually
replaced by sample averages, so the distinction is purely theoretical.
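To make the formula concrete, the following NumPy sketch (not from the book) evaluates the normalized log-likelihood (9.6) for a candidate separating matrix; the function name and the Laplacian choice for the component densities are illustrative assumptions.

```python
import numpy as np

def ica_log_likelihood(B, X, log_pdf=None):
    """Normalized log-likelihood (9.6) of the ICA model.

    B : (n, n) candidate separating matrix (inverse of the mixing matrix).
    X : (n, T) data matrix, one observation x(t) per column.
    log_pdf : log p_i(s); a unit-variance Laplacian log-density is used
              by default (an illustrative choice, not prescribed by the book).
    """
    if log_pdf is None:
        # Laplacian with unit variance: p(s) = (1/sqrt(2)) exp(-sqrt(2)|s|)
        log_pdf = lambda s: -np.sqrt(2.0) * np.abs(s) - 0.5 * np.log(2.0)
    Y = B @ X                                            # components b_i^T x(t)
    avg_log_p = np.mean(np.sum(log_pdf(Y), axis=0))      # E{ sum_i log p_i(b_i^T x) }
    return avg_log_p + np.log(np.abs(np.linalg.det(B)))

# Example: two Laplacian sources mixed by a random matrix
rng = np.random.default_rng(0)
S = rng.laplace(size=(2, 10000)) / np.sqrt(2.0)          # unit-variance sources
A = rng.normal(size=(2, 2))
X = A @ S
print(ica_log_likelihood(np.linalg.inv(A), X))           # true separating matrix
print(ica_log_likelihood(np.eye(2), X))                  # typically a lower value
```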
9.1.2 Estimation of the densities
Problem of semiparametric estimation
In the preceding, we have expressed
the likelihood as a function of the parameters of the model, which are the elements
of the mixing matrix. For simplicity, we used the elements of the inverse of the
mixing matrix. This is allowed since the mixing matrix can be directly computed
from its inverse.
There is another thing to estimate in the ICA model, though. This is the densities of
the independent components. Actually, the likelihood is a function of these densities
as well. This makes the problem much more complicated, because the estimation
of densities is, in general, a nonparametric problem. Nonparametric means that it
cannot be reduced to the estimation of a finite parameter set. In fact the number of
parameters to be estimated is infinite, or in practice, very large. Thus the estimation
of the ICA model has also a nonparametric part, which is why the estimation is
sometimes called “semiparametric”.
Nonparametric estimation of densities is known to be a difficult problem. Many
parameters are always more difficult to estimate than just a few; since nonparametric
problems have an infinite number of parameters, they are the most difficult to estimate.
This is why we would like to avoid nonparametric density estimation in ICA.
There are two ways to avoid it.
First, in some cases we might know the densities of the independent components
in advance, using some prior knowledge on the data at hand. In this case, we could
simply use these prior densities in the likelihood. Then the likelihood would really
be a function of $\mathbf{B}$ only. If reasonably small errors in the specification of these prior
densities have little influence on the estimator, this procedure will give reasonable
results. In fact, it will be shown below that this is the case.
A second way to solve the problem of density estimation is to approximate the
densities of the independent components by a family of densities that are specified
by a limited number of parameters. If the number of parameters in the density family
needs to be very large, we do not gain much from this approach, since the goal was
to reduce the number of parameters to be estimated. However, if it is possible to use
a very simple family of densities to estimate the ICA model for any densities $p_i$, we
will get a simple solution. Fortunately, this turns out to be the case. We can use an
extremely simple parameterization of the $p_i$, consisting of the choice between two
densities, i.e., a single binary parameter.
A simple density family
It turns out that in maximum likelihood estimation, it is
enough to use just two approximations of the density of an independent component.
For each independent component, we just need to determine which one of the two
approximations is better. This shows that, first, we can make small errors when we
fix the densities of the independent components, since it is enough that we use a
density that is in the same half of the space of probability densities. Second, it shows
that we can estimate the independent components using very simple models of their
densities, in particular, using models consisting of only two densities.
This situation can be compared with the one encountered in Section 8.3.4, where
we saw that any nonlinearity can be seen to divide the space of probability distributions
in half. When the distribution of an independent component is in one of the halves,
the nonlinearity can be used in the gradient method to estimate that independent
component. When the distribution is in the other half, the negative of the nonlinearity
must be used in the gradient method. In the ML case, a nonlinearity corresponds to
a density approximation.
The validity of these approaches is shown in the following theorem, whose proof
can be found in the appendix. This theorem is basically a corollary of the stability
theorem in Section 8.3.4.
206
ICA BY MAXIMUM LIKELIHOOD ESTIMATION
Theorem 9.1 Denote by $\tilde{p}_i$ the assumed densities of the independent components,
and
$g_i(s_i) = \frac{\partial}{\partial s_i} \log \tilde{p}_i(s_i) = \frac{\tilde{p}_i'(s_i)}{\tilde{p}_i(s_i)}$ (9.7)
Constrain the estimates of the independent components $y_i = \mathbf{b}_i^T \mathbf{x}$ to be uncorrelated
and to have unit variance. Then the ML estimator is locally consistent, if the assumed
densities $\tilde{p}_i$ fulfill
$E\{ s_i g_i(s_i) - g_i'(s_i) \} > 0$ (9.8)
for all $i$.
This theorem shows rigorously that small misspecifications in the densities $\tilde{p}_i$ do
not affect the local consistency of the ML estimator, since sufficiently small changes
do not change the sign in (9.8).
Moreover, the theorem shows how to construct families consisting of only two
densities, so that the condition in (9.8) is true for one of these densities. For example,
consider the following log-densities:
$\log \tilde{p}_1(s) = \alpha_1 - 2 \log \cosh(s)$ (9.9)
$\log \tilde{p}_2(s) = \alpha_2 - \left[ s^2/2 - \log \cosh(s) \right]$ (9.10)
where $\alpha_1, \alpha_2$ are parameters that are fixed so as to make these two functions
logarithms of probability densities. Actually, these constants can be ignored in the
following. The factor 2 in (9.9) is not important, but it is usually used here; also, the
corresponding factor in (9.10) could be changed.
The motivation for these functions is that $\tilde{p}_1$ is a supergaussian density, because
the $\log \cosh$ function is close to the absolute value function that would give the Laplacian
density. The density given by $\tilde{p}_2$ is subgaussian, because it is like a gaussian log-
density, plus a constant, that has been somewhat "flattened" by the $\log \cosh$ function.
Simple computations show that the value of the nonpolynomial moment in (9.8)
is, for $\tilde{p}_1$,
$2 E\{ -\tanh(s_i) s_i + (1 - \tanh^2(s_i)) \}$ (9.11)
and for $\tilde{p}_2$ it is
$E\{ \tanh(s_i) s_i - (1 - \tanh^2(s_i)) \}$ (9.12)
since the derivative of $\tanh(s)$ equals $1 - \tanh^2(s)$, and $E\{s_i^2\} = 1$ by definition.
We see that the signs of these expressions are always opposite. Thus, for practically
any distribution of the $s_i$, one of these functions fulfills the condition, i.e., has the
desired sign, and estimation is possible. Of course, for some distribution of the $s_i$
the nonpolynomial moment in the condition could be zero, which corresponds to the
case of zero kurtosis in cumulant-based estimation; such cases can be considered to
be very rare.
Thus we can just compute the nonpolynomial moments for the two prior distribu-
tions in (9.9) and (9.10), and choose the one that fulfills the stability condition in (9.8).
This can be done on-line during the maximization of the likelihood. This always
provides a (locally) consistent estimator, and solves the problem of semiparametric
estimation.
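As a small illustration (my own, not the book's), the following sketch evaluates the two moments (9.11) and (9.12) on samples from a supergaussian (Laplacian) and a subgaussian (uniform) distribution; for each distribution exactly one of the two has the positive sign required by (9.8).

```python
import numpy as np

def moment_p1(s):
    # Nonpolynomial moment (9.11), i.e., E{s g(s) - g'(s)} for the supergaussian log-density (9.9)
    return 2.0 * np.mean(-np.tanh(s) * s + (1.0 - np.tanh(s) ** 2))

def moment_p2(s):
    # Nonpolynomial moment (9.12) for the subgaussian log-density (9.10)
    return np.mean(np.tanh(s) * s - (1.0 - np.tanh(s) ** 2))

rng = np.random.default_rng(0)
laplacian = rng.laplace(size=100000) / np.sqrt(2.0)          # supergaussian, unit variance
uniform = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), 100000)   # subgaussian, unit variance

for name, s in [("laplacian", laplacian), ("uniform", uniform)]:
    print(name, moment_p1(s), moment_p2(s))
# For the Laplacian sample, (9.11) is positive; for the uniform sample, (9.12) is positive.
```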
In fact, the nonpolynomial moment in question measures the shape of the density
function in much the same way as kurtosis. For $g_i(s_i) = s_i^3$, we would actually
obtain kurtosis. Thus, the choice of nonlinearity could be compared with the choice
of whether to minimize or maximize kurtosis, as previously encountered in Section 8.2.
That choice was based on the sign of kurtosis; here we use the sign of a
nonpolynomial moment.
Indeed, the nonpolynomial moment of this chapter is the same as the one encoun-
tered in Section 8.3 when using more general measures of nongaussianity. However,
it must be noted that the set of nonlinearities that we can use here is more restricted
than those used in Chapter 8. This is because the nonlinearities $g_i$ used must correspond
to the derivative of the logarithm of a probability density function (pdf). For
example, we cannot use the function $g(y) = y^3$, because the corresponding pdf would
be of the form $\exp(y^4/4)$, and this is not integrable, i.e., it is not a pdf at all.
9.2 ALGORITHMS FOR MAXIMUM LIKELIHOOD ESTIMATION
To perform maximum likelihood estimation in practice, we need an algorithm to
perform the numerical maximization of likelihood. In this section, we discuss dif-
ferent methods to this end. First, we show how to derive simple gradient algorithms,
of which especially the natural gradient algorithm has been widely used. Then we
show how to derive a fixed-point algorithm, a version of FastICA, that maximizes the
likelihood faster and more reliably.
9.2.1 Gradient algorithms
The Bell-Sejnowski algorithm

The simplest algorithms for maximizing likelihood are obtained by gradient methods.
Using the well-known results in Chapter 3, one can easily derive the stochastic
gradient of the log-likelihood in (9.6) as:
$\frac{1}{T} \frac{\partial \log L}{\partial \mathbf{B}} = [\mathbf{B}^T]^{-1} + E\{ g(\mathbf{B}\mathbf{x}) \mathbf{x}^T \}$ (9.13)
Here, $g(\mathbf{y}) = (g_1(y_1), \ldots, g_n(y_n))^T$ is a component-wise vector function that consists
of the so-called (negative) score functions of the distributions of the $s_i$, defined as
$g_i = (\log p_i)' = \frac{p_i'}{p_i}$ (9.14)
This immediately gives the following algorithm for ML estimation:
$\Delta \mathbf{B} \propto [\mathbf{B}^T]^{-1} + E\{ g(\mathbf{B}\mathbf{x}) \mathbf{x}^T \}$ (9.15)
A stochastic version of this algorithm could be used as well. This means that the
expectation is omitted, and in each step of the algorithm, only one data point is used:
$\Delta \mathbf{B} \propto [\mathbf{B}^T]^{-1} + g(\mathbf{B}\mathbf{x}(t))\, \mathbf{x}(t)^T$ (9.16)
This algorithm is often called the Bell-Sejnowski algorithm. It was first derived in
[36], though from a different approach using the infomax principle that is explained
in Section 9.3 below.
The algorithm in Eq. (9.15) converges very slowly, however, especially due to
the inversion of the matrix $\mathbf{B}^T$ that is needed in every step. The convergence can be
improved by whitening the data, and especially by using the natural gradient.
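For concreteness, here is a minimal sketch of one batch update of the rule (9.15), using the score function of (9.18); the function name and learning rate are illustrative assumptions, not code from the book.

```python
import numpy as np

def bell_sejnowski_step(B, X, mu=0.1):
    """One batch update of (9.15): Delta B ~ (B^T)^{-1} + E{ g(Bx) x^T }.

    B : (n, n) current separating matrix, X : (n, T) data, mu : learning rate.
    Uses g(y) = -2 tanh(y), the score function of the log-density (9.9).
    """
    Y = B @ X
    g = -2.0 * np.tanh(Y)                                  # component-wise score function
    grad = np.linalg.inv(B.T) + (g @ X.T) / X.shape[1]     # expectation as a sample average
    return B + mu * grad
```

The explicit inverse of $\mathbf{B}^T$ in every step is exactly what makes this rule slow, as noted above.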
The natural gradient algorithm
The natural (or relative) gradient method simplifies
the maximization of the likelihood considerably, and makes it better conditioned.
The principle of the natural gradient is based on the geometrical structure of
the parameter space, and is related to the principle of the relative gradient, which uses
the Lie group structure of the ICA problem. See Chapter 3 for more details. In the
case of basic ICA, both of these principles amount to multiplying the right-hand side
of (9.15) by $\mathbf{B}^T \mathbf{B}$. Thus we obtain
$\Delta \mathbf{B} \propto (\mathbf{I} + E\{ g(\mathbf{y}) \mathbf{y}^T \})\, \mathbf{B}$ (9.17)
where $\mathbf{y} = \mathbf{B}\mathbf{x}$.
Interestingly, this algorithm can be interpreted as nonlinear decorrelation. This
principle will be treated in more detail in Chapter 12. The idea is that the algorithm
converges when $E\{ g(\mathbf{y}) \mathbf{y}^T \} = -\mathbf{I}$, which means that the $g(y_i)$ and $y_j$ are
uncorrelated for $i \neq j$. This is a nonlinear extension of the ordinary requirement
of uncorrelatedness, and, in fact, this algorithm is a special case of the nonlinear
decorrelation algorithms to be introduced in Chapter 12.
In practice, one can use, for example, the two densities described in Section 9.1.2.
For supergaussian independent components, the pdf defined by (9.9) is usually used.
This means that the component-wise nonlinearity is given by the tanh function:
$g(y) = -2\tanh(y)$ (9.18)
For subgaussian independent components, other functions must be used. For example,
one could use the pdf in (9.10), which leads to
$g(y) = \tanh(y) - y$ (9.19)
(Another possibility is to use $g(y) = -y^3$ for subgaussian components.) These
nonlinearities are illustrated in Fig. 9.1.
The choice between the two nonlinearities in (9.18) and (9.19) can be made by
computing the nonpolynomial moment
$E\{ -\tanh(s_i) s_i + (1 - \tanh^2(s_i)) \}$ (9.20)
ALGORITHMS FOR MAXIMUM LIKELIHOOD ESTIMATION
209
Fig. 9.1 The function $-2\tanh(y)$ in Eq. (9.18) and the function $\tanh(y) - y$ in Eq. (9.19), given by the solid line and the
dashed line, respectively.
using some estimates of the independent components. If this nonpolynomial moment
is positive, the nonlinearity in (9.18) should be used, otherwise the nonlinearity in
(9.19) should be used. This is because of the condition in Theorem 9.1.
The choice of nonlinearity can be made while running the gradient algorithm,
using the running estimates of the independent components to estimate the nature of
the independent components (that is, the sign of the nonpolynomial moment). Note
that the use of the nonpolynomial moment requires that the estimates of the independent
components are first scaled properly, constraining them to unit variance, as in the
theorem. Such normalizations are often omitted in practice, which may in some cases
lead to situations in which the wrong nonlinearity is chosen.
The resulting algorithm is recapitulated in Table 9.1. In this version, whitening and
the above-mentioned normalization in the estimation of the nonpolynomial moments
are omitted; in practice, these may be very useful.
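As a rough illustration of this procedure, the following batch sketch combines the natural gradient rule (9.17) with the moment-based choice of nonlinearity in (9.20); the function name, learning rate, and iteration count are my own assumptions, and whitening and normalization are omitted as in Table 9.1.

```python
import numpy as np

def natural_gradient_ica(X, mu=0.1, n_iter=500):
    """Batch natural gradient ML estimation, cf. (9.17)-(9.20) and Table 9.1.

    Whitened data and a small learning rate are recommended; normalization
    of the estimates before the moment check is omitted, as in Table 9.1.
    """
    n, T = X.shape
    B = np.eye(n)                                   # initial separating matrix
    for _ in range(n_iter):
        Y = B @ X
        # Nonpolynomial moment (9.20) for each estimated component
        gamma = np.mean(-np.tanh(Y) * Y + (1.0 - np.tanh(Y) ** 2), axis=1)
        # Choose the score per component: (9.18) if gamma_i > 0, else (9.19)
        G = np.where(gamma[:, None] > 0, -2.0 * np.tanh(Y), np.tanh(Y) - Y)
        # Natural gradient update (9.17), expectation as a sample average
        B = B + mu * (np.eye(n) + (G @ Y.T) / T) @ B
    return B
```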
9.2.2 A fast fixed-point algorithm
Likelihood can be maximized by a fixed-point algorithm as well. The fixed-point
algorithm given by FastICA is a very fast and reliable maximization method that was
introduced in Chapter 8 to maximize the measures of nongaussianity used for ICA
estimation. Actually, the FastICA algorithm can be directly applied to maximization
of the likelihood.
The FastICA algorithm was derived in Chapter 8 for optimization of $E\{ G(\mathbf{w}^T \mathbf{z}) \}$
under the constraint of the unit norm of $\mathbf{w}$. In fact, maximization of the likelihood gives
us an almost identical optimization problem, if we constrain the estimates of the
independent components to be white (see Chapter 7). In particular, this implies that
the term $\log |\det \mathbf{B}|$ is constant, as proven in the Appendix, and thus the likelihood
basically consists of the sum of terms of the form $E\{ \log p_i(\mathbf{b}_i^T \mathbf{x}) \}$ optimized by FastICA.
1. Center the data to make its mean zero.
2. Choose an initial (e.g., random) separating matrix $\mathbf{B}$. Choose initial values
of $\gamma_i, i = 1, \ldots, n$, either randomly or using prior information. Choose the
learning rates $\mu$ and $\mu_\gamma$.
3. Compute $\mathbf{y} = \mathbf{B}\mathbf{x}$.
4. If the nonlinearities are not fixed a priori:
(a) update $\gamma_i \leftarrow (1 - \mu_\gamma)\gamma_i + \mu_\gamma \left[ -\tanh(y_i) y_i + (1 - \tanh^2(y_i)) \right]$.
(b) if $\gamma_i > 0$, define $g_i$ as in (9.18), otherwise define it as in (9.19).
5. Update the separating matrix by
$\mathbf{B} \leftarrow \mathbf{B} + \mu \left[ \mathbf{I} + g(\mathbf{y})\mathbf{y}^T \right] \mathbf{B}$ (9.21)
where $g(\mathbf{y}) = (g_1(y_1), \ldots, g_n(y_n))^T$.
6. If not converged, go back to step 3.
Table 9.1 The on-line stochastic natural gradient algorithm for maximum likelihood estimation.
Preliminary whitening is not shown here, but in practice it is highly recommended.
Thus we could directly use the same kind of derivation of the fixed-point iteration as was used in
Chapter 8.
In Eq. (8.42) in Chapter 8 we had the following form of the FastICA algorithm
(for whitened data $\mathbf{z}$):
$\mathbf{w} \leftarrow \mathbf{w} - \frac{E\{ \mathbf{z}\, g(\mathbf{w}^T \mathbf{z}) \} - \beta \mathbf{w}}{E\{ g'(\mathbf{w}^T \mathbf{z}) \} - \beta}$ (9.22)
where $\beta$ can be computed from (8.40) as $\beta = E\{ \mathbf{w}^T \mathbf{z}\, g(\mathbf{w}^T \mathbf{z}) \}$. If we write this in
matrix form, we obtain:
$\mathbf{W} \leftarrow \mathbf{W} + \mathrm{diag}(\alpha_i) \left[ \mathrm{diag}(\beta_i) + E\{ g(\mathbf{y}) \mathbf{y}^T \} \right] \mathbf{W}$ (9.23)
where $\mathbf{y}$ is defined as $\mathbf{y} = \mathbf{W}\mathbf{z}$, $\beta_i = -E\{ y_i g(y_i) \}$, and $\alpha_i = -1/(\beta_i + E\{ g'(y_i) \})$. To express this using
nonwhitened data, as we have done in this chapter, it is enough to multiply both sides
of (9.23) from the right by the whitening matrix $\mathbf{V}$. This means simply that we replace
the $\mathbf{W}$ by $\mathbf{B}$, since we have $\mathbf{B} = \mathbf{W}\mathbf{V}$, which implies $\mathbf{W}\mathbf{z} = \mathbf{B}\mathbf{x}$.
Thus, we obtain the basic iteration of FastICA as:
$\mathbf{B} \leftarrow \mathbf{B} + \mathrm{diag}(\alpha_i) \left[ \mathrm{diag}(\beta_i) + E\{ g(\mathbf{y}) \mathbf{y}^T \} \right] \mathbf{B}$ (9.24)
where $\mathbf{y} = \mathbf{B}\mathbf{x}$, $\beta_i = -E\{ y_i g(y_i) \}$, and $\alpha_i = -1/(\beta_i + E\{ g'(y_i) \})$.
After every step, the matrix $\mathbf{B}$ must be projected on the set of whitening matrices.
This can be accomplished by the classic method involving matrix square roots,
$\mathbf{B} \leftarrow (\mathbf{B}\mathbf{C}\mathbf{B}^T)^{-1/2}\, \mathbf{B}$ (9.25)
where $\mathbf{C} = E\{ \mathbf{x}\mathbf{x}^T \}$ is the correlation matrix of the data (see exercises). The inverse
square root is obtained as in (7.20). For alternative methods, see Section 8.4 and
Chapter 6, but note that those algorithms require that the data is prewhitened, since
they simply orthogonalize the matrix.
This version of FastICA is recapitulated in Table 9.2. FastICA can be compared
with the natural gradient method for maximizing the likelihood given in (9.17). Then
we see that FastICA can be considered as a computationally optimized version of the
gradient algorithm. In FastICA, convergence speed is optimized by the choice of the
matrices $\mathrm{diag}(\alpha_i)$ and $\mathrm{diag}(\beta_i)$. These two matrices give an optimal step size to be
used in the algorithm.
Another advantage of FastICA is that it can estimate both sub- and supergaussian
independent components without any additional steps: we can fix the nonlinearity $g$
to be equal to the tanh nonlinearity for all the independent components. The reason
is clear from (9.24): the matrix $\mathrm{diag}(\beta_i)$ contains estimates of the nature (sub- or
supergaussian) of the independent components. These estimates are used as in the
gradient algorithm in the previous subsection. On the other hand, the matrix $\mathrm{diag}(\alpha_i)$
can be considered as a scaling of the nonlinearities, since the factor $\alpha_i$ can
simply be absorbed into the definition of $g_i$. Thus we can
say that FastICA uses a richer parameterization of the densities than that used in
Section 9.1.2: a parameterized family instead of just two densities.
Note that in FastICA, the outputs $y_i$ are decorrelated and normalized to unit
variance after every step. No such operations are needed in the gradient algorithm.
variance after every step. No such operations are needed in the gradient algorithm.
FastICA is not stable if these additional operations are omitted. Thus the optimization
space is slightly reduced.
In the version given here, no preliminary whitening is done. In practice, it is often
highly recommended to do prewhitening, possibly combined with PCA dimension
reduction.
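The following sketch gives one possible batch implementation of this fixed-point scheme for nonwhitened data (cf. (9.24), (9.25), and Table 9.2); the step-size formulas and defaults are reconstructions used for illustration, not verbatim code from the book.

```python
import numpy as np

def fastica_ml(X, n_iter=100):
    """Fixed-point (FastICA-type) ML estimation on nonwhitened data,
    following the scheme of Table 9.2 with g(y) = tanh(y).
    The exact step-size formulas are a reconstruction, used here
    as an illustrative assumption."""
    n, T = X.shape
    X = X - X.mean(axis=1, keepdims=True)           # step 1: center
    C = (X @ X.T) / T                               # correlation matrix of the data
    B = np.eye(n)                                   # step 2: initial separating matrix
    for _ in range(n_iter):
        Y = B @ X                                   # step 3
        gY = np.tanh(Y)
        beta = -np.mean(Y * gY, axis=1)             # beta_i = -E{ y_i g(y_i) }
        alpha = -1.0 / (beta + np.mean(1.0 - gY ** 2, axis=1))
        # step 4: B <- B + diag(alpha)[diag(beta) + E{g(y) y^T}] B
        B = B + np.diag(alpha) @ (np.diag(beta) + (gY @ Y.T) / T) @ B
        # step 5: project B so that Bx is white, B <- (B C B^T)^{-1/2} B
        d, E = np.linalg.eigh(B @ C @ B.T)
        B = E @ np.diag(d ** -0.5) @ E.T @ B
    return B
```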
9.3 THE INFOMAX PRINCIPLE
An estimation principle for ICA that is very closely related to maximum likelihood
is the infomax principle [282, 36]. This is based on maximizing the output entropy,
or information flow, of a neural network with nonlinear outputs. Hence the name
infomax.
Assume that $\mathbf{x}$ is the input to a neural network whose outputs are of the form
$y_i = \phi_i(\mathbf{b}_i^T \mathbf{x}) + n_i$ (9.31)
where the $\phi_i$ are some nonlinear scalar functions, and the $\mathbf{b}_i$ are the weight vectors
of the neurons. The vector $\mathbf{n}$ is additive gaussian white noise. One then wants to
maximize the entropy of the outputs:
$H(\mathbf{y}) = H(y_1, \ldots, y_n)$ (9.32)
This can be motivated by considering information flow in a neural network. Efficient
information transmission requires that we maximize the mutual information between
the inputs $\mathbf{x}$ and the outputs $\mathbf{y}$.
1. Center the data to make its mean zero. Compute the correlation matrix
$\mathbf{C} = E\{ \mathbf{x}\mathbf{x}^T \}$.
2. Choose an initial (e.g., random) separating matrix $\mathbf{B}$.
3. Compute
$\mathbf{y} = \mathbf{B}\mathbf{x}$ (9.26)
$\beta_i = -E\{ y_i g(y_i) \}$, for $i = 1, \ldots, n$ (9.27)
$\alpha_i = -1/(\beta_i + E\{ g'(y_i) \})$, for $i = 1, \ldots, n$ (9.28)
4. Update the separating matrix by
$\mathbf{B} \leftarrow \mathbf{B} + \mathrm{diag}(\alpha_i) \left[ \mathrm{diag}(\beta_i) + E\{ g(\mathbf{y}) \mathbf{y}^T \} \right] \mathbf{B}$ (9.29)
5. Decorrelate and normalize by
$\mathbf{B} \leftarrow (\mathbf{B}\mathbf{C}\mathbf{B}^T)^{-1/2}\, \mathbf{B}$ (9.30)
6. If not converged, go back to step 3.
Table 9.2 The FastICA algorithm for maximum likelihood estimation. This is a version
without whitening; in practice, whitening combined with PCA may often be useful. The
nonlinear function $g$ is typically the tanh function.
This problem is meaningful only if there is some
information loss in the transmission. Therefore, we assume that there is some noise
in the network. It can then be shown (see exercises) that in the limit of no noise (i.e.,
with infinitely weak noise), maximization of this mutual information is equivalent to
maximization of the output entropy in (9.32). For simplicity, we therefore assume in
the following that the noise is of zero variance.
Using the classic formula for the entropy of a transformation (see Eq. (5.13)), we
have
$H(\mathbf{y}) = H(\mathbf{x}) + E\left\{ \log \left| \det \frac{\partial F}{\partial \mathbf{x}}(\mathbf{x}) \right| \right\}$ (9.33)
where $F(\mathbf{x}) = (\phi_1(\mathbf{b}_1^T \mathbf{x}), \ldots, \phi_n(\mathbf{b}_n^T \mathbf{x}))$ denotes the function defined by the neural
network. We can simply calculate the derivative to obtain
$H(\mathbf{y}) = H(\mathbf{x}) + E\left\{ \sum_{i=1}^{n} \log \phi_i'(\mathbf{b}_i^T \mathbf{x}) \right\} + \log |\det \mathbf{B}|$ (9.34)
Now we see that the output entropy is of the same form as the expectation of the
likelihood in Eq. (9.6), up to the term $H(\mathbf{x})$, which does not depend on $\mathbf{B}$. The pdf's of the independent components are here replaced
by the functions $\phi_i'$. Thus, if the nonlinearities $\phi_i$ used in the neural network are
chosen as the cumulative distribution functions corresponding to the densities $p_i$, i.e.,
$\phi_i' = p_i$, the output entropy is actually equal to the likelihood. This means that
infomax is equivalent to maximum likelihood estimation.
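As a small numeric illustration (mine, not the book's), take the logistic function as the output nonlinearity: its derivative is the logistic density, so the infomax objective (9.34), dropping the constant $H(\mathbf{x})$, is literally the same expression as the log-likelihood (9.6) with logistic component densities.

```python
import numpy as np

def logistic_cdf(u):          # phi(u): the network nonlinearity
    return 1.0 / (1.0 + np.exp(-u))

def logistic_pdf(u):          # p(u) = phi'(u): the matching assumed density
    c = logistic_cdf(u)
    return c * (1.0 - c)

rng = np.random.default_rng(0)
X = rng.laplace(size=(2, 5000))        # some data (illustrative)
B = rng.normal(size=(2, 2))            # some separating matrix (illustrative)

Y = B @ X
log_det = np.log(np.abs(np.linalg.det(B)))

phi = logistic_cdf(Y)
infomax_obj = np.mean(np.sum(np.log(phi * (1.0 - phi)), axis=0)) + log_det  # E{sum log phi'} + log|det B|
loglik = np.mean(np.sum(np.log(logistic_pdf(Y)), axis=0)) + log_det         # (9.6) with p_i = phi_i'

print(infomax_obj, loglik)   # equal, since phi_i' is exactly the assumed density
```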
9.4 EXAMPLES
Here we show the results of applying maximum likelihood estimation to the two
mixtures introduced in Chapter 7. Here, we use whitened data. This is not strictly
necessary, but the algorithms converge much better with whitened data. The algo-
rithms were always initialized so that $\mathbf{B}$ was the identity matrix.
First, we used the natural gradient ML algorithm in Table 9.1. In the first ex-
ample, we used the data consisting of two mixtures of two subgaussian (uniformly
distributed) independent components, and took the nonlinearity to be the one in
(9.18), corresponding to the density in (9.9). The algorithm did not converge prop-
erly, as shown in Fig. 9.2. This is because the nonlinearity was not correctly chosen.
Indeed, computing the nonpolynomial moment (9.20), we saw that it was negative,
which means that the nonlinearity in (9.19) should have been used. Using the correct
nonlinearity, we obtained correct convergence, as in Fig. 9.3. In both cases, several
hundred iterations were performed.

Next we did the corresponding estimation for two mixtures of two supergaussian
independent components. This time, the nonlinearity in (9.18) was the correct
one, and gave the estimates in Fig. 9.4. This could be checked by computing the
nonpolynomial moment in (9.20): It was positive. In contrast, using the nonlinearity
in (9.19) gave completely wrong estimates, as seen in Fig. 9.5.
In contrast to the gradient algorithm, FastICA effortlessly finds the independent
components in both cases. In Fig. 9.6, the results are shown for the subgaussian data,
and in Fig. 9.7, the results are shown for the supergaussian data. In both cases the
algorithm converged correctly, in a couple of iterations.
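For readers who want to reproduce these experiments, here is a sketch of how the mixed and whitened data can be generated; the mixing matrix and helper name are my own assumptions, and the estimation itself can then be run with, e.g., the natural gradient or FastICA sketches given earlier.

```python
import numpy as np

def make_whitened_mixtures(kind="subgaussian", T=5000, seed=0):
    """Generate two mixed sources and whiten them, as in the experiments of Section 9.4."""
    rng = np.random.default_rng(seed)
    if kind == "subgaussian":
        S = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(2, T))   # uniform, unit variance
    else:
        S = rng.laplace(size=(2, T)) / np.sqrt(2.0)                 # supergaussian, unit variance
    A = np.array([[1.0, 0.6], [0.4, 1.0]])        # an arbitrary mixing matrix (assumption)
    X = A @ S
    X = X - X.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh((X @ X.T) / T)          # whitening by eigendecomposition, cf. Chapter 7
    return E @ np.diag(d ** -0.5) @ E.T @ X

# e.g. Z = make_whitened_mixtures("supergaussian"); then run one of the sketches above on Z
```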
9.5 CONCLUDING REMARKS AND REFERENCES
Maximum likelihood estimation, perhaps the most commonly used statistical esti-
mation principle, can be used to estimate the ICA model as well. It is closely related
to the infomax principle used in neural network literature. If the densities of the
independent components are known in advance, a very simple gradient algorithm can
be derived. To speed up convergence, the natural gradient version and especially the
FastICA fixed-point algorithm can be used. If the densities of the independent com-
ponents are not known, the situation is somewhat more complicated. Fortunately,
however, it is enough to use a very rough density approximation. In the extreme
case, a family that contains just two densities to approximate the densities of the
independent components is enough. The choice of the density can then be based
on whether the independent components are sub- or supergaussian.
Such an estimate can be simply added to the gradient methods, and it is automatically
done in FastICA.
The first approaches to using maximum likelihood estimation for ICA were in
[140, 372]; see also [368, 371]. This approach became very popular after the
introduction of the algorithm in (9.16) by Bell and Sejnowski, who derived it using the
infomax principle [36]; see also [34]. The connection between these two approaches
was later proven by [64, 322, 363]. The natural gradient algorithm in (9.17) is
sometimes called the Bell-Sejnowski algorithm as well. However, the natural gradient
extension was actually introduced only in [12, 3]; for the underlying theory, see
[4, 11, 118]. This algorithm is actually almost identical to those introduced previously
[85, 84] based on nonlinear decorrelation, and quite similar to the one in [255, 71]
(see Chapter 12). In particular, [71] used the relative gradient approach, which in this
case is closely related to the natural gradient; see Chapter 14 for more details. Our
two-density family is closely related to those in [148, 270]; for alternative approaches
on modeling the distributions of the ICs, see [121, 125, 133, 464].
The stability criterion in Theorem 9.1 has been presented in different forms by
many authors [9, 71, 67, 69, 211]. The different forms are mainly due to the
complication of different normalizations, as discussed in [67]. We chose to normalize
the components to unit variance, which gives a simple theorem and is in line with
the approach of the other chapters. Note that in [12], it was proposed that a single
very high-order polynomial nonlinearity could be used as a universal nonlinearity.
Later research has shown that this is not possible, since we need at least two different
nonlinearities, as discussed in this chapter. Moreover, a high-order polynomial leads
to very nonrobust estimators.
Fig. 9.2
Problems of convergence with the (natural) gradient method for maximum likeli-
hood estimation. The data was two whitened mixtures of subgaussian independent components.
The nonlinearity was the one in (9.18), which was not correct in this case. The resulting es-
timates of the columns of the whitened mixing matrix are shown in the figure: they are not
aligned with the edges of the square, as they should be.
Fig. 9.3
The same as in Fig. 9.2, but with the correct nonlinearity, given by (9.19). This
time, the natural gradient algorithm gave the right result. The estimated vectors are aligned
with the edges of the square.
Fig. 9.4
In this experiment, data was two whitened mixtures of supergaussian independent
components. The nonlinearity was the one in (9.18). The natural gradient algorithm converged
correctly.
Fig. 9.5
Again, problems of convergence with the natural gradient method for maximum
likelihood estimation. The nonlinearity was the one in (9.19), which was not correct in this
case.
Fig. 9.6
FastICA automatically estimates the nature of the independent components, and
converges fast to the maximum likelihood solution. Here, the solution was found in 2 iterations
for subgaussian independent components.
Fig. 9.7
FastICA this time applied to supergaussian mixtures. Again, the solution was
found in 2 iterations.
Problems
9.1 Derive the likelihood in (9.4).
9.2 Derive (9.11) and (9.12).
9.3 Derive the gradient in (9.13).
9.4 Instead of the function $\tanh(y) - y$ in (9.19), one could use the function $g(y) = -y^3$. Show
that this corresponds to a subgaussian distribution by computing the kurtosis of the
distribution. Note the normalization constants involved.
9.5 After the preceding problem, one might be tempted to use $g(y) = y^3$ for super-
gaussian variables. Why is this not correct in the maximum likelihood framework?
9.6 Take a linear function $g$. What is the interpretation of this in the ML framework?
Conclude (once again) that $g$ must be nonlinear.
9.7 Assume that you use a more general function family, parameterized by a constant,
instead of the simple function in (9.18). What is the
interpretation of this constant in the likelihood framework?
9.8 Show that for a gaussian random variable, the nonpolynomial moment in (9.8)
is zero for any $g$.
9.9 The difference between using $-\tanh(y)$ or $-2\tanh(y)$ in the nonlinearity
(9.18) is a matter of normalization. Does it make any difference in the algorithms?
Consider separately the natural gradient algorithm and the FastICA algorithm.
9.10 Show that maximizing the mutual information of inputs and outputs in a
network of the form
$y_i = \phi_i(\mathbf{b}_i^T \mathbf{x}) + n_i$ (9.35)
where $\mathbf{n}$ is gaussian noise, and the output and input spaces have the same dimension,
as in (9.31), is equivalent to maximizing the entropy of the outputs, in the limit of
zero noise level. (You can compute the joint entropy of inputs and outputs using the
entropy transformation formula. Show that it is constant. Then set the noise level to be
infinitely small.)
9.11 Show that after the projection (9.25), the vector $\mathbf{B}\mathbf{x}$ is white.
Computer assignments
9.1 Take random variables of (1) uniform and (2) Laplacian distribution. Compute
the values of the nonpolynomial moment in (9.8) for different nonlinearities $g$. Are
the moments of different signs for any nonlinearity?
9.2 Reproduce the experiments in Section 9.4.
9.3 The densities of the independent components could be modeled by a density
family given by
$p(s) = c_1 \exp(c_2 |s|^{\alpha})$ (9.36)
where $c_1$ and $c_2$ are normalization constants that make this a pdf of unit variance. For
different values of $\alpha$, ranging from 0 to infinity, we get distributions with different
properties.
9.3.1. What happens when we have $\alpha = 2$?
9.3.2. Plot the pdf's, the logarithms of the pdf's, and the corresponding score
functions for the following values of $\alpha$: 0.2, 1, 2, 4, 10.
9.3.3. Conclude that for $\alpha > 2$ we have subgaussian densities, and for $\alpha < 2$
we have supergaussian densities.
Appendix: Proofs
Here we prove Theorem 9.1. Looking at the expectation of the log-likelihood, using the assumed
densities $\tilde{p}_i$:
$\frac{1}{T} E\{ \log L(\mathbf{B}) \} = \sum_{i=1}^{n} E\{ \log \tilde{p}_i(\mathbf{b}_i^T \mathbf{x}) \} + \log |\det \mathbf{B}|$ (A.1)
we see that the first term on the right-hand side is a sum of terms of the form $E\{ \log \tilde{p}_i(\mathbf{b}_i^T \mathbf{x}) \}$, as
in the stability theorem in Section 8.3.4. Using that theorem, we see immediately that the first
term is maximized when $\mathbf{B}$ gives independent components.
Thus, if we prove that the second term remains constant under the conditions of the theorem,
the theorem is proven. Now, uncorrelatedness and unit variance of the $y_i = \mathbf{b}_i^T \mathbf{x}$ means
$E\{ \mathbf{B}\mathbf{x}\mathbf{x}^T \mathbf{B}^T \} = \mathbf{B}\, E\{ \mathbf{x}\mathbf{x}^T \}\, \mathbf{B}^T = \mathbf{I}$, which implies
$1 = \det \mathbf{I} = \det\!\left( \mathbf{B}\, E\{ \mathbf{x}\mathbf{x}^T \}\, \mathbf{B}^T \right) = (\det \mathbf{B}) \left( \det E\{ \mathbf{x}\mathbf{x}^T \} \right) (\det \mathbf{B}^T)$ (A.2)
and this implies that $\det \mathbf{B}$ must be constant. Thus the theorem is proven.
