Solutions to exercise problems, Tik-61.181, autumn 98.

Tik-61.181 Special Course on IT I, autumn 98

The maximum amount of points is 30. Your total amount of points will be converted to a grade as follows:

Points:    Grade:
0-15       0
16-18      1
19-21      2
22-24      3
25-27      4
28-30      5

You must get at least grade 1 to pass the course.

Problems:

1.1
max 3p A discrete memoryless source emits a sequence of statistically independent binary digits with probabilities p(1) = 0.005 and p(0) = 0.995. The digits are taken 100 at a time and a binary codeword is provided for every sequence of 100 digits containing three of fewer ones.

(a) Assuming that all codewords are the same length, find the minimum length required to provide codewords for all sequences with three or fewer ones.
You need to compute how many sequences have 0-3 ones. This number is 1+100+100*99/2+100*99*98/6, which is less than 2^18. Therefore 18 bits is enough.

(b) Calculate the probability of observing a source sequence for which no codeword has been assigned.
This is the probability of a sequence, which has 4 or more ones. This equals one minus the probability of a sequence with 0-3 ones, i.e, 1 - (0.995)^100 - 100*0.005*(0.995)^99 - 100*99/2*(0.005)^2(0.995)^98 - 100*99*98/6*(0.005)^3(0.995)^97, approximately 1.7*10^(-3).

(c) Use Chebyshev's inequality to bound the probability of observing a source sequence for which no codeword has been assigned. Compare this bound with the actual probability computed in part (b).
We use the Chebyshev inequality P(t>=a)<=E[t]/a. If we denote t as the number of ones in the sequence, we get
P(t>=4)<=E[t]/4=0.005*100/4=0.5/4=0.125
This is much larger than the result of b). The other Chebyshev inequality can also be used, resulting in P(t>=4)<=0.0406... We omit the details.

1.2
max 2p Show that the entropy of the probability distribution, (p₁, ..., p_i, ..., p_j, ..., p_m), is less than the entropy of the distribution (p₁, ..., (p_i+p_j)/2, ..., (p_i+p_j)/2, ..., p_m). Show that in general any transfer of probability that makes the distribution more uniform increases the entropy.
We only need to look at the difference between the entropies of the mentioned probability distributions. Define f(p)=-p log p. Then the difference is
2*f(p_i+p_j)-f(p_i)-f(p_j)
which we need to show to be less than or equal to zero:
f((p_i+p_j)/2)<= 1/2 (f(p_i)+f(p_j))
This follows if f is convex. Since df/dp=log p+1, it follows that d^2 f/dp^2=1/p>0 when p>0, hence f is convex.
We need to define "more uniform". This is conveniently done using Kullback-Leibler divergence D(p,u) where u is the uniform distribution with m discrete probabilities 1/m. If D(p,u)>D(q,u), it means that q is "more uniform" than p. Since
D(q,u)=sum_i q_i log (q_i/(1/m))=-H(q)- sum_i q_i log m^^-1=-H(q)+log m
the inequality turns out to be -H(p)>-H(q) or H(p)<H(q) as desired.

2.1
max 2p Determine a Huffman code for the text of this problem.
Answer omitted, the code varies depending whether you include spaces or not etc...

2.2
max 2p Consider all possible 256×256 pixel images with 8 bit grayscale pixels. How many images can be compressed by 20 bits or more using a uniquely decodable code? Compare the result with the number of all possible images.
Each image has n=256*256*8 bits, then there are 2^n possible images. Denote l=n-20, which is the largest codelength of an image that fulfills the requirement given in the problem. Then Kraft-inequality gives an upper bound k as k*2^(-l)<=1 or k=2^l=2^(n-20). This is an upper bound, since shorter codelengths m<l would correspond to larger terms 2^(-m) in the Kraft inequality.
To compare k with 2^n, divide k/2^n=2^(n-20)/2^n=2^(-20), which is approximately 0.000095 percent. So compression is difficult!

2.3
max 2p Assume there is a source emitting symbols a and b with probability P(a | ß)=ß. Given a sequence s with N_a symbols a and N_b symbols b, what is the probability P(a | s) if p(ß) = [ß(1-ß)]^-1/2 / pi.
The problem is otherwise identical to the one solved in MacKay's book in section 4.2.4, except that the prior is different. The solution is (N_a + 1/2) / (N_a + N_b + 1).

3.1
max 2p The noisy channel coding theorem states that for a discrete memoryless channel, it is possible to achieve a finite rate of transmission with arbitrarily small error probability. Discuss the reasons why in real applications higher error probabilities are unavoidable.
Some reasons: the noisy channel coding theorem assumes, that the code is a block code and the block length can be chosen arbitrarily large. This is not reasonable in practice.
The channel capacity is assumed to be constant and known. This does not always hold in practice.

3.2
max 2p Which distributions p(x) maximize the differential entropy of x for each of the following constraints:

E(|x|)=1

Absolute value |x| is at most one

E(x^2)=1
Also calculate the maximum value of the entropy.
The task is to maximize an integral over -p(x)log(p(x)) under constraints. The constraints can be expressed by integrals over p(x)f(x) = constant. The first constraint is that p(x) has to integrate to unity (f(x) = 1) and the next constraint is the one given in the problem. In the first problem f(x) = |x| and in the third problem f(x) = x^2. In the second problem the domain of x is [-1,1] but there are no other constraints.
Using the method of Lagrange multipliers one can show that a general form of the solution is exp(c₁f₁(x) + c₂f₂(x) + ...), where c_i are the Lagrange multipliers, one for each constraint.
The solutions are:

p(x) = exp(-|x|)/2 (Laplace distribution), entropy ln 2e.

p(x) = 1/2 (even distribution), entropy ln 2.

p(x) = exp(-x^2 / 2)/sqrt(2 pi) (normal distribution), entropy 1/2 ln 2 pi e.

4.1
max 2p Recompute the posterior probability (10.33) in the book by using the given conjugate priors for the mean and variance.
Use the conjugate priors and add them into the posterior probability according to Bayes' theorem. Details omitted here.

4.2
max 2p With Laplace's method, what is the approximation for the posterior probability density if the posterior has n local maxima at x₁, x₂, ..., x_n, the corresponding joint densities of x_i and the data D are p(x₁, D), ..., p(x_n, D) and the corresponding Hessian matrices of -log p(x_i, D) are H₁, ..., H_n.
The solution is obtained as explained in the supplement to the problem. It is .

5.1
max 2p Estimate in bits the total sensory experience that you have had in your life - visual information, auditory information etc. Estimate how much information you have memorized. Compare these with the capacity of your brain assuming you have 10¹¹ neurons each making 1000 synaptic connections and that the capacity result for one neuron (two bits per connection) applies. Is your brain full yet?
You need to use heavy approximations. First, ignore everything but visual information; other sensory information should be a small fragment of the visual information. Estimate your lifetime, say, in seconds. Use only 1 significant digit:
30yrs*400days*20hrs*60minutes*60seconds
is about 900 million seconds. The eye has about 6-7 million cones (highly sensitive to light intensity) and is able to distinguihs about 1 million levels of intensity. Multiply 6 billion by 2 (two eyes) and by 3 (we see colors too). 36 billion times 900 million is 3*10^22 bits. Since we sleep etc., let's say our sensory experience is 10^22 bits. Oh, lets multiply by 20 since we do notice visual changes happening up to that rate. So the final number is 2*10^23 bits.
The capacity of the brain is 2*10^14 bits, far less than our sensory experience.
Don't worry, your brain did not fill up when you were 3*10^(-8) years old. Instead, the sensory experience we just computed is *raw* data. If I look at a peaceful scenery for 1 hour, my brain does not probably store the full data of what I see 20 times per second. Some data is ignored and the rest is represented, i.e. coded in the brain. The representation is probably used in coding future sensory data; what happens in the brain when I start to look at the scenery is different than what happens after 1 hour of looking.

6.1
max 2p In the MML principle the coding cost is measured by assuming a random discretisation of the parameters w_i with accuracies e_i. The coding cost of the accuracies is neglected. Show that the MML principle can be seen as a special case of ensemble learning and that the bits-back argument also justifies the neglection of the coding cost of the accuracies. (Hint: consider what kind of ensemble distribution for the weights is implicitly used in the MML principle.)
The ensemble distribution implicitly assumed in the MML method is a hyper-box whose sides have lengths e_i. Let us denote this distribution by Q(w). The coding length in MML is L(parameters) + L(data | parameters) = - log P(w₁)e₁ - log P(w₂)e₂ - ... - E_Q{log P(D | w)}.
The bits-back argument assumes that the sender picks the parameters w from distribution Q(w) and encodes them with an accuracy approaching infinity (e_i -> 0). This makes the terms -log P(w_i)e_i infinite, but an infinite number of bits are got back also, since the sender can embed a secondary message in the choise of the parameters. The expected total number of bits used in coding is L(parameters) + L(data | parameters) - bits back = - E_Q{log P(w)} - E_Q{log P(D | w) } + E_Q{log Q(w)}.
Choosing the Q(w) to be the same as in MML yields identical coding length since - E_Q{log P(w)} + E_Q{log Q(w)} = - log P(w₁)e₁ - log P(w₂)e₂ - ...
In the MML scheme the coding cost of the discretation accuracies e_i is neglected, but in the bits-back scheme the sender and the receiver can agree in advance on arbitrarily small discretation intervals e_i.

7.1
max 1p Exercise 20.1 in the book, page 387.
Only the sign of the constant n is significant, since the activation is a linear combination of the weights and the constant appears as a multiplier of the activation. The state is updated using a step function of the activation, where only the *sign* of the activation is significant.

7.2
max 2p Exercise 21.3 in the book, page 403.
Straightforward but tedious calculation, omitted here.

8.1
max 2p Exercise 19.1 in the book, page 379.
Straightforward but tedious calculation, omitted here.

8.2
max 2p
Exercise 13.3 in the book, page 276.
Straightforward calculation, omitted here.

Tästä sivusta vastaa Harri.Lappalainen@hut.fi.
Sivua on viimeksi päivitetty 26.2.1999.
URL: http://www.cis.hut.fi/cis/edu/Tik-61.181/solutions.html