
Bayesian probability theory

Reverend Thomas Bayes (1702-1761)

Meaning of probability

There are two schools of thought on the interpretation of probability. In classical statistics, probability is interpreted as the limiting frequency of an outcome when an experiment is repeated infinitely many times. For instance, in throwing a die, the probability of getting a three is one in six (exactly so only if the die is ideal).
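The limiting-frequency idea can be illustrated with a small simulation. The following is only a sketch: the function name relative_frequency, the fixed seed and the numbers of throws are choices made here for illustration, and a pseudo-random generator stands in for an ideal die.

```python
import random

def relative_frequency(n_throws, face=3, seed=0):
    """Fraction of n_throws simulated die throws that show `face`."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    hits = sum(1 for _ in range(n_throws) if rng.randint(1, 6) == face)
    return hits / n_throws

# The relative frequency should settle near 1/6 as the number of throws grows.
for n in (100, 10_000, 100_000):
    print(n, relative_frequency(n))
```

With more throws the printed frequency should stay closer to 1/6, although any finite run only approximates the limit.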

In everyday language, however, probability is understood in a wider sense. One can, for example, speak about the probability of rain tomorrow, even though the event is unique and there is no way its frequency could be measured by repeated experiments. Moreover, different people can give the same event different probabilities. This is natural, since different people have different background knowledge and beliefs.

The interpretation of Bayesian probability theory is very close to everyday language. Probability expresses how strongly someone believes in something. Belief is always subjective and depends on background knowledge. The notation P(A | B) means: how true A seems if B is assumed. Often not all of the background knowledge is written out, and P(A) can thus mean different things depending on which background assumptions are used. It is good to remember, however, that according to the Bayesian interpretation there is no absolute probability, since there is no absolutely correct set of background assumptions.

Sometimes the interpretation of probability has no effect on how the actual computations are carried out or on their result. For the probabilities in die throwing, for example, the interpretation has no significance. However, from the point of view of learning and intelligent systems, the difference in interpretation is significant.

Boolean algebra (George Boole 1854)

Propositions, for which probabilities are defined, obey the rules of Boolean algebra. A Boolean algebra is defined for elements with two binary operations, sum and product, and a unary operation, complement, which is denoted here by ¬. The axioms defining a Boolean algebra are

There exist elements 0 and 1, which are not equal.      [A1]
AB = BA                   A+B = B+A                     [A2]
A(B+C) = (AB)+(AC)        A+(BC) = (A+B)(A+C)           [A3]
1A = A                    0+A = A                       [A4]
A¬A = 0                   A+¬A = 1                      [A5]

The two axioms on the same row are dual: exchanging product with sum and 0 with 1 transforms one into the other. Let's denote the axioms in the left-hand column by a and those in the right-hand column by b, i.e., A2a means the axiom AB = BA and A2b the axiom A+B = B+A. From the axioms one can derive the following lemmas

¬¬A = A                                                 [L1]
AA = A                    A+A = A                       [L2]
¬1 = 0                    ¬0 = 1                        [L3]
AB = 0 & A+B = 1  =>  B = ¬A                            [L4]
0A = 0                    1+A = 1                       [L5]
A(A+B) = A                A+AB = A                      [L6]
A(BC) = (AB)C             A+(B+C) = (A+B)+C             [L7]
¬A(AB) = 0                ¬A+(A+B) = 1                  [L8]
¬(AB) = ¬A+¬B             ¬(A+B) = ¬A¬B                 [L9]
AB = 1 => A = 1           A+B = 0 => A = 0              [L10]

Boolean logic is obtained when only the elements 0 and 1 are included in the algebra. Zero is interpreted as false and one as true. Product corresponds to the and operation, sum to or, and complement to negation.
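In the two-element case, some of the axioms and lemmas can be checked by simply trying every combination of 0 and 1. The sketch below does this for a handful of them; the helper names NOT, AND and OR and the dictionary layout are choices made here, and the labels follow the convention of a for the product (left-hand) form and b for the sum (right-hand) form. Such a check verifies the identities for Boolean logic only; it is not a derivation from the axioms.

```python
from itertools import product

def NOT(a):
    return 1 - a          # complement: swaps 0 and 1

def AND(a, b):
    return a * b          # Boolean product

def OR(a, b):
    return max(a, b)      # Boolean sum: note that 1 + 1 = 1

# Each entry maps a label to a test that must hold for all values of A, B, C.
checks = {
    "A3a": lambda a, b, c: AND(a, OR(b, c)) == OR(AND(a, b), AND(a, c)),
    "A3b": lambda a, b, c: OR(a, AND(b, c)) == AND(OR(a, b), OR(a, c)),
    "L1":  lambda a, b, c: NOT(NOT(a)) == a,
    "L6a": lambda a, b, c: AND(a, OR(a, b)) == a,
    "L6b": lambda a, b, c: OR(a, AND(a, b)) == a,
    "L9a": lambda a, b, c: NOT(AND(a, b)) == OR(NOT(a), NOT(b)),
    "L9b": lambda a, b, c: NOT(OR(a, b)) == AND(NOT(a), NOT(b)),
}

# Try all 2^3 assignments of 0 and 1 to (A, B, C).
results = {
    name: all(test(a, b, c) for a, b, c in product((0, 1), repeat=3))
    for name, test in checks.items()
}
print(results)
```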

The basic rules of Bayesian probability theory

Bayesian probability theory can be based on a few simple rules. It is evident that a proposition and its negation are related. According to the sum rule, their probabilities sum to one.

Sum Rule: P(A | B) + P(¬A | B) = 1

If one wishes to verify the truth of AB, one can first verify A and then verify B assuming A. Hence P(AB | C) is evidently a function of P(A | C) and P(B | AC). The product rule states that this function is a product.

Product Rule: P(AB | C) = P(A | C) P(B | AC)

Probability is a real number between zero and one. It is not defined if the background assumptions (the premises) conflict. P(A | B¬B), for example, is undefined.
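The two rules can be tried out on a small numerical example. The sketch below assumes one particular model of belief: a table of weights over the four truth assignments of (A, B), with P(X | Y) computed as the weight where X and Y both hold divided by the weight where Y holds. The weights, the name joint and the function prob are made up for illustration. The example only shows that the rules hold in this model; it does not justify them.

```python
from fractions import Fraction

# Hypothetical degrees of belief in each truth assignment (A, B),
# all given the same background C. The numbers are made up.
joint = {
    (1, 1): Fraction(3, 10),
    (1, 0): Fraction(2, 10),
    (0, 1): Fraction(1, 10),
    (0, 0): Fraction(4, 10),
}

def prob(event, given=lambda a, b: True):
    """P(event | given, C) in the table model described above."""
    norm = sum(p for w, p in joint.items() if given(*w))
    if norm == 0:
        # Conflicting premises such as B¬B: the probability is undefined.
        raise ValueError("conflicting premises: probability is undefined")
    return sum(p for w, p in joint.items() if given(*w) and event(*w)) / norm

A = lambda a, b: a == 1
B = lambda a, b: b == 1
not_A = lambda a, b: a == 0
A_and_B = lambda a, b: a == 1 and b == 1

sum_rule_total = prob(A) + prob(not_A)      # sum rule: should equal 1
product_lhs = prob(A_and_B)                 # P(AB | C)
product_rhs = prob(A) * prob(B, given=A)    # P(A | C) P(B | AC)
print(sum_rule_total, product_lhs, product_rhs)
```

Exact fractions are used so that the two sides of each rule can be compared without rounding error.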

Other rules

Using the rules of arithmetic and Boolean algebra, all other rules of Bayesian probability theory can be derived from the sum and product rules. Let's take the derivation of the generalised sum rule as an example. In what follows, the rule applied at each step is noted at the end of the line, unless only basic arithmetic is used.
P(A+B | C) = [L1]
P(¬¬(A+B) | C) = [L9b]
P(¬(¬A¬B) | C) = [Sum Rule]
1 - P(¬A¬B | C) = [Product Rule]
1 - P(¬A | C) P(¬B | ¬AC) = [Sum Rule]
1 - P(¬A | C) [1 - P(B | ¬AC)] =
1 - P(¬A | C) + P(¬A | C) P(B | ¬AC) = [Sum Rule]
P(A | C) + P(¬A | C) P(B | ¬AC) = [Product Rule]
P(A | C) + P(¬AB | C) = [A2a]
P(A | C) + P(B¬A | C) = [Product Rule]
P(A | C) + P(B | C) P(¬A | BC) = [Sum Rule]
P(A | C) + P(B | C) [1 - P(A | BC)] =
P(A | C) + P(B | C) - P(B | C) P(A | BC) = [Product Rule]
P(A | C) + P(B | C) - P(BA | C) = [A2a]
P(A | C) + P(B | C) - P(AB | C)

Usually, of course, not all the intermediate results are presented. The equations P(1 | A) = 1 and P(A | B) > 0 => P(A | AB) = 1 can also be derived from the sum and product rules. Let's denote x = P(1 | A). Then

1 - x = 1 - P(1 | A) = P(0 | A) = P(1·0 | A) = P(1 | A) P(0 | 1A) = x(1 - x) => x² - 2x + 1 = 0,

whose only solution is x = 1. On the other hand,

P(A | B) = P(AA | B) = P(A | B) P(A | AB),

and it follows that P(A | AB) = 1 if P(A | B) > 0.
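The derivation of the generalised sum rule can be followed numerically by evaluating several of its intermediate expressions and checking that they agree. The sketch below assumes the same kind of table model of belief as one might use for any finite example: the weights in joint and the helper P are hypothetical and chosen here only for illustration.

```python
from fractions import Fraction

# Hypothetical beliefs in the truth assignments (A, B), given background C.
joint = {
    (1, 1): Fraction(3, 10),
    (1, 0): Fraction(2, 10),
    (0, 1): Fraction(1, 10),
    (0, 0): Fraction(4, 10),
}

def P(event, given=lambda a, b: True):
    """P(event | given, C): weight of event-and-given over weight of given."""
    norm = sum(p for w, p in joint.items() if given(*w))
    return sum(p for w, p in joint.items() if given(*w) and event(*w)) / norm

A = lambda a, b: a == 1
B = lambda a, b: b == 1
nA = lambda a, b: a == 0
nB = lambda a, b: b == 0

# Selected lines of the derivation, from P(A+B | C) to the final form.
steps = [
    P(lambda a, b: a == 1 or b == 1),                 # P(A+B | C)
    1 - P(lambda a, b: a == 0 and b == 0),            # 1 - P(¬A¬B | C)
    1 - P(nA) * P(nB, given=nA),                      # 1 - P(¬A | C) P(¬B | ¬AC)
    P(A) + P(nA) * P(B, given=nA),                    # P(A | C) + P(¬A | C) P(B | ¬AC)
    P(A) + P(B) * (1 - P(A, given=B)),                # P(A | C) + P(B | C) [1 - P(A | BC)]
    P(A) + P(B) - P(lambda a, b: a == 1 and b == 1),  # P(A | C) + P(B | C) - P(AB | C)
]
print(steps)
```

Every entry of steps should be the same fraction, which is what the chain of equalities claims.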

Marginalisation principle

Let's assume that B1, B2, ..., Bn are n propositions, exactly one of which is true. Thus B1 + B2 + ... + Bn = 1 and BiBj = 0, unless i = j. The generalised sum rule yields

P(AB1+AB2 | C) = P(AB1 | C) + P(AB2 | C) - P(AB1AB2 | C) = P(AB1 | C) + P(AB2 | C).

This follows from AB1AB2 = A(B1B2) = A0 = 0. Adding AB3 gives

P(AB1+AB2+AB3 | C) = P(AB1 | C) + P(AB2 | C) + P(AB3 | C) - P((AB1 + AB2)AB3 | C) = P(AB1 | C) + P(AB2 | C) + P(AB3 | C).

Continuing to ABn results in

P(AB1 + AB2 + ... + ABn | C) = P(AB1 | C) + P(AB2 | C) + ... + P(ABn | C).

On the other hand, since AB1 + AB2 + ... + ABn = A(B1 + B2 + ... + Bn) = A1 = A, we have

P(A | C) = P(AB1 | C) + P(AB2 | C) + ... + P(ABn | C).

By applying the product rule we get the marginalisation principle

P(A | C) = P(A | B1C) P(B1 | C) + ... + P(A | BnC) P(Bn | C).

The significance of the principle becomes clear when the propositions Bi are interpreted as possible explanations for A. The probability of A is then the sum of the probabilities that the different explanations give for A, weighted by the probabilities of the explanations.
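As a sketch of how the marginalisation principle is used in practice, the function below computes P(A | C) from the probabilities P(A | BiC) and P(Bi | C). The function name marginal, its argument names and the three-explanation example are assumptions made here; the numbers are invented.

```python
def marginal(likelihoods, priors):
    """P(A | C) = sum over i of P(A | Bi C) P(Bi | C).

    likelihoods[i] is P(A | Bi C) and priors[i] is P(Bi | C). The Bi are
    assumed to be mutually exclusive with exactly one of them true, so
    the priors must sum to one.
    """
    if len(likelihoods) != len(priors):
        raise ValueError("need one likelihood per explanation")
    if abs(sum(priors) - 1.0) > 1e-9:
        raise ValueError("priors must sum to one")
    return sum(l * p for l, p in zip(likelihoods, priors))

# Three hypothetical explanations with made-up probabilities.
priors = [0.5, 0.3, 0.2]        # P(B1 | C), P(B2 | C), P(B3 | C)
likelihoods = [0.9, 0.5, 0.1]   # P(A | B1C), P(A | B2C), P(A | B3C)
p_A = marginal(likelihoods, priors)
print(p_A)  # 0.9*0.5 + 0.5*0.3 + 0.1*0.2
```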

Bayes' rule

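Bayes' rule can be derived from the product rule. Applying the product rule to P(ABi | C) in both orders gives P(A | C) P(Bi | AC) = P(Bi | C) P(A | BiC), and dividing by P(A | C) yields the rule. It tells how the probabilities of the explanations change when A is observed.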

P(Bi | AC) = P(Bi | C) P(A | BiC) / P(A | C)

P(Bi | C) is the probability of Bi before A is known, and it is called the prior probability of Bi. Correspondingly, P(Bi | AC) is called the posterior probability of Bi. One can see from Bayes' rule that the posterior probability of an explanation Bi that explains A well, i.e., for which P(A | BiC) > P(A | C), is higher than its prior probability, and vice versa.

An example hopefully illuminates the use of Bayes' rule. A = I have fever, B1 = I have the flu and B2 = no flu = ¬B1. Let's assume that I know the probabilities P(A | B1C), P(A | B2C) and P(B1 | C), i.e., the probabilities of having fever when having flu, of having fever without having flu and of having flu in the first place. Let's assign them the numerical values P(A | B1C) = 0.95, P(A | B2C) = 0.05 and P(B1 | C) = 0.1. According to the marginalisation principle, the probability of having fever is

P(A | C) = P(A | B1C) P(B1 | C) + P(A | B2C) P(B2 | C) = 0.95 * 0.1 + 0.05 * 0.9 = 0.095 + 0.045 = 0.14.

The probability of having flu is originally fairly small, only one in 10. If it now turns out that I have fever, the probability of flu increases

P(B1 | AC) = P(B1 | C) P(A | B1C) / P(A | C) = 0.1 * 0.95 / 0.14 ≈ 0.68.

Together, the marginalisation principle and Bayes' rule tell how the belief in a hypothesis changes when observations are made and how the beliefs in hypotheses are taken into account when making predictions based on them.
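The fever and flu example can be reproduced with a few lines of code. This is a minimal sketch: the function name posterior and the list layout are choices made here, while the numbers are those of the example.

```python
def posterior(priors, likelihoods):
    """Return the list of P(Bi | AC) given P(Bi | C) and P(A | Bi C)."""
    # Marginalisation principle: P(A | C) = sum_i P(A | Bi C) P(Bi | C).
    evidence = sum(p * l for p, l in zip(priors, likelihoods))
    # Bayes' rule for each explanation Bi.
    return [p * l / evidence for p, l in zip(priors, likelihoods)]

priors = [0.1, 0.9]          # P(B1 | C) = flu, P(B2 | C) = no flu
likelihoods = [0.95, 0.05]   # P(A | B1C), P(A | B2C): fever given each case

p_flu, p_no_flu = posterior(priors, likelihoods)
print(round(p_flu, 2), round(p_no_flu, 2))
```

The posterior probabilities sum to one because the explanations are exclusive and one of them is true, so observing fever moves belief from "no flu" to "flu" rather than creating it from nothing.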

Probability density

With real-valued quantities, the probability of any particular value is usually zero. If, for instance, according to a measurement the length of a pencil is about 16 cm, the probability of the length being exactly 16 cm is zero. The probability that the length is between 15 cm and 17 cm can, in contrast, easily be very close to one.

The phenomenon is the same as in measuring a mass. If one takes a single point of an object, it doesn't have any mass. If one takes a volume instead, the mass differs from zero. Just as the density of an object equals its mass divided by its volume, the probability density is the probability of a small range divided by the length of the range.
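The pencil example can be made concrete under an additional assumption that is not part of the text: that the belief about the length is a normal distribution with mean 16 cm and standard deviation 0.3 cm. Both numbers, and the function names below, are hypothetical. The sketch shows that a single point has probability zero, that the range from 15 cm to 17 cm has probability close to one, and that the probability of a range divided by its length approaches the density as the range shrinks.

```python
import math

MEAN, STD = 16.0, 0.3   # hypothetical belief about the pencil length, in cm

def cdf(x):
    """P(length <= x) for the assumed normal distribution."""
    return 0.5 * (1.0 + math.erf((x - MEAN) / (STD * math.sqrt(2.0))))

def prob_between(lo, hi):
    """Probability mass of the range from lo to hi."""
    return cdf(hi) - cdf(lo)

def pdf(x):
    """Probability density of the assumed normal distribution at x."""
    z = (x - MEAN) / STD
    return math.exp(-0.5 * z * z) / (STD * math.sqrt(2.0 * math.pi))

print(prob_between(16.0, 16.0))   # a single point carries no probability mass
print(prob_between(15.0, 17.0))   # a wide range around the measurement

# Mass divided by length approaches the density as the range shrinks.
for width in (1.0, 0.1, 0.001):
    lo, hi = 16.0 - width / 2, 16.0 + width / 2
    print(width, prob_between(lo, hi) / width)
print(pdf(16.0))
```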

Bayes' rule remains the same when probability densities are used.

Often probability mass is denoted by capital P and density by lower case p, but usually it is clear from the context whether probability mass or density is meant.



Last updated 15.10.1998
Harri Lappalainen

<Harri.Lappalainen@hut.fi>