2. 2 Probability A random variable isthe basic element of probability Refers to an event and there is some degree of uncertainty as to the outcome of the event For example, the random variable A could be the event of getting a heads on a coin flip
3. Classical vs. Bayesian Classical: Experiments are infinitely repeatable under the same conditions (hence: ’frequentist’) The parameter of interest (θ) is fixed and unknown Bayesian: Each experiment is unique (i.e., not repeatable) The parameter of interest has an unknown distribution
4. Classical Probability Property of environment ‘Physical’ probability Imagine all data sets of size N that could be generated by sampling from the distribution determined by parameters. Each data set occurs with some probability and produces an estimate “The probability of getting heads on this particular coin is 50%”
5. Bayesian Probability ‘Personal’ probability Degree of belief Property of person who assigns it Observations are fixed, imagine all possible values of parameters from which they could have come “I think the coin will land on heads 50% of the time”
7. Conditional Probability P(A = true | B = true) = Out of all the outcomes in which B is true, how many also have A equal to true Read this as: “Probability of A conditioned on B” or “Probability of A given B” H = “Have a headache” F = “Coming down with Flu” P(H = true) = 1/10 P(F = true) = 1/40 P(H = true | F = true) = 1/2 “Headaches are rare and flu is rarer, but if you’re coming down with flu there’s a 50-50 chance you’ll have a headache.” P(F = true) P(H = true)
8. The Joint Probability Distribution We will write P(A = true, B = true) to mean “the probability of A = trueandB = true” Notice that: P(H=true|F=true) P(F = true) P(H = true) In general, P(X|Y)=P(X,Y)/P(Y)
9. The Joint Probability Distribution Joint probabilities can be between any number of variables eg. P(A = true, B = true, C = true) For each combination of variables, we need to say how probable that combination is The probabilities of these combinations need to sum to 1 Sums to 1
10.
11. P(A=true, B = true | C=true) = P(A = true, B = true, C = true) / P(C = true)
12. The Problem with the Joint Distribution Lots of entries in the table to fill up! For k Boolean random variables, you need a table of size 2k How do we use fewer numbers? Need the concept of independence
13. Independence Variables A and B are independent if any of the following hold: P(A,B) = P(A)P(B) P(A | B) = P(A) P(B | A) = P(B) This says that knowing the outcome of A does not tell me anything new about the outcome of B.
14. Independence How is independence useful? Suppose you have n coin flips and you want to calculate the joint distribution P(C1, …, Cn) If the coin flips are not independent, you need 2n values in the table If the coin flips are independent, then Each P(Ci) table has 2 entries and there are n of them for a total of 2n values
15. Conditional Independence Variables A and B are conditionally independent given C if any of the following hold: P(A, B | C) = P(A | C)P(B | C) P(A | B, C) = P(A | C) P(B | A, C) = P(B | C) Knowing C tells me everything about B. I don’t gain anything by knowing A (either because A doesn’t influence B or because knowing C provides all the information knowing A would give)
16. Example: A Clinical Test Let’s move to a simple example, a clinical test. Consider a rare disease such that at any given point of time, on average 1 in 1000 people in the population has the disease. There exists also a clinical test for the disease (a blood test, for example), but it is not perfect. The sensitivity of the test, i.e. the probability of giving a correct positive, is .99. The specificity of the test, i.e. the probability of giving a correct negative is .98. To better visualise it, let’s draw a tree
18. Example: A Clinical Test Now let us consider a person, who gets worried, takes a test, and get’s a positive result. Remembering, that the test is not perfect, what can we say about the actual chances of his being ill?
19. Example: A Clinical Test .99 + .0001 Disease .01 - .02 + .9999 No Disease .98 - Applying the Bayes’ formula we get the result: the probability of our patient being ill, given what we know about the prevalence of the disease in the population and about the test performance, is .0049. Naturally, higher than in the background population, but still very small. Suppose it is customary to repeat the test, and the result is positive again
20. .99 + .0001 Disease .01 - .02 + .9999 No Disease .98 - Example: A Clinical Test .99 + .0049 Disease .01 - .02 + .9951 No Disease .98 - We can still use the same formula, of course, and the same tree for the visualisation, but now the probabilities of having the disease and being disease free before the test are not .0001 and .9999 respectively, but. Instead, are .0049 and .9951 based on the result of the test so far. After applying the formula this time, we get the result of .20. Still not too high, implying that the test is probably not very efficient, because evidently you have to repeat it many times before you become reasonably sure.
21. Bayesian Inference: Combine prior with model to produce posterior probability distribution Step-wise modeling:
22. Example Thumbtack problem: will it land on the point (heads) or the flat bit (tails)? Flip it N times What will it do on the N+1th time? How to compute p(xN+1|D, ξ) from p(θ|ξ)?
23. Inference as Learning Suppose you have a coin with an unknown bias, θ ≡ P(head). You flip the coin multiple times and observe the outcome. From observations, you can infer the bias of the coin This is learning. This is inference.
29. Posterior Distribution When the data is ’strong enough’ the posterior will not depend on prior Unless some extraneous information is available, a non-informative or vague prior can be used, in which case the inference will mostly be based on data. If, however, some prior knowledge is available regarding the parameters of interest, then it should be included in the reasoning. Furthermore, it is a common practive to conduct a sensitivity analysis – in other words, to examine the effect the prior might have on the posterior distribution. In any case, the prior assumptions should always be made explicit, and if the result is found to be dependent on them, such sensitivity should be investigated in detail.
32. A Bayesian Network A Bayesian network is made up of: 1. A Directed Acyclic Graph A B C D 2. A set of tables for each node in the graph
33. A Directed Acyclic Graph A node X is a parent of another node Y if there is an arrow from node X to node Yeg. A is a parent of B Each node in the graph is a random variable A B C D Informally, an arrow from node X to node Y means X has a direct influence on Y
34. A Set of Tables for Each Node Each node Xi has a conditional probability distribution P(Xi | Parents(Xi)) that quantifies the effect of the parents on the node The parameters are the probabilities in these conditional probability tables (CPTs) A B C D
35. A Set of Tables for Each Node Conditional Probability Distribution for C given B For a given combination of values of the parents (B in this example), the entries for P(C=true | B) and P(C=false | B) must add up to 1 eg. P(C=true | B=false) + P(C=false |B=false )=1 If you have a Boolean variable with k Boolean parents, this table has 2k+1 probabilities (but only 2k need to be stored)
36. Bayesian Networks Two important properties: Encodes the conditional independence relationships between the variables in the graph structure Is a compact representation of the joint probability distribution over the variables
37. Conditional Independence The Markov condition: given its parents (P1, P2), a node (X) is conditionally independent of its non-descendants (ND1, ND2) P1 P2 X ND2 ND1 C1 C2
38. The Joint Probability Distribution Due to the Markov condition, we can compute the joint probability distribution over all the variables X1, …, Xn in the Bayesian net using the formula: Where Parents(Xi) means the values of the Parents of the node Xi with respect to the graph
39. Using a Bayesian Network Example Using the network in the example, suppose you want to calculate: P(A = true, B = true, C = true, D = true) = P(A = true) * P(B = true | A = true) * P(C = true | B = true) P( D = true | B = true) = (0.4)*(0.3)*(0.1)*(0.95) A B C D
40. Using a Bayesian Network Example Using the network in the example, suppose you want to calculate: P(A = true, B = true, C = true, D = true) = P(A = true) * P(B = true | A = true) * P(C = true | B = true) P( D = true | B = true) = (0.4)*(0.3)*(0.1)*(0.95) This is from the graph structure A B These numbers are from the conditional probability tables C D
41. Inference Using a Bayesian network to compute probabilities is called inference In general, inference involves queries of the form: P( X | E ) E = The evidence variable(s) X = The query variable(s)
42. Inference HasAnthrax HasCough HasFever HasDifficultyBreathing HasWideMediastinum An example of a query would be: P( HasAnthrax = true | HasFever = true, HasCough= true) Note: Even though HasDifficultyBreathing and HasWideMediastinum are in the Bayesian network, they are not given values in the query (ie. they do not appear either as query variables or evidence variables) They are treated as unobserved variables