Christof Monz
Informatics Institute
University of Amsterdam
Data Mining
Week 1: Probabilities Refresher
Today’s Class
Quick refresher of probabilities
Essential Information Theory
Calculus in one slide
Probabilities: Refresher
Experiment (trial): Repeatable procedure with
well-defined possible outcomes
Sample Space (S): the set of all possible
outcomes (finite or infinite)
• Example: coin toss experiment; possible outcomes:
S = {heads, tails}
• Example: die toss experiment; possible outcomes:
S = {1,2,3,4,5,6}
Probabilities: Sample Space
Definition of sample space depends on what we
are asking
Sample Space (S): the set of all possible
outcomes
Example: die toss experiment for whether the
number is even or odd
• possible outcomes: {even, odd}
• not {1,2,3,4,5,6}
Probabilities: Definitions
An event is any subset of outcomes from the
sample space
Example: let A represent the event that the
outcome of the die toss experiment is
divisible by 3
• A = {3,6}
• A is a subset of the sample space S= {1,2,3,4,5,6}
Example: suppose sample space S =
{heart,spade,club,diamond} (deck of cards)
• let A represent the event of drawing a heart: A =
{heart}
• let B represent the event of drawing a red card: B =
{heart,diamond}
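Events as subsets can be modeled directly with Python sets. A minimal sketch (illustrative, not from the slides), using the die-toss event A from above:

```python
# Events are subsets of the sample space; Python sets model this directly.
S = {1, 2, 3, 4, 5, 6}              # sample space of a die toss
A = {x for x in S if x % 3 == 0}    # event: outcome divisible by 3

print(A)        # {3, 6}
print(A <= S)   # True: every event is a subset of S
```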
Probability Function
The probability law assigns to each event A a
nonnegative number P(A), called the
probability of A
P(A) encodes our knowledge or belief about the
collective likelihood of all the elements of A
The probability law must satisfy certain properties (axioms)
Probability Axioms
Non-negativity: P(A) ≥ 0, for every event A
Additivity: If A and B are two disjoint events,
then the probability of their union satisfies:
P(A ∪B) = P(A)+P(B)
Normalization: The probability of the entire
sample space S is equal to 1, i.e. P(S) = 1
Probabilities: Example
An experiment involving a single coin toss
There are two possible outcomes, H and T, i.e.
the sample space S = {H,T}
If the coin is fair, we should assign equal
probabilities to the two outcomes
P({H}) = 0.5
P({T}) = 0.5
P({H,T}) = P({H})+P({T}) = 1.0
Probabilities: Example II
Experiment involving 3 coin tosses
Outcome is a 3-long string of H or T: S =
{HHH,HHT,HTH,HTT,THH,THT,TTH,TTT}
Assume each outcome is equiprobable
(“Uniform distribution”)
What is the probability of the event that exactly 2
heads occur?
A = {HHT,HTH,THH}
P(A) = P({HHT})+P({HTH})+P({THH})
P(A)= 1/8 + 1/8 + 1/8 = 3/8
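The computation above can be reproduced by enumerating the sample space. A minimal sketch (illustrative, not from the slides):

```python
from itertools import product

# Enumerate all 2^3 equiprobable outcomes of three coin tosses
outcomes = [''.join(t) for t in product('HT', repeat=3)]
A = [o for o in outcomes if o.count('H') == 2]  # exactly two heads

p = len(A) / len(outcomes)
print(A)   # ['HHT', 'HTH', 'THH']
print(p)   # 0.375 == 3/8
```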
Joint and Conditional Probabilities
The joint probability P(A,B) is the probability
of two events (A and B) occurring together
The conditional probability P(A|B): Assume
event B is the case, what is probability of event
A being the case as well?
Note: P(A|B) ≠ P(A,B) in general
Definition: P(A|B) = P(A,B) / P(B)
P(A,B) = P(B,A), but P(A|B) ≠ P(B|A) in general
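A small check of the definition, assuming a fair die with the uniform probability law (an illustrative choice of events, not from the slides):

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
A = {3, 6}       # event: divisible by 3
B = {2, 4, 6}    # event: even

P = lambda E: Fraction(len(E), len(S))   # uniform probability on S

p_A_given_B = P(A & B) / P(B)   # definition: P(A|B) = P(A,B) / P(B)
print(p_A_given_B)              # 1/3
p_B_given_A = P(A & B) / P(A)   # 1/2, so P(A|B) != P(B|A) here
print(p_B_given_A)              # 1/2
```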
Bayes’ Rule
Chain Rule:
P(A,B) = P(A|B)P(B) = P(B|A)P(A)
Bayes’ rule lets us swap the order of dependence
between events
P(A|B) = P(B|A)P(A) / P(B)
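Bayes' rule can be checked numerically against the direct definition. A minimal sketch, assuming the same fair-die events as before (illustrative, not from the slides):

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
A, B = {3, 6}, {2, 4, 6}                 # divisible by 3; even
P = lambda E: Fraction(len(E), len(S))   # uniform probability on S

p_B_given_A = P(A & B) / P(A)
bayes = p_B_given_A * P(A) / P(B)        # Bayes' rule: P(B|A)P(A)/P(B)
direct = P(A & B) / P(B)                 # direct definition of P(A|B)
print(bayes == direct)   # True
```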
Determining Probabilities
So far we have assumed that the values P
assigns to events are given
Determining P is an important part of machine
learning
In an empirical setting, P is often estimated
using relative frequencies:
• P(A) = freq(A) / N
where freq(A) is the frequency of A in a sample set, and
N is the size of the sample set
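A relative-frequency estimate can be simulated directly. A minimal sketch (illustrative sample size and event, not from the slides):

```python
import random

random.seed(0)   # fixed seed so the run is reproducible
# Simulate N die tosses and estimate P(divisible by 3) = freq(A) / N
N = 100_000
sample = [random.randint(1, 6) for _ in range(N)]
freq_A = sum(1 for x in sample if x % 3 == 0)
print(freq_A / N)   # close to the true value 2/6 ≈ 0.333
```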
Entropy
Entropy measures the amount of uncertainty in
a random variable X (which ranges over the points
of the sample space)
The amount of uncertainty is commonly
measured in bits
H(p) = H(X) = − ∑x∈X p(x) log2 p(x)
Entropy: Example
Let X represent the result of rolling a fair
8-sided die
H(X) = − ∑x∈X p(x) log2 p(x)
H(X) = − ∑x∈X 1/8 · log2 1/8
H(X) = − 8 · 1/8 · (−3) = 3
Each equiprobable outcome can be represented
by 3 bits:
outcome: 1   2   3   4   5   6   7   8
code:    001 010 011 100 101 110 111 000
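The entropy formula is short enough to code up and check against the 3-bit result. A minimal sketch (illustrative helper name, not from the slides):

```python
from math import log2

def entropy(probs):
    """Entropy in bits: H = -sum p(x) log2 p(x); terms with p = 0 contribute 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([1/8] * 8))   # 3.0 bits for a fair 8-sided die
```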
Entropy: Better Encoding
If the probability distribution is not uniform, one
can achieve lower entropy
Example: Consider an unfair 4-sided die
value probability
1 0.5
2 0.125
3 0.125
4 0.25
H(X) = −(0.5 log2 0.5 + 0.25 log2 0.25 +
0.125 log2 0.125 + 0.125 log2 0.125) = 1.75
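The 1.75-bit value can be verified directly (a small check, not from the slides):

```python
from math import log2

probs = [0.5, 0.125, 0.125, 0.25]     # unfair 4-sided die from the table
H = -sum(p * log2(p) for p in probs)  # H(X) = -sum p(x) log2 p(x)
print(H)   # 1.75
```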
Entropy: Better Encoding
value probability code1 code2
1 0.5 00 0
2 0.125 01 110
3 0.125 10 111
4 0.25 11 10
Average number of bits:
• code1: 0.5 · 2 + 0.125 · 2 + 0.125 · 2 + 0.25 · 2 bits = 2 bits
• code2: 0.5 · 1 + 0.125 · 3 + 0.125 · 3 + 0.25 · 2 bits = 1.75 bits
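The average code lengths can be computed from the table above. A minimal sketch (illustrative, not from the slides); note that code2's 1.75 bits matches the entropy of the distribution:

```python
probs = {1: 0.5, 2: 0.125, 3: 0.125, 4: 0.25}
code1 = {1: '00', 2: '01', 3: '10', 4: '11'}
code2 = {1: '0', 2: '110', 3: '111', 4: '10'}

# Average code length: sum over values of p(v) * length of v's codeword
avg = lambda code: sum(probs[v] * len(code[v]) for v in probs)
print(avg(code1))   # 2.0 bits
print(avg(code2))   # 1.75 bits, matching the entropy H(X)
```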
Entropy: Saving Bits
Coding tree: How many yes-no questions must
be asked to determine each message?
[Coding tree figure: codewords 0, 10, 110, 111]
In general, the optimal number of bits can be
computed as:
− log2p(x) bits for each message x ∈ X
or: log2 (1/p(x)) bits for each message x ∈ X
Tiny Calculus Refresher
Derivative: The (first) derivative of a function
allows us to compute its rate of change at any
point
Rate of change: slope of the tangent
For multi-variable functions we compute the
partial derivatives for each variable separately
Derivatives are computed by applying
differentiation rules:
• ∂/∂x (φ + ψ) = ∂φ/∂x + ∂ψ/∂x
• ∂/∂x (c x^n) = c n x^(n−1)
• ∂/∂x f(g(x)) = f′(g(x)) · g′(x) (chain rule)
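The chain rule can be sanity-checked numerically with a central difference. A minimal sketch, assuming the illustrative choices f(u) = u³ and g(x) = 2x + 1 (not from the slides):

```python
# Numerically check the chain rule d/dx f(g(x)) = f'(g(x)) * g'(x)
def numeric_diff(fn, x, h=1e-6):
    """Central-difference approximation of the first derivative."""
    return (fn(x + h) - fn(x - h)) / (2 * h)

f = lambda u: u ** 3          # f'(u) = 3u^2
g = lambda x: 2 * x + 1       # g'(x) = 2
composed = lambda x: f(g(x))

x = 1.5
analytic = 3 * g(x) ** 2 * 2  # f'(g(x)) * g'(x) by the chain rule
numeric = numeric_diff(composed, x)
print(abs(analytic - numeric) < 1e-4)   # True
```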
Recap
Probability distributions (joint, conditional)
Bayes’ rule
Entropy