STATISTICAL METHODS
Probability review/Fundamentals
Enric Cecilla
Brainlab
STATISTICAL METHODS
 Fundamentals
 Basic probability review.
 Random variable.
 Function of a Random variable.
 Sampling distribution.
 Central Limit Theorem.
 Sample mean & Sample Variance
STATISTICAL METHODS
 Inference.
 Introduction
 Confidence interval
 with known variance
 with unknown variance
 Bootstrap
 Introduction
 Examples
STATISTICAL METHODS
 Multivariate methods (PCA).
 Vector basis.
 Orthogonal projection of a vector.
 Eigenvector & eigenvalue.
 Numerical approximation.
 Closed-form solution via covariance matrix.
 SVD decomposition.
 Applications
 Dimensionality reduction.
STATISTICAL METHODS
 Multivariate methods (ICA).
 Independence and uncorrelation
 Whitening
 ICA Example
I - FUNDAMENTALS
WHAT'S A RANDOM EXPERIMENT?
 A task that might lead to different outcomes.
 An elementary event is each of the possible
outcomes of an experiment. Ex:
 Tossing a coin has two elementary events: {head, tail}
 Tossing two coins has four: {hh, ht, th, tt}
 The set of all possible outcomes is called the
sample space. Ex:
 The sample space for rolling a die is:
S = {1,2,3,4,5,6}
 An event is any subset of the sample space.
 We define the event "odd number" in the experiment
of rolling a die:
Sodd = {1,3,5}
WHAT’S A PROBABILITY?
 Every sample space has its related probability
space.
 When we carry out an experiment several times,
the relative frequency of an event is the quotient
between the number of times the event occurs
and the total number of repetitions. Ex:
 We repeat the experiment "tossing a coin" 10 times
and this is the outcome: {t,t,t,h,t,h,h,t,h,t}
 Freq(t) = n(t)/N = 6/10 = 0.6
 Freq(h) = n(h)/N = 4/10 = 0.4
 The probability is the limit of the relative
frequency as N → ∞
PROBABILITY LIMIT IN A TOSSING
COIN EXP. (BERNOULLI EXP.)
 Plot of the relative frequency of the outcome
“head” in a coin toss experiment as a function of
the number of repetitions.
 Freq(head) = n(head)/N
PROBABILITY AXIOMS &
PROPERTIES
 Axioms:
 P(A) ≥ 0
 P(S) = 1
 Given A1,..,An mutually exclusive events:
 P( A1 U A2 U … U An) = P(A1) + … + P(An)
 Properties:
 P(ø) = 0
 P(Aᶜ) = 1 − P(A)
 If A ⊂ B ⇒ P(A) ≤ P(B)
 P(A U B) = P(A) + P(B) – P(A∩B)
RANDOM VARIABLE
 A rv is a variable whose value, a number, is a
function of the outcome of a random experiment.
Ex:
 For a coin toss, the possible events are heads or tails.
The number of heads appearing in one (fair) coin toss
can be described using the following random variable:
X = 1 if the outcome is head, 0 if tail
(head ↦ 1, tail ↦ 0)
RANDOM VARIABLE METAPHOR
 It's like a black box that gives you an unknown number
every time you ask for one.
 These numbers are somehow related! The frequency
of appearance, when the sequence is long enough, is
the probability density/mass function. The longer the
sequence is, the more accurate information we have
about the random variable.
R.V. 1,2,5,1,7,5,1,2,1,5,1,4
RANDOM VARIABLE
 We can classify random variables into two big
sets:
 Discrete random variable: the sample space is a
countable set, e.g. the integers.
 Continuous random variable: the possible values
lie in a range of real numbers.
PROBABILITY MASS/DENSITY
FUNCTION
 All outcomes of the random variable (sample
space) have an associated probability via the
probability mass/density function.
P( X = x )
FUNCTION OF A RANDOM
VARIABLE
 A function of a random variable is another
random variable : Y = g(X)
 Because it is a random variable it has a
probability mass/density function.
 If X is the random variable associated with
rolling a die:
 We could define the random variable Y = f(X) = 2X.
The possible values of this new random variable are
{2,4,6,8,10,12}, with probability 1/6 each. What is
PY(y)?
 Another function we could define is Y = g(X) = 1 if x
is even, 0 if x is odd. What are SY and PY(y)?
(A small sketch of how to tabulate these follows below.)
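A minimal sketch of how the induced PMF of Y = g(X) could be tabulated in MATLAB (the helper names below are only illustrative, and a fair die is assumed):
% Sketch: PMF of Y = g(X) for a fair die, X in {1,...,6} with P(X = x) = 1/6.
Sx    = 1:6;
pmf_x = ones(1,6)/6;
% Y = 2X: support {2,4,...,12}, each value inherits probability 1/6.
g     = @(x) 2*x;
Sy    = unique(g(Sx));
pmf_y = arrayfun(@(y) sum(pmf_x(g(Sx) == y)), Sy);            % collect the mass per value of y
% Y = 1 if x is even, 0 if x is odd: support {0,1}, each with probability 1/2.
h     = @(x) mod(x,2) == 0;
Sh    = unique(double(h(Sx)));
pmf_h = arrayfun(@(y) sum(pmf_x(double(h(Sx)) == y)), Sh);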
EXPECTATION AND VARIANCE OF A
RANDOM VARIABLE
 Given a random variable X, we define the
expectation as the centre of mass of its probability
mass/density function and the variance as the spread
around this centre.
MEAN & VARIANCE OF A RANDOM
VARIABLE
E[X] & VAR[X]
E[X] = μ = Σ x P(X = x)
Var[X] = E[(X − μ)²] = Σ (x − μ)² P(X = x)
 Ex: X ~ Bin(7, 0.4)
function E = expectation( sample_space, pmf )
E = sum( sample_space .* pmf );
end
function V = variance( sample_space, pmf )
E = expectation(sample_space,pmf);
V = sum((sample_space-E).^2 .* pmf);
end
>> n = 7;
>> p = 0.4;
>> S = 0:n;
>> pdf = binopdf(S, n, p);
>> stem(S,pdf);
>> expectation(S,pdf)
ans = 2.8000
>> variance(S,pdf)
ans =1.6800
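The hand-rolled helpers can be cross-checked against the closed-form binomial moments np and np(1 − p); binostat, from the same Statistics Toolbox as binopdf, should return the same 2.8 and 1.68:
>> [m, v] = binostat(7, 0.4)   % m = np = 2.8, v = np(1-p) = 1.68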
SKEWNESS & KURTOSIS
 The skewness is a measure of asymmetry of the
probability distribution.
 It is the third standardized moment around the
mean.
SKEWNESS & KURTOSIS
 Kurtosis is a measure of the normality/peakedness of the
random variable. It is the fourth standardized
moment around the mean: μ₄/σ⁴.
 Sometimes it is defined as μ₄/σ⁴ − 3, a correction
that makes the kurtosis of the normal distribution
equal to zero (excess kurtosis).
 Distributions with 0 excess kurtosis are called
mesokurtic.
 Distributions with positive excess kurtosis are
called leptokurtic.
 Distributions with negative excess kurtosis are
called platykurtic.
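The numerical demos that follow use a helper called kurt, which is not listed in the deck. A minimal sketch consistent with the excess-kurtosis values reported in the CLT demo (this is an assumption; the deck may compute it differently elsewhere):
% Sketch of sample skewness and excess kurtosis computed from raw samples.
function s = skew( x )
    m = mean(x);
    s = mean((x - m).^3) / std(x, 1)^3;       % third standardized moment
end
function k = kurt( x )
    m = mean(x);
    k = mean((x - m).^4) / std(x, 1)^4 - 3;   % fourth standardized moment, excess form
end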
KURTOSIS
SAMPLING DISTRIBUTION
 A sampling distribution is the distribution of a
given statistic (r.v. function) based on a random
sample of size n.
 The sampling distribution depends on the
underlying distribution of the population, the
statistic being considered, and the sample size
used.
SAMPLING DISTRIBUTION
 For example: consider a normal population with
mean μ and variance σ². Assume we take
samples of size n from this population and
calculate the arithmetic mean of each sample
(the sample mean statistic).
 Each sample will have its own average value, and
the distribution of these averages is called the
"sampling distribution of the sample mean".
 This distribution will be normal, N(μ, σ²/n).
 For other statistics and other populations the
formulas are frequently more complicated and
sometimes they don’t even exist in closed-form.
CENTRAL LIMIT THEOREM
 Let Sn be the sum of n i.i.d. random variables:
Sn = X1 + … + Xn (Xi ~ ANY distribution)
E[Sn] = nμ
Var[Sn] = nσ²
 We define Zn as:
Zn = (Sn − nμ)/(σ√n)
 As n grows to infinity, Zn converges to N(0,1).
 As n grows, the kurtosis converges to 0. So,
the more terms in the sum (n), the more normal
the random variable becomes.
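The demonstrations below call a helper sample_mean that is not shown in the deck. A plausible sketch, assuming a uniform[-1,1] default source, an 'exp' flag for the exponential case, and 10000 replications (all assumptions), and returning the raw sample means:
% Sketch: draw many samples of size n and return the sample mean of each one.
function out = sample_mean( n, dist )
    if nargin < 2, dist = 'uni'; end
    R   = 10000;                                 % number of replications (assumed)
    out = zeros(1, R);
    for j = 1:R
        switch dist
            case 'exp',  x = exprnd(1, 1, n);       % exponential source
            otherwise,   x = unifrnd(-1, 1, 1, n);  % uniform source on [-1, 1]
        end
        out(j) = mean(x);
    end
end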
CENTRAL LIMIT THEOREM
CENTRAL LIMIT THEOREM
NUMERICAL DEMONSTRATION
 As N grows the distribution converges to N(0,1).
>> out = sample_mean( 2);
>> kurt(out)
ans =
-0.5745
>> out = sample_mean( 5 );
>> kurt(out)
ans =
-0.1728
>> out = sample_mean( 10 );
>> kurt(out)
ans =
-0.1104
>> out = sample_mean( 20 );
>> kurt(out)
ans =
-0.0802
N = 2 N = 5 N = 20
No matter the source distribution.
NUMERICAL DEMONSTRATION
>> samples = sample_mean(2,'exp');
>> kurt(samples)
ans =
3.1135
>> skewness( samples )
ans =
1.3566
>> samples = sample_mean(5,'exp');
>> kurt(samples)
ans =
1.1265
>> skewness( samples )
ans =
0.8995
>> samples = sample_mean(10,'exp');
>> kurt(samples)
ans =
0.6619
>> skewness( samples )
ans =
0.7450
>> samples = sample_mean(50,'exp');
>> kurt(samples)
ans =
0.1188
>> skewness( samples )
ans =
0.2686
N = 2 N = 5 N = 10 N = 50
SAMPLE MEAN
 Given a sequence of random variables (iid) X1,
…,Xn we define the sample mean as:
Xm = (X1 + …+Xn)/n
 As Xm is a function of random variables, it is
itself a random variable with an associated
probability mass/density function.
 For large n (n > 30) the probability density is
approximately normal because of the CLT.
 E[Xm] = E[Xi] = μ
 Var(Xm) = Var(Xi)/n = σ²/n
 SD(Xm) = σ/√n
 Xm ~ N(μ, σ²/n)
 The standard deviation/uncertainty around the
population mean shrinks as 1/√n (see the small
simulation sketch below).
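A quick simulation illustrating the 1/√n shrinkage; the normal source, σ = 2 and the sample sizes are arbitrary illustrative choices:
% Standard deviation of the sample mean shrinks as 1/sqrt(n).
sigma = 2;  R = 20000;
for n = [5 20 80]
    xm = mean(normrnd(0, sigma, n, R));       % R sample means, each from a size-n sample
    fprintf('n = %3d   sd(Xm) = %.4f   sigma/sqrt(n) = %.4f\n', ...
            n, std(xm), sigma/sqrt(n));
end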
SAMPLE MEAN
SAMPLE VARIANCE
 Xm ~ N(μ, σ²/n) ⇒ this random variable spreads
out around the population mean μ, so it is an
unbiased estimator.
 S² = 1/(n−1) Σ (xᵢ − Xm)²
(Cochran's theorem)
S² ~ σ²/(n−1) · χ²_{n-1}
 Probability
density distribution
of χ²_k = Σ_{i=1..k} Xᵢ², Xᵢ ~ N(0,1)
Given a uniform distribution on [−1, 1], its standard
deviation is σ = 1/√3 and its excess kurtosis is −6/5.
E[S²] = σ² = 1/3
Var(S²) = σ⁴ (2/(n−1) + Kurt/n), where Kurt is the excess kurtosis
SAMPLE VARIANCE
%Uniform on [-1, 1].
samples = unifrnd(-1,1,1,N);
mu = 0;
sigma = sqrt(1/3);
N = 4 N = 20 N = 200
SAMPLE VARIANCE, NUMERICAL
EXAMPLE
>> [smean svar] = sample_mean_var(4,'uni');
>> var(svar)
ans =
0.0410
>> ((1/3)^2)*(2/3 -6/20)
ans =
0.0407
>> [smean svar] = sample_mean_var(5,'uni');
>> var(svar)
ans =
0.0287
>> ((1/3)^2)*(2/4 -6/25)
ans =
0.0289
>> [smean svar] = sample_mean_var(10,'uni');
>> var(svar)
ans =
0.0113
>> ((1/3)^2)*(2/9 -6/50)
ans =
0.0114
[Figure: histograms of the sample variance S² for sample sizes N = 2, 3, 4, 5, 6]
II - INFERENCE
 Statistical inference is the process of making
conclusions from datasets arising from systems
affected by random variation.
 Inference makes propositions about populations,
using data drawn from the population of interest
via some form of sampling.
INFERENCE
population
sample
We’ll have to deal with
this partial
information!
SAMPLING CONSEQUENCES
 The goal behind inference is to determine
whether an observed effect, such as difference
between two means or the correlation between
two variables , could reasonably be attributed to
the randomness in selecting the sample.
 If not, we have evidence that the effect observed
in the sample reflects an effect that is present in
the population.
[Diagram: Population 1 and Population 2, each with a sample drawn from it; we only see the samples]
SAMPLING EXAMPLE
 Imagine we have the following population and we
take a sample of size 3 and compute the sample
mean:
 Remember sampling distribution (theoretical or
empirical)!!
Population: {1,3,6,7,2,0,-3,-5}  (population mean = 1.37)
Sample: {7,6,3}  (sample mean = 5.33)
So, we need something to tell us
whether 5.33 makes sense as a
population mean or not! We need
inference to answer this question.
CONFIDENCE INTERVAL
 It's an interval estimate of a population
parameter, e.g. the population mean.
 Instead of estimating the parameter by a single
value, an interval likely to include the parameter
is given.
 How likely the interval is to contain the
parameter is given by the confidence level.
 Increasing the confidence level widens the
confidence interval.
 The end points of the confidence interval are
called confidence limits; these end points are
random variables, and so is the interval itself.
CONFIDENCE INTERVAL
 The confidence level is usually 0.95 or 0.99. The
intuition behind this number is that if we
repeated the experiment/measurement 100 times,
in 95 of them the interval would contain the true
parameter value.
 Imagine we know the population's standard
deviation σ (from previous studies):
 Z = (Xm − μ)/(σ/√n) ~ N(0,1)
 We can select two symmetric points −z_{α/2} and z_{α/2} for
which P(−z_{α/2} ≤ Z ≤ z_{α/2}) = 1 − α
 P(Xm − z_{α/2} σ/√n ≤ μ ≤ Xm + z_{α/2} σ/√n) = 1 − α
 So the probability of μ belonging to the CI is 1 − α
 Random confidence interval: [Xm ± z_{α/2} σ/√n]
(A minimal sketch of this computation follows below.)
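A sketch of the z-interval computation in MATLAB; the generated sample and the known σ = 2 are illustrative assumptions, and norminv comes from the Statistics Toolbox already used in the deck:
% 95% CI for the mean with known population sigma.
x     = normrnd(10, 2, 50, 1);      % illustrative sample, sigma = 2 assumed known
alpha = 0.05;
z     = norminv(1 - alpha/2);       % z_{alpha/2}, about 1.96 for alpha = 0.05
ci    = mean(x) + [-1 1] * z * 2/sqrt(length(x));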
First, let's have a look at the
theory-oriented sampling
distribution and hence at
theoretical confidence
intervals!
FZ(X)-FZ(0)
CONFIDENCE INTERVAL: 1-FZ(X)
TABLE
There are different
tables to express the
same!!
CONFIDENCE INTERVAL
The horizontal line segments represent 100
realisations of a confidence interval for μ.
Acceptance
region
Critical region
CONFIDENCE INTERVAL WITH
KNOWN VARIANCE
 What does P(μ ∈ [Xm ± z_{α/2} σ/√n]) = 1 − α mean?
 We know that, by repetition, in 100(1 − α)% of the cases μ
will be in the calculated interval.
 In practice we only have one repetition of the
experiment and so a single confidence interval. This is
why we rely on our interval to contain the parameter, and it
will in 100(1 − α)% of the cases!
 The CI length is 2 z_{α/2} σ/√n, and our interest is
to make this interval as short as possible.
 The confidence interval length grows as the confidence
level grows and vice versa; however, we should keep it within
reasonable limits (95%, 99%).
CONFIDENCE INTERVAL WITH
KNOWN VARIANCE
 The estimation will be more accurate the smaller the
population variance is, which means that the
population is more homogeneous.
 The length of the interval decreases as we increase
the sample size.
CONFIDENCE INTERVAL WITH
UNKNOWN VARIANCE
 If we don't know the population variance, the
confidence interval can be estimated with the
sample variance S².
CI: [Xm ± t_{n-1,α/2} S/√n]
(A sketch of this computation follows below.)
Student's t
distribution
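A sketch of the corresponding computation with the sample standard deviation and the t quantile (tinv is a Statistics Toolbox function; the generated sample is illustrative):
% 95% CI for the mean with unknown population variance.
x     = normrnd(10, 2, 15, 1);      % illustrative small sample
n     = length(x);
alpha = 0.05;
t     = tinv(1 - alpha/2, n - 1);   % t_{n-1, alpha/2}
ci    = mean(x) + [-1 1] * t * std(x)/sqrt(n);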
BOOTSTRAP INTRODUCTION
 The revolution in computing is having a dramatic
influence on statistics
 Statistical study of very large and very complex data
sets becomes feasible (fMRI analysis).
 These methods, bootstrap confidence intervals
and permutation tests apply computing power to
relax some of the conditions needed for
traditional inference (normality).
 The main goal is to compute bias, variability and
confidence intervals for which theory doesn’t
provide closed form solutions (formulas).
 Closed form solutions are replaced by brute force
computing.
Now, let’s have a look at the
data-oriented sampling
distribution !!
BOOTSTRAP INTRODUCTION
 It’s a procedure in which computer simulation
through resampling data replaces mathematical
analysis.
 The big idea behind it: the sample is an estimate
of the population (if large enough), so treat the
sample as if it were the population itself.
Population
Sample
estimate
Our real truth about
the population!!
BOOTSTRAP INTRODUCTION
 So let’s treat the data as a proxy for the true
distribution !!
 In Bradley Efron words:
 “Bootstrapping requires very little in the way of
modeling, assumptions, or analysis, and can be
applied in an automatic way to any situation, no
matter how complicated”
 “An important theme is the substitution of raw
computing power for theoretical analysis”
BOOTSTRAP PROCEDURE
 The big idea: Statistical inference is based on the
sampling distributions of sample statistics (Ex:
sample mean). The bootstrap is first of all a way
of finding the sampling distribution from just
one sample. The bootstrap procedure/algorithm:
 Step 1: Resampling: A sampling distribution is based
on many random samples from the population. In
place of many samples from the population, create
many resamples by repeatedly sampling with
replacement from this random sample.
3.12 0.00 1.57 19.67 0.22 2.20
mean = 4.46
1.57 0.22 19.67 0.00 0.22 3.12
mean = 4.13
0.00 2.20 2.20 2.20 19.67 1.57
mean = 4.64
0.22 3.12 1.57 3.12 2.20 0.22
mean = 1.74
BOOTSTRAP PROCEDURE
 Step 2: Bootstrap distribution: The sampling
distribution of a statistic (a function of the data) collects the
values of the statistic from many samples. The
bootstrap distribution of a statistic collects its values
from many resamples. The bootstrap distribution
gives information about the sampling distribution:
Fboot ≈ Fsampling. The sampling distribution is the key object
for answering questions.
 Step 3: Repeat steps 1 and 2 many times.
 This basic procedure can be scripted in a few
lines with any high-level language of
your choice (Matlab, R, C++, …)!
BOOTSTRAP’S PICTURE FOR
SAMPLE MEAN
[Figure: histograms of the population, one sample, several resamples, the bootstrap sample-mean distribution, and the true sampling distribution of the sample mean]
BOOTSTRAP MATLAB SCRIPT (EASY
ONE)
%The bootstrap method calculates the bootstrap distribution
%of the statistic under study.
%
%sample : sample data array.
%statistic_handle : handle to the statistic function.
function boot_samples = bootstrap( sample, statistic_handle )
N = 10000;
boot_samples = zeros(1,N);
for j = 1:N
rsample = resample( sample );
boot_samples(j) = statistic_handle( rsample );
end
plot_samples( boot_samples );
end
%resampling the original sample with replacement
function rsample = resample( sample )
n = length( sample );
random_idx = unidrnd(n,1,n);
rsample = sample( random_idx );
end
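The confidence intervals reported in the following slides are percentile intervals; assuming boot_samples comes from the function above, they could be obtained with prctile (a sketch, not necessarily the deck's exact implementation):
% Percentile bootstrap confidence interval at level (1 - alpha).
alpha = 0.05;
ci = prctile( boot_samples, 100*[alpha/2, 1 - alpha/2] );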
BOOTSTRAP EXAMPLES
 There are some functions in the Matlab Statistics
Toolbox for bootstrapping: bootstrp, bootci.
>> which bootstrp
C:\Program Files\MATLAB\R2007a\toolbox\stats\bootstrp.m
>> which bootci
C:\Program Files\MATLAB\R2007a\toolbox\stats\bootci.m
 Let's do some examples:
 Bootstrapping a correlation coefficient.
 Bootstrapping the standard error of the mean.
 Bootstrapping two sample means.
 Bootstrapping the confidence interval of a
regression coefficient (the slope).
BOOTSTRAP EXAMPLE:
CORRELATION
 There are several datasets available in Matlab to work
with; the Statistics Toolbox includes many:
 acetylene.mat : Chemical reaction data with correlated predictors.
 arrhythmia.mat : Cardiac arrhythmia data from the UCI machine
learning repository.
 cities.mat : Quality of life ratings for U.S. metropolitan areas.
 lawdata.mat : Grade point average and LSAT scores from 15 law
schools.
 …
 We will work with lawdata.mat which has two
variables GPA (Grade Point Average) & LSAT (test
designed to measure skills that are considered
essential for success in law school).
 Are these two variables related in some way in a law
school??
BOOTSTRAP EXAMPLE:
CORRELATION
>> load lawdata
>> plot_hist(lsat,'b',10); plot_hist(gpa,'r',10);
>> mean(lsat)
ans =
600.2667
>> mean(gpa)
ans =
3.0947
[Figure: histograms of LSAT and GPA]
BOOTSTRAP EXAMPLE:
CORRELATION
>> load lawdata
>> plot_hist(lsat,'b',10); plot_hist(gpa,'r',10);
>> scatter(gpa,lsat,'filled');
>> [r p] = corrcoef(lsat,gpa)
r =
1.0000 0.7764
0.7764 1.0000
p =
1.0000 0.0007
0.0007 1.0000
>> bsamples = bootstrap({@(x,y) submat(corrcoef(x,y),1,'1',2,'2'), @stat_resample},95,gpa,lsat);
Confidence interval [0.4584 0.96127]
>> plot_samples( bsamples );
[Figure: histogram and normal QQ plot of the bootstrap correlation samples]
0.0007 < 0.05
0 is not in the CI ⇒ so these two
variables are positively correlated!
[Figure: scatterplot of GPA vs LSAT with the observed correlation marked, plus the marginal histograms]
 To estimate the P-value for a test of significance
we have to estimate the sampling distribution of
the statistic when the null hypothesis is true.
So we have to resample in a manner that is
consistent with H0.
BOOTSTRAP EXAMPLE:
CORRELATION
>> ci = bootci(10000,{@(x,y) submat(corrcoef(x,y),1,'1',2,'2'),lsat,gpa},'type','per')
ci =
0.4594
0.9606
With Matlab built-in functions (bootci) and the
same method (percentile) the result is almost the
same.
BOOTSTRAP EXAMPLE :
CORRELATION
 To resample in a manner that is consistent with
the null hypothesis we will merge groups &
randomly resample with replacement.
Control Treatment
3.12 0.00 1.57 19.67 20.22 18.20
Mc = 1.563 Mt= 19.36
Merge
3.12 0.00 1.57 19.67 20.22 18.20
Mc = 10.46
Resample consistent with H0
Control Treatment
20.22 18.20 0.00 1.57 3.12 19.67
Mc = 12.80 Mt = 8.12
Resample consistent with H0
Control Treatment
3.12 19.67 20.22 18.20 0.00 1.57
Mc = 14.33 Mt = 6.59
Resample consistent with H0
Control Treatment
3.12 20.22 1.57 18.20 0.00 19.67
Mc = 8.30 Mt = 12.62
To resample in a way that is consistent
with the null hypothesis:
imitate many repetitions of the
random assignment of subjects to
“treatment” and “control” groups.
BOOTSTRAP EXAMPLE :
CORRELATION
>> h0bsamples=bootstrap({@(x,y) submat(corrcoef(x,y),1,'1',2,'2'),
@stat_resample_asH0},95,gpa,lsat); >> plot_hist( h0bsamples, 'b' );
>> hold on
>> plot_hist( bsamples, 'r' );
[Figure: bootstrap distributions of the correlation under H0 and under H1, with the observed statistic marked]
Observed statistic
Sampling distrib | H0
Sampling distrib | H1
Now we know what the
distribution of our
statistic looks like in
both cases.
BOOTSTRAP EXAMPLE :
CORRELATION
 The desired P-value is then estimated as:
 P-value = #{t* > t*obs}/B , B = #iterations
 If P-value < α ⇒ Reject H0.
 If P-value > α ⇒ Accept H0.
 A simple Matlab script to get this value:
%P-value of the statistic given an H0 sampling distribution.
function [H0, p_value] = pvalue( bsamples, observed_stat, alpha )
p_value = sum( bsamples > observed_stat )/length(bsamples);
H0 = p_value > alpha;
end
>> [H0 p_value] = pvalue( h0bsamples, 0.776, 0.05 )
H0 =
0
p_value =
6.0000e-004
>> [r p] = corrcoef(lsat,gpa)
r =
1.0000 0.7764
0.7764 1.0000
So, we reject H0 and can say that
these two variables are statistically
correlated at a significance level of
0.05!
BOOTSTRAP EXAMPLE : STANDARD
ERROR OF THE MEAN (SEM)
 We will start from a known situation in which we
know from theory the value of the standard error
of the mean, and we will estimate the same value
by bootstrapping from a sample.
 Suppose we have a population following the model
~N(25, 4.5)
 We take a sample of size 5000 from this population.
[Figure: histogram of the 5000 sampled values]
>> samples = normrnd(25,4.5,5000,1);
>> plot_hist( samples );
>> mean(samples)
ans =
25.0568
>> std(samples)
ans =
4.4253
Sample mean estimate !
BOOTSTRAP EXAMPLE : SEM
 From theory we know that: SEM = σ/√n =
4.5/sqrt(5000) = 0.0636
 As we can see, the true value (25) belongs to the
confidence interval, as it should!
>> bsamples = bootstrap({@mean, @stat_resample},95,samples);
Confidence interval [24.9338 25.18].
>> plot_norm_samples( bsamples );
[Figure: histogram and normal QQ plot of the bootstrap sample-mean distribution]
BOOTSTRAP EXAMPLE : SEM
 Let's estimate the standard error of
the mean by calculating the sample standard
deviation of the bootstrap sampling distribution
of the sample mean statistic. OK, wait and
repeat that again in your head …
 It's not so different from the theoretical value:
0.0636! To sum up:
>> std( bsamples )
ans =
0.0631
>> 4.5/sqrt(5000)
ans =
0.0636
>> std(samples)/sqrt(5000)
ans =
0.0626
>> std( bsamples )
ans =
0.0631
Theoretical value
Traditional statistics
value.
Bootstrap value.
BOOTSTRAP EXAMPLE : TWO
SAMPLE MEANS
 Imagine that we have two different samples: The
hypothesis test that we have to answer is
whether these two samples are taken from the
same distribution or not.
 H0 : F = G
 H1 : F ≠ G
[Diagram: under H0 both samples come from a single population (F = G); under H1 Sample1 comes from population F and Sample2 from population G]
BOOTSTRAP EXAMPLE : TWO
SAMPLE MEANS
 If the distribution is parametric we can
formulate the test in terms of the parameters: F = F(x; θ1),
G = F(x; θ2) (e.g. θ is the expectation)
 H0 : θ1 = θ2
 H1 : θ1 ≠ θ2
[Figure: histograms of Sample1 and Sample2 in a non-effect scenario (overlapping) and in an effect scenario (shifted)]
BOOTSTRAP EXAMPLE : TWO
SAMPLE MEANS
>> csamples = normrnd( 25, 3, 50, 1);
>> tsamples = normrnd( 26, 3, 100, 1);
>> plot_hist( csamples );
>> plot_hist( tsamples, 'r' );
>> mean(tsamples) - mean(csamples)
ans =
1.0648
>> h0bsamples = bootstrap({@(x,y)(mean(x)-mean(y)), @stat_resample_asH0},tsamples, csamples);
>> bsamples = bootstrap({@(x,y)(mean(x)-mean(y)), @stat_resample},tsamples, csamples);
>> [h0 p_value] = pvalue( h0bsamples, mean(tsamples) - mean(csamples), 0.05 )
h0 =
0
p_value =
0.0261
>> [fiho,xiho] = ksdensity( h0bsamples, 'npoints',500 );
>> plot( xiho, fiho );
Is this difference significant?
[Figure: histograms of the two samples, and the bootstrap distribution of the sample mean difference under H0 with the observed difference 1.0648 marked]
The smaller the p-value, the
stronger the evidence that H0 is
false. In some texts the p-value is
called the Achieved Significance
Level (ASL): P( Tboot > tobs | H0)
As 0.026 < 0.05, we reject H0
BOOTSTRAP EXAMPLE :
REGRESSION COEFFICIENT
CONFIDENCE INTERVAL
 We will try to answer the following question:
 How many points on the LSAT test should someone
improve to add 1 point to the GPA value?
 How can we answer this question?
 With the slope coefficient of a regression fit!
 So we can write GPA as a linear function of LSAT
values: x = LSAT, y = GPA
 y = ax + b; we want y(x+c) − y(x) = 1 ⇒ a(x+c) + b − ax − b = 1 ⇒
 ac = 1 ⇒ c = 1/a ⇒ so the inverse of the slope of the
regression line is our number!
BOOTSTRAP EXAMPLE :
REGRESSION COEFFICIENT
CONFIDENCE INTERVAL
>> load lawdata;
>> x = [ones(size(lsat)) lsat];
>> y = gpa;
>> b = regress(y,x)
b =
0.3794
0.0045
>> yfit = x*b;
>> scatter(x(:,2),y);
>> hold on;
>> plot( x(:,2), yfit,'r');
>> plot( x(:,2), yfit,'or');
Design matrix
[Figure: scatterplot of GPA vs LSAT with the fitted regression line]
>> resid = y - yfit;
>> plot_hist( resid, 'b', 5 );
[Figure: histogram of the regression residuals]
Residuals; we will assume
white noise. We will
bootstrap these values into
the regression model!
BOOTSTRAP EXAMPLE :
REGRESSION COEFFICIENT
CONFIDENCE INTERVAL
>> bsamples = bootstrap({@(rsamples) submat(regress(yfit+rsamples,x),1,'2'), @stat_resample},95,resid);
Confidence interval [0.0026841 0.0064508].
>> mean(bsamples)
ans =
0.0045
>> 1/0.0045
ans =
222.2222
III – MULTIVARIATE METHODS
PCA
INTRODUCTION
 PCA is a mathematical method to find
interesting directions in data clouds.
 These interesting directions are called principal
components.
 Later on we will give an interpretation of
"interesting". For now, they are just somehow interesting.
PCA INTRODUCTION
 Given a data table with variables in columns and
observations in rows, the data cloud is the set of
points resulting from reading each row as a
vector.
[Figure: 3-D scatterplot of a data cloud]
VECTORS
 Do you remember what a vector is?
 Do you remember what the euclidean norm of a
vector is?
 Do you remember cartesian and polar
coordinates of a vector?
 Do you remember what the scalar product of
two vectors is?
VECTOR SPACE
 A vector space is the set of all vectors spanned
by the vector space basis.
 This means that any vector x in the vector
space can be expressed as a linear combination
of the basis elements (basis vectors), where the mk are
scalars and the uk are the basis elements (vectors).
 x = m1u1 + … + mnun
 To span means that you can generate any vector by
adding the basis vectors and multiplying them by
scalars.
VECTOR PROJECTION
 Imagine we have vectors p and v and want to
project p onto v:
 pv is the nearest point to p along the v direction.
 ⟨v, p⟩ = |v| |p| cos(θ)
 |pv| = |p| cos(θ)
 ⟨v, p⟩ = |v| |pv|  ⇒  |pv| = ⟨v, p⟩/|v|
 pv = |pv| v/|v|  ⇒  pv = (⟨v, p⟩/|v|) v/|v| = (⟨v, p⟩/⟨v, v⟩) v
VECTOR PROJECTION
[Figure: 3-D plot of p0, p1 and the projection p0_p1]
>> o = [0 0 0];
>> p0 = [2,1,2];
>> p1 = [3 -1 1];
>> vectarrow(o,p0);
>> hold on;
>> vectarrow(o,p1);
>> p0_p1 = dot(p0,p1)/dot(p1,p1) * p1;
>> vectarrow(o,p0_p1);
p0
p1
p0_p1
VECTOR BASIS
 Suppose we have a vector x expressed in two
different basis B1 & B2
 B1 = { u1 ,u2 ,u3 ,…, un } , B2 = { v1 ,v2 ,v3 ,…, vn }
 xB1 = (m1, m2, …, mn)
 xB2 = (n1, n2, …, nn)
 x = m1u1 + …+ mnun
 x = n1v1 + …+ nnvn
 We can write ui vectors in B1 as linear
combination of B2 vectors:
 u1 = a11 v1 + a12 v2 + ... + a1n vn
 u2 = a21 v1 + a22 v2 + ... + a2n vn
 …
VECTOR BASIS
 So:
 u1 in B2: (a11, a12, a13, …, a1n)
 un in B2: (an1, an2, an3, …, ann)
 We can substitute these new vectors in x:
 x = m1(a11 v1 + a12 v2 + ... + a1n vn) + …+ mn(an1 v1 + an2 v2
+ ... + ann vn)
 x = (m1a11 + m2a21 + ... + mnan1) v1 + …+ (m1a1n + m2a2n + ...
+ mnann) vn
 By inspection we can see that:
 n1 = m1a11 + m2a21 + ... + mnan1
 n2 = m1a12 + m2a22 + ... + mnan2
 …
VECTOR BASIS
 In matrix form:
 xB2 = A xB1
 But how do we calculate these aij numbers?
 By solving a linear system of equations.
 If the basis vectors are orthogonal, by projecting ui onto
the vj vector:
 aij = ⟨ui, vj⟩/⟨vj, vj⟩, where in 2-D ⟨ui, vj⟩ = uix vjx + uiy vjy
VECTOR BASIS: EASY EXAMPLE 1
 We have a basis B = {(2,1), (1,4)} and a vector
x = (5,6) in canonical coordinates. We want to express
x as a vector in the basis B.
 x = (5,6)  ⇒  xB = (a,b)
 (5,6) = a(2,1) + b(1,4)
5 = 2a + b   ⇒ (multiply by −4)  −20 = −8a − 4b
6 = a + 4b                         6 = a + 4b
Adding: −14 = −7a  ⇒  a = 2, b = 1
So, xB = (2,1)
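The same change of basis can be done numerically by solving a linear system with the basis vectors as columns (a small sketch):
% Coordinates of x in basis B = {(2,1), (1,4)}: solve B * xB = x.
B  = [2 1; 1 4];        % basis vectors as columns
x  = [5; 6];
xB = B \ x              % returns [2; 1], matching the hand computation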
VECTOR BASIS: EASY EXAMPLE 2
 We have a vector x = (2,3)
 B1 : { (1,0) , (0,1) }
 x = 2(1,0) + 3(0,1)
 B2 : { (-1,0) , (0,2) }
 x = -2 (-1,0) + 3/2 (0,2)  xB2 = (-2,3/2)
 B3 : { (2,1), (-1,2) }
 x = 7/5 (2,1) + 4/5 (-1,2)  xB3 = (7/5,4/5)
B1 B2 B3
EIGENVECTORS & EIGENVALUES
 Suppose A= , then A =
 So Av is a reflected vector around y axis.
-1 0
0 1
x1
x2
-x1
x2
x1
x2
-x1
x2 v =
EIGENVECTORS & EIGENVALUES
 We observe that:
A [x1; 0] = −1 · [x1; 0]
A [0; x2] = 1 · [0; x2]
 Thus, vectors on the coordinate axes are
mapped to vectors on the coordinate axes. For
those vectors there exists a scalar λ such that:
A [x1; x2] = λ [x1; x2]
The direction of the vector
remains invariant under the
transformation A!
EIGENVECTORS & EIGENVALUES
 Example:
Suppose A = [2 3; 2 1].
Then [3; 2] is an eigenvector corresponding to the
eigenvalue λ = 4:
[2 3; 2 1] [3; 2] = [12; 8] = 4 [3; 2]
What about [2 3; 2 1] [1; 3] = [11; 5]?
Is it an eigenvector?
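The claim can be checked numerically with eig (a quick sketch):
A = [2 3; 2 1];
[V, D] = eig(A)          % columns of V are eigenvectors, diag(D) the eigenvalues (4 and -1)
A * [3; 2] ./ [3; 2]     % both components give 4, so [3;2] is an eigenvector
A * [1; 3] ./ [1; 3]     % components differ (11 vs 5/3), so [1;3] is not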
EIGENVECTORS & EIGENVALUES
 To sum up with a picture:
Which one is the
eigenvector?
Eigenvectors are vectors
whose direction is
invariant under the
transformation matrix!
OBJECTIVE FUNCTION.
 A function associated with an optimization
problem which determines how good a solution is.
 PCA as an optimization problem:
 We want to get the direction of the projection that
maximizes the variance of the projected values, which
means getting the direction along which most of the
variance lies.
 w1 = arg max_w { var(w xᵀ) }, where w is a projection row
vector and x is the data matrix with each row as a
point. In 2-D this is the same as maximizing over the angle:
θ1 = arg max_θ { var( (cos θ, sin θ) xᵀ ) }
 w1 is the first principal component of our data cloud.
 This figure shows the projection of the data cloud
onto the w direction. The histogram of the projected
values is plotted in blue.
 Remember:
xᵀ: data cloud.
w: projection direction.
w xᵀ: projected cloud.
So, which is the best w?
 arg max_w { var(w xᵀ) }
DATA CLOUD PROJECTION
w
 Let's do the same with several projection vectors.
 We get different distributions for the projected data.
 In red you can see the
direction of maximum
variance:
w* = arg max_w { var(w xᵀ) }
 For the second component w2
we would look for the maximum of
the same function but
with a restriction: it must be orthogonal to w1.
DATA CLOUD PROJECTION
w*
PCA: ALGORITHM
 Remove mean from data.
 Calculate covariance matrix.
 SVD decomposition: A way to get eigenvalues &
eigenvectors of the covariance matrix.
 Select eigenvectors sorted by eigenvalue.
 Project to this new space to reduce
dimensionality.
 Project back to original space.
 Add mean to data.
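A compact sketch of these steps on a data matrix X with one observation per row; X (n-by-d) and k (the number of components to keep) are assumed inputs, and eig on the covariance matrix is used here, as the denoising function later in the deck also does:
% PCA dimensionality-reduction sketch: X is n-by-d, observations in rows.
mu   = mean(X, 1);
Xc   = X - repmat(mu, size(X,1), 1);           % remove the mean
C    = cov(Xc);                                % d-by-d covariance matrix
[V, D]     = eig(C);
[vals, idx] = sort(diag(D), 'descend');        % sort eigenvectors by eigenvalue
W    = V(:, idx(1:k));                         % keep the k leading directions
Xred = Xc * W;                                 % project into the reduced space
Xrec = Xred * W' + repmat(mu, size(X,1), 1);   % project back and restore the mean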
PCA GEOMETRY &
DIMENSIONALITY REDUCTION
PCA & EEG
 We will illustrate PCA by denoising EEG data. The
algorithm shown here could be improved, but
it's a naive approximation meant to help understand PCA.
 Each epoch/sweep can be seen as an N dimension
vector where N is the number of timepoints.
[Figure: a single EEG epoch plotted over time]
PCA & EEG
 All sweeps for my condition build a data cloud in
N-space.
[Figure: several EEG sweeps; together they form a data cloud in N-dimensional space]
PCA & EEG
 As now we’re dealing with vectors and not with
functions any more we can apply a PCA method
on them.
[Figure: a raw trial and its PCA-denoised version]
>> [trials timeline] = gettrials( 'c:\s2-b1.cnt', 2, 5, 0.30 );
>> plot(trials(15,:));
>> [dtrials F] = denoise_pca( trials );
>> plot(dtrials(15,:),'r');
PCA & EEG
 We can plot also some principal components of
this set of trials:
[Figure: three principal components of the set of trials]
PCA & EEG
function [depochs F] = denoise_pca( epochs )
dim = size(epochs);
%number of samples (epochs)
N = dim(1);
%number of variables (timepoints)
M = dim(2);
C = cov( epochs );
[V D] = eig( C );
cpower = diag(D);
%Reduce the dimensionality: keep the components holding 85% of the variance
tpower = sum(cpower) * 0.85;
cpower = cpower(end:-1:1);
cpower = cumsum(cpower);
numPC = sum( cpower <= tpower )
F = V(:,[M:-1:M-numPC+1]);
depochs = (F * F' * epochs')';
end
III – MULTIVARIATE METHODS
ICA
INTRODUCTION
ICA: If we listen to someone speaking about ICA,
what is it all about?
 It’s a mathematical model.
 It’s a set of algorithms (fastICA, infomax).
 ICA is a method to decompose a set of signals
into a set of statistical independent
components (time or space).
Mixing
matrix
x1 = a11 s1 + a12 s2
x2 = a21 s1 + a22 s2
INTRODUCTION
 In matrix form:
 X = A S
 It seems complex because we only know X, so
we'll need to make some assumptions about S.
Mixed signals
Source
signals
Mixing
matrix
INTRODUCTION : ICA IN EEG
WORDS
 From some EEG signals (xi) the problem is to
guess the mixing matrix A and components (si).
 The model assumes no delay and no distortion
through the environment.
[Figure: two EEG channels plotted over time]
INTRODUCTION
 To make blind source separation work we need
to be aware of some hints, restrictions and
pitfalls:
 The source signals need to be non-Gaussian! We'll
see why. At most one source can be Gaussian, because the
sum of two Gaussian random variables is another
Gaussian random variable.
 The source strength cannot be estimated because of the
ambiguity of the model, so we normalize to unit
variance:
 x1 = a11 s1 + a12 s2, so we can always write:
 x1 = (a11/λ) (λ s1) + a12 s2
 with numbers:
 10 = 2*3 + 1*4 ; 10 = 1*6 + 1*4 ; 10 = 3*2 + 1*4
INTRODUCTION
 The implicit hypothesis when we apply ICA is
that the source signals are mixed linearly:
x = As
Mixing matrix A:
A = [1 -2; -1 1]
Source signals s:
s = [4 -3 2; 9 4 1]
Mixed signals x:
x = As = [-14 -11 0; 5 7 -1]
 In the real world we will try to estimate the
unknown mixing matrix A knowing only x. (A quick
check of this forward model follows below.)
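The small numerical forward model above can be reproduced directly:
A = [1 -2; -1 1];
s = [4 -3 2; 9 4 1];
x = A * s               % gives [-14 -11 0; 5 7 -1], as stated above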
[Figure: the two source signals and the two mixed signals]
INDEPENDENCE &
UNCORRELATION
 But, what is independence ? & uncorrelation?
 Independence is the property that independent
events have in a probability framework. But what
are independent events? To answer this
question another mathematical concept must be
introduced: conditional probability.
 The conditional probability of an event is the probability
of that event given that another event has been observed:
p(A|B) = p(A,B)/p(B)
A and B are independent when p(A|B) = p(A), i.e. p(A,B) = p(A)p(B).
INDEPENDENCE &
UNCORRELATION
 Let's have a look at the following data cloud and
remember the following expressions, which hold when X, Y are
independent:
P(x|y) = P(x)
P(x,y) = P(x)P(y)
 This means that the probability distribution or
"shape" of x does not depend on the given y.
 The joint probability factorizes into the
marginal probabilities.
CONDITIONAL PROBABILITY
 Given two events we can define the
probability of B given that A has been observed,
written p(B|A).
 p(B|A) = p(A,B)/p(A)
 That is, the conditional
probability is the joint probability of
both events divided by the
probability of event A.
 In general p(B|A) ≠ p(A|B)
S
A
B
CONDITIONAL PROBABILITY
 Let's do a simple example with a die.
 S = {1,2,3,4,5,6} ; A = {1,4} ; B = {2,4,6} (even number)
 Every time we roll the die and the outcome 1 or 4
is observed, we say that the event A has
occurred.
 Every time we roll the die and the outcome 2, 4 or 6
is observed, we say that the event B has occurred.
 Assuming a fair die, p(i) = 1/6
 p(A) = 1/3
 p(B) = 1/2
 p(B|A) = p(B,A)/p(A) = (1/6)/(1/3) = 1/2
 p(A|B) = p(B,A)/p(B) = (1/6)/(1/2) = 1/3
INDEPENDENCE WITH 4 NUMBERS
 Let's do an exercise on statistical independence
with just 4 numbers!
 Given the following contingency table:
 Are x, y independent? We have to check that
p(x,y) = p(x)p(y) for all x, y.
 The very first check fails: p(x1,y1) ≠ p(x1)p(y1),
0.1 ≠ 0.12
(A tiny numeric check of the factorization follows below.)
P(x,y)    y1     y2     P(x)
x1        0.1    0.1    0.2
x2        0.5    0.3    0.8
P(y)      0.6    0.4
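A tiny sketch checking the factorization for this table:
Pxy   = [0.1 0.1; 0.5 0.3];               % joint probabilities p(x_i, y_j)
Px    = sum(Pxy, 2);                      % marginal p(x) = [0.2; 0.8]
Py    = sum(Pxy, 1);                      % marginal p(y) = [0.6 0.4]
indep = max(max(abs(Pxy - Px*Py))) < 1e-12   % false: p(x,y) ~= p(x)p(y)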
INDEPENDENCE WITH 4 NUMBERS
 Another way to think about it:
 p(y|x=x1) = {1/2, 1/2}
 p(y|x=x2) = {5/8, 3/8}
 p(y) = {6/10, 4/10}
 As we can see, p(y|x1) ≠ p(y|x2) ≠ p(y), so x and y are not
independent, because the probability distribution of y depends
on x.
 Any questions?
INDEPENDENCE, UNCORRELATION
& DATA CLOUDS
 Simple example of independent source signals
and the results of the mixing process:
 Imagine an independent random data cloud (in R²)
with uniform distributions for both variables.
[Figure: scatterplot of the two uniform variables with their marginal histograms]
%Create two uniformly distributed random samples
>> s = 99*rand(2,1000);
>> plot(s(1,:),s(2,:), '.');
INDEPENDENCE, UNCORRELATION
& DATA CLOUDS
 Bivariate scatterplot from two normal
independent distributions.
[Figure: scatterplot of the two independent normal variables with their marginal histograms]
%Create two normally distributed random samples
>> s = 99*randn(2,1000);
>> plot(s(1,:),s(2,:), '.');
Can you see any direction
along which the fitting
error gets minimized? All
possible directions get the
same error variance.
INDEPENDENCE, UNCORRELATION
& DATA CLOUDS
 Bivariate scatterplot from two exponential
independent distributions.
 All curves along y axis are scaled versions of the
same distribution ! : p(x,y) = p(y)p(x)
x
y
[Figure: scatterplot of the two independent exponential variables with their marginal histograms]
%Exponential indep. random samples
>> s = 99* exprnd(1,2,1000);
>> plot(s(1,:),s(2,:), '.');
x
y
INDEPENDENCE, UNCORRELATION
& DATA CLOUDS
 In the last figures, can you see any unique direction
along which you could draw a line?
This is because all of them are
independent ⇒ uncorrelated.
 The first signal in each plot has nothing to do
with the second one.
[Figure: the three independent scatterplots (uniform, normal, exponential) shown side by side]
INDEPENDENCE, UNCORRELATION
& DATA CLOUDS
 Let’s see a dependent joint probability
distribution scatterplot:
npoints = 1000;
s1 = 99*rand(1,npoints);
s2 = zeros(1,npoints);
%s2 values are dependent on s1 values!
for i=1:npoints
if(s1(i) > 45)
s2(i) = 99*normrnd(0,1);
else
s2(i) = 99*exprnd(1);
end
end
s = [s1; s2];
[Figure: scatterplot of the dependent pair (s1, s2) with marginal histograms]
>> corrcoef(s(1,:), s(2,:) )
ans =
1.0000 -0.3666
-0.3666 1.0000
Although it seems to fit a
linear model, the truth
behind the scatterplot is
that it does not.
INDEPENDENCE, UNCORRELATION
& DATA CLOUDS
 So, p(x,y) ≠ p(x)p(y), because the distribution of y
depends on x and vice versa.
 We get two different
shapes, exponential
and Gaussian, with
these two cuts:
p(y | x = 20) ~ Exponential
p(y | x = 80) ~ Normal
INDEPENDENCE, UNCORRELATION
& DATA CLOUDS
 Mix these two independent signals with a mixing
matrix, subtract the mean, and plot.
%Create and mix two uniformly distributed random samples
>> s = 99*rand(2,1000);
>> A = [0.54 -0.84;0.12 -0.27];
>> x = A*s;
>> x_m = repmat( mean(x,2), 1, npoints );
>> x = x - x_m;
[Figure: scatterplot of the mixed (and centred) data with marginal histograms]
Can you see any
interesting direction in
this data cloud?
INDEPENDENCE, UNCORRELATION
& DATA CLOUDS
 Some points to notice:
 A direction of maximum variance can now be seen
in the mixed data.
 The marginal histograms have changed: they are more
Gaussian! Three letters should come to your mind: CLT.
 We can see the mixing row vectors as the edges of the
mixed data cloud.
 The data have lost uncorrelatedness and independence.
[Figure: the original independent cloud and the mixed cloud side by side]
INDEPENDENCE, UNCORRELATION
& DATA CLOUDS
 CLT review: Imagine we have two dice and we
sum the outcome of them, we get the following
distribution:
>> dicesums = perm(1:6,1:6);
>> dicesums = reshape( dicesums, 1,prod(size(dicesums)));
>> [n c] = hist( dicesums,2:12 );
>> n = n ./sum(n);
>> stem(2:12,n);
[Figure: PMF of a single die next to the PMF of the sum of two dice]
The sum becomes more
Gaussian!
INDEPENDENCE, UNCORRELATION
& DATA CLOUDS
 What would happen if the rows of the mixing matrix
were orthogonal?
 As we can see, the data remain uncorrelated but
independence is lost.
[Figure: the independent cloud and the cloud mixed with an orthogonal matrix]
>> corrcoef(x')
ans =
1.0000 -0.0129
-0.0129 1.0000
INDEPENDENCE, UNCORRELATION
& DATA CLOUDS
 What we will do now is whiten, or sphere,
our data. The goals of this process are:
 Uncorrelate the data.
 Scale the variances of all variables to unit
variance.
[Figure: the same cloud at three stages: independent; correlated after mixing; uncorrelated with unit variance after whitening]
WHITENING OPERATION
 Whitening is the mathematical operation by
which data become uncorrelated with unit variance.
 This operation is also called sphering.
 So, after the transformation the covariance
matrix becomes the identity matrix.
WHITENING OPERATION
 We can see the whitening operation as a linear
transformation T that makes the data uncorrelated:
x′ = Tx,  E[x′x′ᵀ] = I  ⇒  E[T x xᵀ Tᵀ] = T E[x xᵀ] Tᵀ = I
⇒  T = E[x xᵀ]^(−1/2), where E[x xᵀ] = cov(x) when the data
are zero mean.
 The whitening operation is thus defined as:
x′ = E[x xᵀ]^(−1/2) x
WHITENING OPERATION
 Whitening operation on a 10 channel EEG data.
 Covariance matrix before and after whitening.
%Covariance images
>> addpath functionsprob
>> cnt = ldcntb('c:\s2-b1.cnt');
>> C = cov(cnt.dat);
>> imagesc( C );
>> wx = whiten(cnt.dat');
>> C = cov( wx' );
>> imagesc( C );
[Figure: 10×10 covariance matrix of the EEG channels before and after whitening]
function wx = whiten( x )
npoints = length(x);
x_m = repmat( mean(x,2), 1, npoints );
x = x - x_m;
C = cov(x');
wx = inv(sqrtm(C))*x;
end
INDEPENDENCE, UNCORRELATION
& DATA CLOUDS
 After the sphering/whitening procedure applied
to the mixed signals we are halfway to
getting the independent components. We just need
to rotate the data cloud. But rotate up to what
point? To the maximum of a certain cost
function; in the ICA case, non-Gaussianity!
 Minimizing Gaussianity = maximizing non-Gaussianity.
 The marginal probabilities will move away from the
normal distribution. We are walking backwards along
the CLT.
rotation
 Now, we will project the whitened data cloud in
many different directions and will calculate the
kurtosis of projected points.
[Figure: kurtosis of the projected points as a function of the projection angle]
INDEPENDENCE, UNCORRELATION
& DATA CLOUDS
%Kurtosis for different projection angles.
>> wx = whiten( x );
>> plot(wx(1,:),wx(2,:), '.');
>> f = costfunction( @kurt, wx );
>> plot(f)
[Figure: whitened data cloud and the kurtosis cost function, with maxima at roughly 45° and 135°]
DATA PROJECTION REVIEW
 Histogram based on the data points projected at
angle 0.
 So, we can calculate any statistic on this
distribution.
function cf = costfunction( f, data )
%1 degree in radians.
alpha = pi/180;
%Number of datapoints
npoints = length(data);
%Rotation matrix
R = [cos(alpha) -sin(alpha); sin(alpha) cos(alpha)];
cf = zeros(1,180);
%Initial projection vector
pvector = [1;0];
for i=1:180
%Projection vector gets rotated
pvector = R*pvector;
%Project data onto projection vector
pdata = dot( repmat(pvector,1,npoints) , data);
%Calculate statistic and store value
cf(i) = f( pdata );
end
end
INDEPENDENCE, UNCORRELATION
& DATA CLOUDS
[Figure: whitened data cloud with the histogram of the points projected at angle 0]
INDEPENDENCE, UNCORRELATION
& DATA CLOUDS
 If we repeat the procedure for a different sample
we get different shapes. One thing to notice is
that although the cost functions have different
shapes, the two maxima are located at the
same angles!
[Figure: kurtosis cost functions and data clouds for several different samples; the two maxima stay at the same angles]
 Let's bootstrap the whole process to get
insight into how it behaves for different resamples:
 500 kurtosis functions calculated from resamples, for
projection angles from 0 to 180 degrees.
INDEPENDENCE, UNCORRELATION,
DATA CLOUDS & BOOTSTRAP
[Figure: data cloud, plus the bootstrap mean and variance of the kurtosis curve as functions of the projection angle]
INDEPENDENCE, UNCORRELATION
& DATA CLOUDS
 With the same procedure we can check why
ICA doesn't like Gaussian data; the problem
arises when trying to calculate the rotation angle
for Gaussian data:
[Figure: Gaussian data clouds and their kurtosis curves, which show no clear maxima]
There’s no
structure in the
data
INDEPENDENCE,
UNCORRELATION , DATA CLOUDS &
BOOTSTRAP
 Bootstrap of the bivariate normal data cloud for
kurtosis:
[Figure: bivariate normal cloud, plus the bootstrap mean and variance of the kurtosis curve, which are essentially flat across angles]
INDEPENDENCE, UNCORRELATION
& DATA CLOUDS
 Whole process with exponential (s1) and
gaussian (s2) data:
[Figure: the whole process on exponential (s1) and Gaussian (s2) sources: mixing, whitening and rotation]
Cannot recover
sign and order!
ICA & PCA WITH IMAGES
 PCA is not able to recover independence:
NonGaussian
EXAMPLE WITH EEG SIGNALS
 Up to now we fixed the mixing matrix and source
signals. With the forward model (x = As) we got
the mixed signals. Now we face the inverse
problem: from two EEG signals the problem is to
guess the mixing matrix A.
 Let's have a look at the scatterplot.
[Figure: two raw EEG channels plotted over time]
%Channels 2 & 3
cnt = ldcntb( 'c:\s2-b1.cnt' );
s1 = cnt.dat(1:4000,2);
s2 = cnt.dat(1:4000,3);
s = [s1 s2]';
EXAMPLE WITH EEG SIGNALS
 It seems that these two channels are correlated.
 The next step would be to whiten the data.
[Figure: scatterplot of the two EEG channels with marginal histograms]
EXAMPLE WITH EEG SIGNALS
 Whitened signals
[Figure: scatterplot of the whitened EEG signals]
EXAMPLE WITH EEG SIGNALS
 Rotate to get the maximum value of kurtosis.
[Figure: scatterplot of the whitened signals after rotating to the maximum-kurtosis direction]
 
Statistik 1 5 distribusi probabilitas diskrit
Statistik 1 5 distribusi probabilitas diskritStatistik 1 5 distribusi probabilitas diskrit
Statistik 1 5 distribusi probabilitas diskritSelvin Hadi
 
Discrete Probability Distributions
Discrete Probability DistributionsDiscrete Probability Distributions
Discrete Probability Distributionsmandalina landy
 
Chapter 4 part2- Random Variables
Chapter 4 part2- Random VariablesChapter 4 part2- Random Variables
Chapter 4 part2- Random Variablesnszakir
 
Communication Theory - Random Process.pdf
Communication Theory - Random Process.pdfCommunication Theory - Random Process.pdf
Communication Theory - Random Process.pdfRajaSekaran923497
 
Point Estimate, Confidence Interval, Hypotesis tests
Point Estimate, Confidence Interval, Hypotesis testsPoint Estimate, Confidence Interval, Hypotesis tests
Point Estimate, Confidence Interval, Hypotesis testsUniversity of Salerno
 
Problem_Session_Notes
Problem_Session_NotesProblem_Session_Notes
Problem_Session_NotesLu Mao
 
Econometrics 2.pptx
Econometrics 2.pptxEconometrics 2.pptx
Econometrics 2.pptxfuad80
 
Discrete probability
Discrete probabilityDiscrete probability
Discrete probabilityRanjan Kumar
 
SAMPLING MEAN DEFINITION The term sampling mean .docx
SAMPLING MEAN DEFINITION The term sampling mean .docxSAMPLING MEAN DEFINITION The term sampling mean .docx
SAMPLING MEAN DEFINITION The term sampling mean .docxanhlodge
 
Probability distribution for Dummies
Probability distribution for DummiesProbability distribution for Dummies
Probability distribution for DummiesBalaji P
 

Similaire à Statistical Methods (20)

U unit7 ssb
U unit7 ssbU unit7 ssb
U unit7 ssb
 
S t a t i s t i c s
S t a t i s t i c sS t a t i s t i c s
S t a t i s t i c s
 
2 Review of Statistics. 2 Review of Statistics.
2 Review of Statistics. 2 Review of Statistics.2 Review of Statistics. 2 Review of Statistics.
2 Review of Statistics. 2 Review of Statistics.
 
Statistik 1 5 distribusi probabilitas diskrit
Statistik 1 5 distribusi probabilitas diskritStatistik 1 5 distribusi probabilitas diskrit
Statistik 1 5 distribusi probabilitas diskrit
 
Probability
ProbabilityProbability
Probability
 
Discrete Probability Distributions
Discrete Probability DistributionsDiscrete Probability Distributions
Discrete Probability Distributions
 
Chapter 4 part2- Random Variables
Chapter 4 part2- Random VariablesChapter 4 part2- Random Variables
Chapter 4 part2- Random Variables
 
Communication Theory - Random Process.pdf
Communication Theory - Random Process.pdfCommunication Theory - Random Process.pdf
Communication Theory - Random Process.pdf
 
Talk 3
Talk 3Talk 3
Talk 3
 
Basics of Statistics
Basics of StatisticsBasics of Statistics
Basics of Statistics
 
Point Estimate, Confidence Interval, Hypotesis tests
Point Estimate, Confidence Interval, Hypotesis testsPoint Estimate, Confidence Interval, Hypotesis tests
Point Estimate, Confidence Interval, Hypotesis tests
 
PTSP PPT.pdf
PTSP PPT.pdfPTSP PPT.pdf
PTSP PPT.pdf
 
Problem_Session_Notes
Problem_Session_NotesProblem_Session_Notes
Problem_Session_Notes
 
Makalah ukuran penyebaran
Makalah ukuran penyebaranMakalah ukuran penyebaran
Makalah ukuran penyebaran
 
Probability[1]
Probability[1]Probability[1]
Probability[1]
 
Econometrics 2.pptx
Econometrics 2.pptxEconometrics 2.pptx
Econometrics 2.pptx
 
Inorganic CHEMISTRY
Inorganic CHEMISTRYInorganic CHEMISTRY
Inorganic CHEMISTRY
 
Discrete probability
Discrete probabilityDiscrete probability
Discrete probability
 
SAMPLING MEAN DEFINITION The term sampling mean .docx
SAMPLING MEAN DEFINITION The term sampling mean .docxSAMPLING MEAN DEFINITION The term sampling mean .docx
SAMPLING MEAN DEFINITION The term sampling mean .docx
 
Probability distribution for Dummies
Probability distribution for DummiesProbability distribution for Dummies
Probability distribution for Dummies
 

Statistical Methods

  • 15. FUNCTION OF A RANDOM VARIABLE • A function of a random variable is another random variable: Y = g(X) • Because it is a random variable, it has a probability mass/density function. • If X is the random variable associated with rolling a die: • We could define the random variable Y = f(X) = 2X. The possible values of this new random variable are {2,4,6,8,10,12}, each with probability 1/6. What is PY(y)? • Another function we could define is Y = g(X) = {1 if x is even, 0 if x is odd}. What are SY and PY(y)?
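 A minimal Matlab sketch of this idea, assuming a fair die (this snippet is illustrative and not part of the original slides):
 % Fair die (assumption): sample space and pmf
 S  = 1:6;
 pX = ones(1,6)/6;
 % Y = f(X) = 2X: values 2,4,...,12, each inheriting probability 1/6
 Sy1 = 2*S;
 pY1 = pX;
 % Y = g(X) = 1 if X is even, 0 if X is odd
 Sy2 = [0 1];
 pY2 = [sum(pX(mod(S,2)==1)) sum(pX(mod(S,2)==0))]   % [0.5 0.5]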
  • 16. EXPECTATION AND VARIANCE OF A RANDOM VARIABLE • Given a random variable X, we define the expectation as the centre of mass of its probability mass/density function and the variance as the spread around this centre.
  • 17. MEAN & VARIANCE OF A RANDOM VARIABLE: E[X] & VAR[X]
 E[X] = μ = Σ x P(X = x)
 Var[X] = E[(X − μ)²] = Σ (x − μ)² P(X = x)
 Ex: X ~ bin(7, 0.4)
 function E = expectation( sample_space, pmf )
     E = sum( sample_space .* pmf );
 end
 function V = variance( sample_space, pmf )
     E = expectation( sample_space, pmf );
     V = sum( (sample_space - E).^2 .* pmf );
 end
 >> n = 7;
 >> p = 0.4;
 >> S = 0:n;
 >> pdf = binopdf(S, n, p);
 >> stem(S, pdf);
 >> expectation(S, pdf)
 ans = 2.8000
 >> variance(S, pdf)
 ans = 1.6800
  • 18. SKEWNESS & KURTOSIS • The skewness is a measure of the asymmetry of the probability distribution. • It is the third standardized moment about the mean.
  • 19. SKEWNESS & KURTOSIS • Kurtosis is a measure of the peakedness/normality of the random variable. It is the fourth standardized moment about the mean: μ4/σ⁴. • Sometimes it is defined as μ4/σ⁴ − 3, a correction that makes the kurtosis of the normal distribution equal to zero. • Distributions with 0 excess kurtosis are called mesokurtic. • Distributions with positive excess kurtosis are called leptokurtic. • Distributions with negative excess kurtosis are called platykurtic.
  • 21. SAMPLING DISTRIBUTION • A sampling distribution is the distribution of a given statistic (a function of random variables) based on a random sample of size n. • The sampling distribution depends on the underlying distribution of the population, the statistic being considered and the sample size used.
  • 22. SAMPLING DISTRIBUTION • For example: consider a normal population with mean μ and variance σ². Assume we take samples of size n from this population and calculate the arithmetic mean of each sample (the sample mean statistic). • Each sample will have its own average value, and the distribution of these averages is called the "sampling distribution of the sample mean". • This distribution will be normal, N(μ, σ²/n). • For other statistics and other populations the formulas are frequently more complicated, and sometimes a closed form does not even exist.
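 A small simulation sketch of this statement (the values of μ, σ and n are arbitrary illustrative choices, not from the slides):
 mu = 10; sigma = 2; n = 25;                 % assumed population parameters
 nrep = 10000;                               % number of samples drawn
 xm = zeros(1,nrep);
 for k = 1:nrep
     xm(k) = mean( normrnd(mu, sigma, n, 1) );   % sample mean of one sample of size n
 end
 mean(xm)    % close to mu
 var(xm)     % close to sigma^2/n = 0.16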
  • 23. CENTRAL LIMIT THEOREM • Let Sn be the sum of n i.i.d. random variables: Sn = X1 + … + Xn (Xi ~ ANY distribution), E[Sn] = nμ, Var[Sn] = nσ². • We define Zn as: Zn = (Sn − nμ)/(σ√n). • As n grows to infinity, Zn converges to N(0,1). • As n grows the excess kurtosis converges to 0. So, the more terms we add (larger n), the more normal the resulting random variable becomes.
  • 26. NUMERICAL DEMONSTRATION • As N grows the distribution converges to N(0,1).
 >> out = sample_mean( 2 );  >> kurt(out)   ans = -0.5745
 >> out = sample_mean( 5 );  >> kurt(out)   ans = -0.1728
 >> out = sample_mean( 10 ); >> kurt(out)   ans = -0.1104
 >> out = sample_mean( 20 ); >> kurt(out)   ans = -0.0802
 [Histograms shown for N = 2, N = 5, N = 20]
  • 27. NUMERICAL DEMONSTRATION • No matter the source distribution.
 >> samples = sample_mean(2,'exp');  >> kurt(samples) ans = 3.1135  >> skewness( samples ) ans = 1.3566
 >> samples = sample_mean(5,'exp');  >> kurt(samples) ans = 1.1265  >> skewness( samples ) ans = 0.8995
 >> samples = sample_mean(10,'exp'); >> kurt(samples) ans = 0.6619  >> skewness( samples ) ans = 0.7450
 >> samples = sample_mean(50,'exp'); >> kurt(samples) ans = 0.1188  >> skewness( samples ) ans = 0.2686
 [Histograms shown for N = 2, 5, 10, 50]
  • 28. SAMPLE MEAN • Given a sequence of i.i.d. random variables X1, …, Xn we define the sample mean as: Xm = (X1 + … + Xn)/n. • As Xm is a function of random variables, it is itself a random variable with an associated probability mass/density function. • For large n (n > 30) the probability density is approximately normal because of the CLT. • E[Xm] = E[Xi] = μ • Var(Xm) = Var(Xi)/n = σ²/n • SD(Xm) = σ/√n • Xm ~ N(μ, σ²/n)
  • 29. SAMPLE MEAN • The standard deviation (uncertainty) around the population mean is reduced by a factor of 1/√n.
  • 30. SAMPLE VARIANCE • Xm ~ N(μ, σ²/n) • This random variable spreads out around the population mean μ, so it is an unbiased estimator. • S² = 1/(n−1) Σ(xi − Xm)² • (Cochran's theorem) S² ~ σ²/(n−1) · χ²n−1 • χ²k is the distribution of Σi=1..k Xi², with Xi ~ N(0,1).
  • 31. SAMPLE VARIANCE • Given a uniform distribution on [−1, 1], its standard deviation is σ = 1/√3 and its excess kurtosis is −6/5.
 E[S²] = σ² = 1/3
 Var(S²) = σ⁴ (2/(n−1) + Kurt/n), where Kurt is the excess kurtosis μ4/σ⁴ − 3.
 %Uniform on -1 to 1.
 samples = unifrnd(-1,1,1,N);
 mu = 0;
 sigma = sqrt(1/3);
 [Histograms shown for N = 4, N = 20, N = 200]
  • 32. SAMPLE VARIANCE, NUMERICAL EXAMPLE
 >> [smean svar] = sample_mean_var(4,'uni');
 >> var(svar)                 ans = 0.0410
 >> ((1/3)^2)*(2/3 - 6/20)    ans = 0.0407
 >> [smean svar] = sample_mean_var(5,'uni');
 >> var(svar)                 ans = 0.0287
 >> ((1/3)^2)*(2/4 - 6/25)    ans = 0.0289
 >> [smean svar] = sample_mean_var(10,'uni');
 >> var(svar)                 ans = 0.0113
 >> ((1/3)^2)*(2/9 - 6/50)    ans = 0.0114
 [Histograms of the sample variance shown for N = 2 … 6]
  • 34. INFERENCE • Statistical inference is the process of drawing conclusions from datasets arising from systems affected by random variation. • Inference makes propositions about populations, using data drawn from the population of interest via some form of sampling. • We'll have to deal with this partial information!
  • 35. SAMPLING CONSEQUENCES • The goal behind inference is to determine whether an observed effect, such as a difference between two means or the correlation between two variables, could reasonably be attributed to the randomness introduced by selecting the sample. • If not, we have evidence that the effect observed in the sample reflects an effect that is present in the population.
  • 36. SAMPLING EXAMPLE • Imagine we have the following population, take a sample of size 3 and compute the sample mean: • Population: 1, 3, 6, 7, 2, 0, −3, −5 (mean 1.37); sample: 7, 6, 3 (sample mean 5.33). • So we need something to tell us whether 5.33 makes sense as a population mean or not! Answering this question is inference. • Remember the sampling distribution (theoretical or empirical)!
  • 37. CONFIDENCE INTERVAL • It is an interval estimate of a population parameter, e.g. the population mean. • Instead of estimating the parameter by a single value, an interval likely to include the parameter is given. • How likely the interval is to contain the parameter is given by the confidence level. • Increasing the confidence level will widen the confidence interval. • The end points of the confidence interval are referred to as confidence limits; these end points are random variables, so the interval is a random variable too.
  • 38. CONFIDENCE INTERVAL • The confidence level is usually 0.95 or 0.99. The intuition behind this number is that if we repeated the experiment/measurement 100 times, in 95 of them the interval would hold the true parameter value. • First, let's have a look at the theory-oriented sampling distribution and hence at theoretical confidence intervals! • Imagine we know the population's standard deviation σ (from previous studies): • Z = (Xm − μ)/(σ/√n) ~ N(0,1) • We can select two symmetrical points −zα/2 and zα/2 for which P(−zα/2 ≤ Z ≤ zα/2) = 1−α • P(Xm − zα/2 σ/√n ≤ μ ≤ Xm + zα/2 σ/√n) = 1−α • So the probability of μ belonging to the CI is 1−α • Random confidence interval: [Xm ± zα/2 σ/√n]
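 A sketch of this computation in Matlab, assuming σ is known and using norminv from the Statistics Toolbox (the sample values and parameters below are illustrative assumptions):
 alpha = 0.05;                       % 95% confidence level
 sigma = 4.5;                        % assumed known population standard deviation
 x     = normrnd(25, sigma, 50, 1);  % one illustrative sample
 n     = length(x);
 xm    = mean(x);
 z     = norminv(1 - alpha/2);       % z_{alpha/2}, about 1.96
 ci    = [xm - z*sigma/sqrt(n), xm + z*sigma/sqrt(n)]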
  • 40. CONFIDENCE INTERVAL: 1 − FZ(x) TABLE • There are different tables that express the same information!
  • 41. CONFIDENCE INTERVAL • The horizontal line segments represent 100 realisations of a confidence interval for μ. [The figure also marks the acceptance region and the critical region.]
  • 42. CONFIDENCE INTERVAL WITH KNOWN VARIANCE • What does P(μ ∈ [Xm ± zα/2 σ/√n]) = 1−α mean? • We know that, by repetition, in 100(1−α)% of the cases μ will be in the calculated interval. • In practice we only have one repetition of the experiment and so a single confidence interval. This is why we rely on our interval to hold the parameter: it will in 100(1−α)% of the cases! • The CI length is 2 zα/2 σ/√n, and our interest is to make this interval as short as possible. • The confidence interval length grows as the confidence level grows and vice versa; however, we should keep the level within reasonable limits (95%, 99%).
  • 43. CONFIDENCE INTERVAL WITH KNOWN VARIANCE • The estimate is more accurate when the population variance is smaller, which means that the population is more homogeneous. • The length of the interval decreases as we increase the sample size.
  • 44. CONFIDENCE INTERVAL WITH UNKNOWN VARIANCE • If we don't know the population variance, it can be estimated with the sample variance S², and the CI becomes: [Xm ± tn−1,α/2 S/√n], where tn−1,α/2 comes from Student's t distribution.
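 The same computation when σ is unknown, replacing zα/2 by the t quantile (tinv is a Statistics Toolbox function; the sample below is an illustrative assumption):
 alpha = 0.05;
 x  = normrnd(25, 4.5, 20, 1);       % illustrative sample, variance treated as unknown
 n  = length(x);
 xm = mean(x);
 S  = std(x);                        % sample standard deviation
 t  = tinv(1 - alpha/2, n-1);        % t_{n-1, alpha/2}
 ci = [xm - t*S/sqrt(n), xm + t*S/sqrt(n)]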
  • 45. BOOTSTRAP INTRODUCTION • The revolution in computing is having a dramatic influence on statistics. • Statistical study of very large and very complex data sets becomes feasible (fMRI analysis). • These methods, bootstrap confidence intervals and permutation tests, apply computing power to relax some of the conditions needed for traditional inference (normality). • The main goal is to compute bias, variability and confidence intervals for which theory doesn't provide closed-form solutions (formulas). • Closed-form solutions are replaced by brute-force computing. • Now, let's have a look at the data-oriented sampling distribution!
  • 46. BOOTSTRAP INTRODUCTION • It's a procedure in which computer simulation through resampling data replaces mathematical analysis. • The big idea behind it: the sample is an estimate of the population (if large enough), so treat the sample as if it were the population itself. • Population → sample → estimate: our working truth about the population!
  • 47. BOOTSTRAP INTRODUCTION • So let's treat the data as a proxy for the true distribution! • In Bradley Efron's words: • "Bootstrapping requires very little in the way of modeling, assumptions, or analysis, and can be applied in an automatic way to any situation, no matter how complicated." • "An important theme is the substitution of raw computing power for theoretical analysis."
  • 48. BOOTSTRAP PROCEDURE • The big idea: statistical inference is based on the sampling distributions of sample statistics (e.g. the sample mean). The bootstrap is first of all a way of finding the sampling distribution from just one sample. The bootstrap procedure/algorithm: • Step 1: Resampling. A sampling distribution is based on many random samples from the population. In place of many samples from the population, create many resamples by repeatedly sampling with replacement from this one random sample.
 Original sample: 3.12 0.00 1.57 19.67 0.22 2.20   (mean = 4.46)
 Resample 1:      1.57 0.22 19.67 0.00 0.22 3.12   (mean = 4.13)
 Resample 2:      0.00 2.20 2.20 2.20 19.67 1.57   (mean = 4.64)
 Resample 3:      0.22 3.12 1.57 3.12 2.20 0.22    (mean = 1.74)
  • 49. BOOTSTRAP PROCEDURE • Step 2: Bootstrap distribution. The sampling distribution of a statistic (a function of the data) collects the values of the statistic from many samples. The bootstrap distribution of a statistic collects its values from many resamples. The bootstrap distribution gives information about the sampling distribution: Fboot ≈ Fsampling. The sampling distribution is the key object for answering questions. • Step 3: Repeat steps 1 and 2 many times. • This basic procedure can be scripted in a few lines with any high-level language of your taste (Matlab, R, C++, …)!
  • 50. BOOTSTRAP'S PICTURE FOR THE SAMPLE MEAN • [Figure: population → sample → resamples; the bootstrap sample mean distribution approximates the sampling distribution of the sample mean.]
  • 51. BOOTSTRAP MATLAB SCRIPT (EASY ONE)
 %The bootstrap method calculates the bootstrap distribution
 %of the statistic under study.
 %
 %sample           : Sample data array.
 %statistic_handle : handle to the statistic function.
 function boot_samples = bootstrap( sample, statistic_handle )
     N = 10000;
     boot_samples = zeros(1,N);
     for j = 1:N
         rsample = resample( sample );
         boot_samples(j) = statistic_handle( rsample );
     end
     plot_samples( boot_samples );
 end
 %resampling the original sample with replacement
 function rsample = resample( sample )
     n = length( sample );
     random_idx = unidrnd(n,1,n);
     rsample = sample( random_idx );
 end
  • 52. BOOTSTRAP EXAMPLES • There are some functions in the Matlab Statistics Toolbox for bootstrapping: bootstrp, bootci.
 >> which bootstrp
 C:\Program Files\MATLAB\R2007a\toolbox\stats\bootstrp.m
 >> which bootci
 C:\Program Files\MATLAB\R2007a\toolbox\stats\bootci.m
 • Let's do some examples: • Bootstrapping a correlation coefficient. • Bootstrapping the standard error of the mean. • Bootstrapping two sample means. • Bootstrapping the confidence interval of a regression coefficient (the slope).
  • 53. BOOTSTRAP EXAMPLE: CORRELATION • There are several datasets available in Matlab to work with; the Statistics Toolbox includes many: • acetylene.mat: chemical reaction data with correlated predictors. • arrhythmia.mat: cardiac arrhythmia data from the UCI machine learning repository. • cities.mat: quality of life ratings for U.S. metropolitan areas. • lawdata.mat: grade point average and LSAT scores from 15 law schools. • … • We will work with lawdata.mat, which has two variables: GPA (Grade Point Average) and LSAT (a test designed to measure skills considered essential for success in law school). • Are these two variables related in some way in a law school?
  • 54. BOOTSTRAP EXAMPLE: CORRELATION
 >> load lawdata
 >> plot_hist(lsat,'b',10); plot_hist(gpa,'r',10);
 >> mean(lsat)   ans = 600.2667
 >> mean(gpa)    ans = 3.0947
 [Histograms of LSAT and GPA]
  • 55. BOOTSTRAP EXAMPLE: CORRELATION
 >> load lawdata
 >> plot_hist(lsat,'b',10); plot_hist(gpa,'r',10);
 >> scatter(gpa,lsat,'filled');
 >> [r p] = corrcoef(lsat,gpa)
 r = 1.0000  0.7764
     0.7764  1.0000
 p = 1.0000  0.0007
     0.0007  1.0000
 >> bsamples = bootstrap({@(x,y) submat(corrcoef(x,y),1,'1',2,'2'), @stat_resample},95,gpa,lsat);
 Confidence interval [0.4584 0.96127]
 >> plot_samples( bsamples );
 • 0.0007 < 0.05 and 0 is not in the CI, so these two variables are positively correlated!
 [Scatterplot, bootstrap histogram and QQ plot of the bootstrap samples versus the standard normal]
  • 56. BOOTSTRAP EXAMPLE: CORRELATION • To estimate the P-value for a test of significance we have to estimate the sampling distribution of the statistic when the null hypothesis is true. So we have to resample in a manner that is consistent with H0.
 >> ci = bootci(10000,{@(x,y) submat(corrcoef(x,y),1,'1',2,'2'),lsat,gpa},'type','per')
 ci = 0.4594
      0.9606
 • With Matlab built-in functions (bootci) and the same method (percentile) the result is almost the same.
  • 57. BOOTSTRAP EXAMPLE: CORRELATION • To resample in a manner that is consistent with the null hypothesis we merge the groups and randomly resample with replacement, imitating many repetitions of the random assignment of subjects to "treatment" and "control" groups.
 Control: 3.12 0.00 1.57 (Mc = 1.563)   Treatment: 19.67 20.22 18.20 (Mt = 19.36)
 Merged:  3.12 0.00 1.57 19.67 20.22 18.20 (mean = 10.46)
 Resample consistent with H0:  Control: 20.22 18.20 0.00 (Mc = 12.80)  Treatment: 1.57 3.12 19.67 (Mt = 8.12)
 Resample consistent with H0:  Control: 3.12 19.67 20.22 (Mc = 14.33)  Treatment: 18.20 0.00 1.57 (Mt = 6.59)
 Resample consistent with H0:  Control: 3.12 20.22 1.57 (Mc = 8.30)    Treatment: 18.20 0.00 19.67 (Mt = 12.62)
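 A possible sketch of one H0-consistent resample; the stat_resample_asH0 helper used in the next slide is not shown in the deck, so this is only one way it could be implemented:
 control   = [3.12 0.00 1.57];
 treatment = [19.67 20.22 18.20];
 merged    = [control treatment];
 n = length(merged);
 idx   = unidrnd(n, 1, n);       % resample the merged data with replacement
 rsamp = merged(idx);
 Mc = mean(rsamp(1:3));          % "control" mean under H0
 Mt = mean(rsamp(4:6));          % "treatment" mean under H0
 Mt - Mc                         % one H0-consistent value of the statistic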
  • 58. BOOTSTRAP EXAMPLE: CORRELATION
 >> h0bsamples = bootstrap({@(x,y) submat(corrcoef(x,y),1,'1',2,'2'), @stat_resample_asH0},95,gpa,lsat);
 >> plot_hist( h0bsamples, 'b' );
 >> hold on
 >> plot_hist( bsamples, 'r' );
 • Now we know what the distribution of our statistic looks like in both cases: the sampling distribution under H0 and under H1, with the observed statistic marked.
  • 59. BOOTSTRAP EXAMPLE: CORRELATION • The desired P-value is then estimated as: • P-value = #{t* > t*obs}/B, where B = number of iterations. • If P-value < α, reject H0. • If P-value > α, accept H0. • A simple Matlab script to get this value:
 %P-value of the statistic given a H0 sampling distribution.
 function [H0 p_value] = pvalue( bsamples, observed_stat, alpha )
     p_value = sum( bsamples > observed_stat )/length(bsamples);
     H0 = p_value > alpha;
 end
 >> [H0 p_value] = pvalue( h0bsamples, 0.776, 0.05 )
 H0 = 0
 p_value = 6.0000e-004
 >> [r p] = corrcoef(lsat,gpa)
 r = 1.0000  0.7764
     0.7764  1.0000
 • So we reject H0 and can say that these two variables are statistically correlated at the 0.05 significance level!
  • 60. BOOTSTRAP EXAMPLE: STANDARD ERROR OF THE MEAN (SEM) • We will start from a known situation in which we know the value of the standard error of the mean from theory, and we will estimate the same value by bootstrapping from a sample. • Suppose we have a population following the model ~N(25, 4.5). • We take 5000 samples from this population.
 >> samples = normrnd(25,4.5,5000,1);
 >> plot_hist( samples );
 >> mean(samples)   ans = 25.0568   (sample mean estimate!)
 >> std(samples)    ans = 4.4253
  • 61. BOOTSTRAP EXAMPLE: SEM • From theory we know that SEM = σ/√n = 4.5/sqrt(5000) = 0.0636.
 >> bsamples = bootstrap({@mean, @stat_resample},95,samples);
 Confidence interval [24.9338 25.18].
 >> plot_norm_samples( bsamples );
 • As we can see, the true value (25) belongs to the confidence interval, as it should!
  • 62. BOOTSTRAP EXAMPLE: SEM • Let's estimate the standard error of the mean by calculating the sample standard deviation of the bootstrap sampling distribution of the sample mean statistic. OK, wait and repeat that in your head …
 >> std( bsamples )   ans = 0.0631
 • It is not far from the theoretical value, 0.0636! To sum up:
 Theoretical value:             >> 4.5/sqrt(5000)           ans = 0.0636
 Traditional statistics value:  >> std(samples)/sqrt(5000)  ans = 0.0626
 Bootstrap value:               >> std( bsamples )          ans = 0.0631
  • 63. BOOTSTRAP EXAMPLE: TWO SAMPLE MEANS • Imagine that we have two different samples. The hypothesis test we have to answer is whether these two samples are drawn from the same distribution or not. • H0: F = G • H1: F ≠ G • [Figure: one population F = G producing both samples vs. two populations F and G, one per sample.]
  • 64. BOOTSTRAP EXAMPLE: TWO SAMPLE MEANS • If the distribution is parametric we can formulate the test in terms of the parameters: F = F(x;θ1), G = F(x;θ2) (e.g. θ is the expectation). • H0: θ1 = θ2 • H1: θ1 ≠ θ2 • [Figure: non-effect scenario vs. effect scenario for Sample1 and Sample2.]
  • 65. BOOTSTRAP EXAMPLE: TWO SAMPLE MEANS
 >> csamples = normrnd( 25, 3, 50, 1);
 >> tsamples = normrnd( 26, 3, 100, 1);
 >> plot_hist( csamples );
 >> plot_hist( tsamples, 'r' );
 >> mean(tsamples) - mean(csamples)   ans = 1.0648   (Is this difference significant?)
 >> h0bsamples = bootstrap({@(x,y)(mean(x)-mean(y)), @stat_resample_asH0},tsamples, csamples);
 >> bsamples = bootstrap({@(x,y)(mean(x)-mean(y)), @stat_resample},tsamples, csamples);
 >> [h0 p_value] = pvalue( h0bsamples, mean(tsamples) - mean(csamples), 0.05 )
 h0 = 0
 p_value = 0.0261
 >> [fiho,xiho] = ksdensity( h0bsamples, 'npoints',500 );
 >> plot( xiho, fiho );
 • The smaller the p-value, the stronger the evidence that H0 is false. In some texts the p-value is called the Achieved Significance Level (ASL): P( Tboot > tobs | H0 ). As 0.026 < 0.05, we reject H0.
  • 66. BOOTSTRAP EXAMPLE: REGRESSION COEFFICIENT CONFIDENCE INTERVAL • We will try to answer the following question: by how many points should someone improve their LSAT score to add 1 point to the GPA value? • How can we answer this question? With the slope coefficient of a regression fit! • So we can write GPA as a linear function of LSAT values, with x: LSAT, y: GPA. • y = ax + b; requiring y(x+c) − y(x) = 1 gives a(x+c) + b − ax − b = ac = 1, so c = 1/a. • So the inverse of the slope of the regression line is our number!
  • 67. BOOTSTRAP EXAMPLE: REGRESSION COEFFICIENT CONFIDENCE INTERVAL
 >> load lawdata;
 >> x = [ones(size(lsat)) lsat];   %design matrix
 >> y = gpa;
 >> b = regress(y,x)
 b = 0.3794
     0.0045
 >> yfit = x*b;
 >> scatter(x(:,2),y);
 >> hold on;
 >> plot( x(:,2), yfit,'r');
 >> plot( x(:,2), yfit,'or');
 >> resid = y - yfit;
 >> plot_hist( resid, 'b', 5 );
 • Residuals: we will assume white noise. We will bootstrap these values into the regression model!
  • 68. BOOTSTRAP EXAMPLE: REGRESSION COEFFICIENT CONFIDENCE INTERVAL
 >> bsamples = bootstrap({@(rsamples) submat(regress(yfit+rsamples,x),1,'2'), @stat_resample},95,resid);
 Confidence interval [0.0026841 0.0064508].
 >> mean(bsamples)   ans = 0.0045
 >> 1/0.0045         ans = 222.2222
  • 69. III – MULTIVARIATE METHODS PCA
  • 70. INTRODUCTION • PCA is a mathematical method to find interesting directions in data clouds. • These interesting directions are called principal components. • Later on we will give an interpretation of "interesting"; for now they are just somehow interesting.
  • 71. PCA INTRODUCTION • Given a data table with variables in columns and observations in rows, the data cloud is the set of points resulting from reading each row as a vector.
  • 72. VECTORS • Do you remember what a vector is? • Do you remember what the Euclidean norm of a vector is? • Do you remember the Cartesian and polar coordinates of a vector? • Do you remember what the scalar product between two vectors is?
  • 73. VECTOR SPACE • A vector space is the whole set of vectors spanned by the vector space basis. • This means that any vector x in the vector space can be expressed as a linear combination of basis elements (basis vectors), where the mk are scalars and the uk are the basis elements (vectors): • x = m1u1 + … + mnun • To span means that any vector can be generated by scaling the basis vectors and adding them.
  • 74. VECTOR PROJECTION • Imagine we have vectors p and v and want to project p onto v: • pv is the nearest point to p along the v direction. • <v,p> = |v| |p| cos(θ) • |pv| = |p| cos(θ) • <v,p> = |v| |pv|, so |pv| = <v,p>/|v| • pv = |pv| v/|v| • pv = (<v,p>/|v|) (v/|v|) = (<v,p>/<v,v>) v
  • 75. VECTOR PROJECTION
 >> o = [0 0 0];
 >> p0 = [2,1,2];
 >> p1 = [3 -1 1];
 >> vectarrow(o,p0);
 >> hold on;
 >> vectarrow(o,p1);
 >> p0_p1 = dot(p0,p1)/dot(p1,p1) * p1;   %projection of p0 onto p1
 >> vectarrow(o,p0_p1);
  • 76. VECTOR BASIS  Suppose we have a vector x expressed in two different basis B1 & B2  B1 = { u1 ,u2 ,u3 ,…, un } , B2 = { v1 ,v2 ,v3 ,…, vn }  xB1 = (m1, m2, …, mn)  xB2 = (n1, n2, …, nn)  x = m1u1 + …+ mnun  x = n1v1 + …+ nnvn  We can write ui vectors in B1 as linear combination of B2 vectors:  u1 = a11 v1 + a12 v2 + ... + a1n vn  u2 = a21 v1 + a22 v2 + ... + a2n vn  …
  • 77. VECTOR BASIS  So:  u1B2 = (a11, a12, a13, …, a1n)  unB2 = (an1, an2, an3, …, ann)  We can substitute these new vectors in x:  x = m1(a11 v1 + a12 v2 + ... + a1n vn) + …+ mn(an1 v1 + an2 v2 + ... + ann vn)  x = (m1a11 + m2a21 + ... + mnan1) v1 + …+ (m1a1n + m2a2n + ... + mnann) vn  By inspection we can see that:  n1 = m1a11 + m2a21 + ... + mnan1  n2 = m1a12 + m2a22 + ... + mnan2  …
  • 78. VECTOR BASIS • In matrix form: xB2 = A xB1 • But how do we calculate these aij numbers? By solving a linear system of equations. • If the basis vectors vj are orthogonal, projecting ui onto vj gives: aij = <ui, vj>/<vj, vj>, which for orthonormal vj reduces to <ui, vj> = uixvjx + uiyvjy (in 2-D).
  • 79. VECTOR BASIS: EASY EXAMPLE 1 • We have a basis B = {(2,1), (1,4)} and a vector x = (5,6) in canonical coordinates. We want to express x in the basis B. • x = (5,6), xB = (a,b) • (5,6) = a(2,1) + b(1,4)
 5 = 2a + b   (times −4:  −20 = −8a − 4b)
 6 = a + 4b
 Adding: −14 = −7a, so a = 2 and b = 1.
 So xB = (2,1).
  • 80. VECTOR BASIS: EASY EXAMPLE 2 • We have a vector x = (2,3). • B1: {(1,0), (0,1)}: x = 2(1,0) + 3(0,1) • B2: {(−1,0), (0,2)}: x = −2(−1,0) + 3/2 (0,2), so xB2 = (−2, 3/2) • B3: {(2,1), (−1,2)}: x = 7/5 (2,1) + 4/5 (−1,2), so xB3 = (7/5, 4/5)
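 The same change of basis can be checked numerically by solving the linear system; a minimal sketch using the basis from Easy Example 1:
 B  = [2 1; 1 4];    % columns are the basis vectors (2,1) and (1,4)
 x  = [5; 6];        % vector in canonical coordinates
 xB = B \ x          % coordinates of x in basis B -> [2; 1]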
  • 81. EIGENVECTORS & EIGENVALUES • Suppose A = [−1 0; 0 1] and v = (x1, x2). Then Av = (−x1, x2). • So Av is v reflected about the y axis.
  • 82. EIGENVECTORS & EIGENVALUES • We observe that: A(x1, 0) = −1·(x1, 0) and A(0, x2) = 1·(0, x2). • Thus, vectors on the coordinate axes are mapped to vectors on the coordinate axes. For those vectors there exists a scalar λ such that: Av = λv. • The direction of the vector remains invariant under the transformation A!
  • 83. EIGENVECTORS & EIGENVALUES • Example: suppose A = [2 3; 2 1]. Then (3, 2) is an eigenvector corresponding to eigenvalue λ = 4: • A(3, 2) = (2·3 + 3·2, 2·3 + 1·2) = (12, 8) = 4·(3, 2). • And (1, 3)? A(1, 3) = (11, 5). Is it an eigenvector?
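 This example can be verified with Matlab's eig (a quick illustrative check, not part of the original slides):
 A = [2 3; 2 1];
 [V, D] = eig(A);     % columns of V are eigenvectors, diag(D) the eigenvalues
 diag(D)              % eigenvalues: -1 and 4
 A*[3; 2] ./ [3; 2]   % both ratios equal 4, so (3,2) is an eigenvector
 A*[1; 3] ./ [1; 3]   % ratios differ (11 vs 5/3), so (1,3) is not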
  • 84. EIGENVECTORS & EIGENVALUES • To sum up with a picture: which one is the eigenvector? • Eigenvectors are vectors whose direction is invariant under the transformation matrix!
  • 85. OBJECTIVE FUNCTION • A function associated with an optimization problem which determines how good a solution is. • PCA as an optimization problem: we want the projection direction that maximizes the variance of the projected values, i.e. the direction along which most of the variance lies. • w1 = arg maxw { var(wxt) }, where w is a projection row vector and x is the data matrix with each row as a point. This is the same as maximizing over the angle: θ1 = arg maxθ { var( (cos θ, sin θ) xt ) } • w1 is the first principal component of our data cloud.
  • 86. DATA CLOUD PROJECTION • This figure shows the projection of the data cloud onto the direction w. The histogram of the projected values is plotted in blue. • Remember: xt is the data cloud, w the projection direction, wxt the projected cloud. So, which is the best w? • arg maxw { var(wxt) }
  • 87. DATA CLOUD PROJECTION • Let's do the same with several projection vectors. • We get different distributions for the projected data. • In red you can see the direction of maximum variance: w* = arg maxw { var(wxt) } • For the second component w2 we would maximize the same function subject to the restriction that w2 be orthogonal to w1.
  • 88. PCA: ALGORITHM • Remove the mean from the data. • Calculate the covariance matrix. • SVD decomposition: a way to get the eigenvalues & eigenvectors of the covariance matrix. • Select eigenvectors sorted by eigenvalue. • Project onto this new space to reduce dimensionality. • Project back to the original space. • Add the mean back to the data.
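 A minimal sketch of these steps on a 2-D data cloud, keeping only the first principal component; this mirrors the list above (using eig of the covariance matrix rather than SVD) and is not the exact denoise_pca function shown later:
 X  = randn(500,2) * [2 0; 1 0.3];          % illustrative correlated data cloud, rows = points
 mu = mean(X, 1);
 Xc = X - repmat(mu, size(X,1), 1);         % remove the mean
 C      = cov(Xc);                          % covariance matrix
 [V, D] = eig(C);                           % eigenvectors and eigenvalues
 [maxval, imax] = max(diag(D));             % index of the largest eigenvalue
 w      = V(:, imax);                       % first principal component
 scores = Xc * w;                           % project onto the reduced (1-D) space
 Xhat   = scores * w' + repmat(mu, size(X,1), 1);   % project back and add the mean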
  • 90. PCA & EEG • We will illustrate PCA by denoising EEG data. The algorithm shown here could be improved; it is a naive approximation meant to help understand PCA. • Each epoch/sweep can be seen as an N-dimensional vector, where N is the number of timepoints.
  • 91. PCA & EEG • All sweeps for my condition build a data cloud in N-space. [Figure: several sweeps forming the data cloud.]
  • 92. PCA & EEG • As we are now dealing with vectors and not with functions any more, we can apply a PCA method to them.
 >> [trials timeline] = gettrials( 'c:s2-b1.cnt', 2, 5, 0.30 );
 >> plot(trials(15,:));
 >> [dtrials F] = denoise_pca( trials );
 >> plot(dtrials(15,:),'r');
  • 93. PCA & EEG • We can also plot some principal components of this set of trials. [Figure: three principal components.]
  • 94. PCA & EEG
 function [depochs F] = denoise_pca( epochs )
     dim = size(epochs);
     %number of samples
     N = dim(1);
     %number of variables
     M = dim(2);
     C = cov( epochs );
     [V D] = eig( C );
     cpower = diag(D);
     %Reduce the dimensionality to numPC components (keep 85% of the power)
     tpower = sum(cpower) * 0.85;
     cpower = cpower(end:-1:1);
     cpower = cumsum(cpower);
     numPC = sum( cpower <= tpower )
     F = V(:,[M:-1:M-numPC+1]);
     depochs = (F * F' * epochs')';
 end
  • 95. III – MULTIVARIATE METHODS ICA
  • 96. INTRODUCTION • ICA: if we listen to someone speaking about ICA, what's it all about? • It's a mathematical model. • It's a set of algorithms (fastICA, infomax). • ICA is a method to decompose a set of signals into a set of statistically independent components (in time or space). • Mixing: x1 = a11 s1 + a12 s2, x2 = a21 s1 + a22 s2, with mixing matrix [a11 a12; a21 a22].
  • 97. INTRODUCTION • In matrix form: X = A S, where X holds the mixed signals, S the source signals and A is the mixing matrix. • It seems complex because we only know X, so we'll need to make some assumptions about S.
  • 98. INTRODUCTION: ICA IN EEG WORDS • From some EEG signals (xi), the problem is to guess the mixing matrix A and the components (si). • The model assumes no delay and no distortion through the environment.
  • 99. INTRODUCTION • To make blind source separation work we need to be aware of some hints, restrictions and pitfalls: • The source signals need to be non-Gaussian! We'll see why. At most one source can be Gaussian, because the sum of two Gaussian random variables is another Gaussian random variable. • The source strength cannot be estimated because of the ambiguity of the model, so sources are normalized to unit variance: • x1 = a11 s1 + a12 s2, but we can always write: x1 = (a11/λ)(λ s1) + a12 s2 • With numbers: 10 = 2·3 + 1·4; 10 = 1·6 + 1·4; 10 = 3·2 + 1·4
  • 100. INTRODUCTION • The implicit hypothesis when we apply ICA is that the source signals are mixed linearly: x = As.
 Mixing matrix A:   A = [1 -2; -1 1]
 Source signals s:  s = [4 -3 2; 9 4 1]
 Mixed signals x:   x = As = [-14 -11 0; 5 7 -1]
 • In the real world we will try to estimate the unknown mixing matrix A knowing only x.
  • 101. INDEPENDENCE & UNCORRELATION • But what is independence? And uncorrelation? • Independence is the property that independent events have in a probability framework. But… what are independent events? To answer this question another mathematical concept must be introduced: conditional probability. • The conditional probability of an event is the probability of that event given that another event has been observed: p(A|B) = p(A,B)/p(B). • A and B are independent when p(A|B) = p(A), i.e. when p(A,B) = p(A)p(B).
  • 102. INDEPENDENCE & UNCORRELATION • Let's have a look at the following data cloud and remember the expressions that hold when X and Y are independent: P(x|y) = P(x), P(x,y) = P(x)P(y). • This means that the probability distribution, or "shape", of x does not depend on the given y. • The joint probability factorizes into the marginal probabilities.
  • 103. CONDITIONAL PROBABILITY • Given two events, we can define the probability of B given that A has been observed, written p(B|A). • p(B|A) = p(A,B)/p(A) • That is, the conditional probability is the joint probability of both events divided by the probability of event A. • In general p(B|A) ≠ p(A|B).
  • 104. CONDITIONAL PROBABILITY • Let's do a simple example with a die. • S = {1,2,3,4,5,6}; A = {1,4}; B = {2,4,6} (even number) • Every time we roll the die and the outcome 1 or 4 is observed, we say that event A has occurred. • Every time we roll the die and the outcome 2, 4 or 6 is observed, we say that event B has occurred. • Assuming a fair die, p(i) = 1/6: • p(A) = 1/3 • p(B) = 1/2 • p(B|A) = p(B,A)/p(A) = (1/6)/(1/3) = 1/2 • p(A|B) = p(B,A)/p(B) = (1/6)/(1/2) = 1/3
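 A quick simulation check of these numbers (a sketch, using unidrnd from the Statistics Toolbox; not part of the original slides):
 N = 100000;
 rolls = unidrnd(6, 1, N);          % fair die rolls
 A = ismember(rolls, [1 4]);
 B = (mod(rolls, 2) == 0);          % even outcome
 pA = mean(A)                       % close to 1/3
 pB = mean(B)                       % close to 1/2
 pB_given_A = sum(A & B) / sum(A)   % close to 1/2
 pA_given_B = sum(A & B) / sum(B)   % close to 1/3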
  • 105. INDEPENDENCE WITH 4 NUMBERS • Let's do an exercise on statistical independence with just 4 numbers! • Given the following contingency table:
 P(x,y)   y1    y2   | P(x)
 x1       0.1   0.1  | 0.2
 x2       0.5   0.3  | 0.8
 P(y)     0.6   0.4  |
 • Are x and y independent? We have to check that p(x,y) = p(x)p(y) for all x, y. • The very first check fails: p(x1,y1) ≠ p(x1)p(y1), since 0.1 ≠ 0.12.
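 The same check written out in Matlab (a small sketch; the joint table is the one above):
 Pxy = [0.1 0.1; 0.5 0.3];   % joint table p(x,y), rows x1,x2, columns y1,y2
 Px  = sum(Pxy, 2);          % marginal p(x) = [0.2; 0.8]
 Py  = sum(Pxy, 1);          % marginal p(y) = [0.6 0.4]
 Px * Py                     % product of marginals: [0.12 0.08; 0.48 0.32]
 Pxy                         % joint: 0.1 ~= 0.12, so x and y are not independent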
  • 106. INDEPENDENCE WITH 4 NUMBERS • Another way to think about it: • p(y|x=x1) = {1/2, 1/2} • p(y|x=x2) = {5/8, 3/8} • p(y) = {6/10, 4/10} • As we can see, p(y|x1) ≠ p(y|x2) ≠ p(y), so x and y are not independent, because the probability distribution of y depends on x. • Any questions?
  • 107. INDEPENDENCE, UNCORRELATION & DATA CLOUDS • A simple example of independent source signals and the result of the mixing process: • Imagine an independent random data cloud (in R²) with a uniform distribution for both variables.
 %Create two random uniformly distributed samples
 >> s = 99*rand(2,1000);
 >> plot(s(1,:),s(2,:), '.');
  • 108. INDEPENDENCE, UNCORRELATION & DATA CLOUDS • Bivariate scatterplot from two independent normal distributions.
 %Create two random normally distributed samples
 >> s = 99*randn(2,1000);
 >> plot(s(1,:),s(2,:), '.');
 • Can you see any direction along which the fitting error gets minimized? All possible directions give the same error variance.
  • 109. INDEPENDENCE, UNCORRELATION & DATA CLOUDS • Bivariate scatterplot from two independent exponential distributions.
 %Exponential independent random samples
 >> s = 99*exprnd(1,2,1000);
 >> plot(s(1,:),s(2,:), '.');
 • All cuts along the y axis are scaled versions of the same distribution: p(x,y) = p(x)p(y).
  • 110. INDEPENDENCE, UNCORRELATION & DATA CLOUDS • In the last figures, can you see any unique direction along which you could draw a line? No, because all of them are independent, and therefore uncorrelated. • The first signal in each plot has nothing to do with the second one.
  • 111. INDEPENDENCE, UNCORRELATION & DATA CLOUDS • Let's see the scatterplot of a dependent joint probability distribution:
 s1 = 99*rand(1,npoints);
 %s2 values are dependent on s1 values!
 for i=1:npoints
     if(s1(i) > 45)
         s2(i) = 99*normrnd(0,1);
     else
         s2(i) = 99*exprnd(1);
     end
 end
 s = [s1; s2];
 >> corrcoef(s(1,:), s(2,:) )
 ans = 1.0000  -0.3666
      -0.3666   1.0000
 • Although it may seem to fit a linear model, the truth behind the scatterplot is that it does not.
  • 112. INDEPENDENCE, UNCORRELATION & DATA CLOUDS • So p(x,y) ≠ p(x)p(y), because the distribution of y depends on x and vice versa. • We get two different shapes, exponential and Gaussian, with these two cuts: P(x=20, y) ~ Exponential, P(x=80, y) ~ Normal.
  • 113. INDEPENDENCE, UNCORRELATION & DATA CLOUDS • Mix these two independent signals with a mixing matrix, subtract the mean and plot.
 %Create two mixed random uniformly distributed samples
 >> s = 99*rand(2,1000);
 >> A = [0.54 -0.84;0.12 -0.27];
 >> x = A*s;
 >> x_m = repmat( mean(x,2), 1, npoints );
 >> x = x - x_m;
 • Can you see any interesting direction in this data cloud?
  • 114. INDEPENDENCE, UNCORRELATION & DATA CLOUDS • Some points to notice: • A maximum-variance direction can be seen in the mixed data. • The marginal histograms have changed: they are more Gaussian! Three letters should come to mind: CLT. • We can see the mixing row vectors as the edges of the mixed data cloud. • The data lost uncorrelation and independence.
  • 115. INDEPENDENCE, UNCORRELATION & DATA CLOUDS • CLT review: imagine we have two dice and we sum their outcomes; we get the following distribution:
 >> dicesums = perm(1:6,1:6);
 >> dicesums = reshape( dicesums, 1,prod(size(dicesums)));
 >> [n c] = hist( dicesums,2:12 );
 >> n = n ./sum(n);
 >> stem(2:12,n);
 • The sum becomes more Gaussian!
  • 116. INDEPENDENCE, UNCORRELATION & DATA CLOUDS • What would happen if the rows of the mixing matrix were orthogonal? • As we can see, the data remains uncorrelated but independence is lost.
 >> corrcoef(x')
 ans = 1.0000  -0.0129
      -0.0129   1.0000
  • 117. INDEPENDENCE, UNCORRELATION & DATA CLOUDS • What we will do now is whiten, or sphere, our data. The goals of this process are: • Uncorrelate the data. • Scale the variances of all variables to unit variance. • [Figure: independent → mixing → correlated → whitening → uncorrelated & unit variance.]
  • 118. WHITENING OPERATION • Whitening is the mathematical operation by which the data becomes uncorrelated with unit variance. • This operation is also called sphering. • So, after the transformation the covariance matrix becomes the identity matrix.
  • 119. WHITENING OPERATION • We can see the whitening operation as a linear transformation T that makes the data uncorrelated: • x' = Tx, with E[x'x't] = I • E[Txxt Tt] = T E[xxt] Tt = I, so T = E[xxt]−1/2, where E[xxt] = cov(x) when the data is zero mean. • The whitening operation is defined to be: x' = E[xxt]−1/2 x
  • 120. WHITENING OPERATION • Whitening operation on 10-channel EEG data. • Covariance matrix before and after whitening.
 %Covariance images
 >> addpath functionsprob
 >> cnt = ldcntb('c:s2-b1.cnt');
 >> C = cov(cnt.dat);
 >> imagesc( C );
 >> wx = whiten(cnt.dat');
 >> C = cov( wx' );
 >> imagesc( C );
 function wx = whiten( x )
     npoints = length(x);
     x_m = repmat( mean(x,2), 1, npoints );
     x = x - x_m;
     C = cov(x');
     wx = inv(sqrtm(C))*x;
 end
  • 121. INDEPENDENCE, UNCORRELATION & DATA CLOUDS • After the sphering/whitening procedure applied to the mixed signals we are halfway to getting the independent components. We just need to rotate the data cloud. But… rotate up to what point? Until we reach the maximum of a certain cost function; in ICA's case, non-Gaussianity! • Minimizing Gaussianity = maximizing non-Gaussianity. • The marginal probabilities will move away from the normal distribution. We are walking the CLT backwards.
  • 122. INDEPENDENCE, UNCORRELATION & DATA CLOUDS • Now we will project the whitened data cloud onto many different directions and calculate the kurtosis of the projected points.
 %Kurtosis for different projection angles.
 >> wx = whiten( x );
 >> plot(wx(1,:),wx(2,:), '.');
 >> f = costfunction( @kurt, wx );
 >> plot(f)
 • [Figure: the kurtosis curve peaks at projection angles of about 45 and 135 degrees.]
  • 123. DATA PROJECTION REVIEW • Histogram based on the projected data points for projection angle 0. • So, we can calculate any statistic on this distribution.
  • 124. INDEPENDENCE, UNCORRELATION & DATA CLOUDS
 function cf = costfunction( f, data )
     %1 degree in radians.
     alpha = pi/180;
     %Number of datapoints
     npoints = length(data);
     %Rotation matrix
     R = [cos(alpha) -sin(alpha); sin(alpha) cos(alpha)];
     cf = zeros(1,180);
     %Initial projection vector
     pvector = [1;0];
     for i=1:180
         %Projection vector gets rotated
         pvector = R*pvector;
         %Project data onto projection vector
         pdata = dot( repmat(pvector,1,npoints) , data);
         %Calculate statistic and store value
         cf(i) = f( pdata );
     end
 end
  • 125. INDEPENDENCE, UNCORRELATION & DATA CLOUDS • If we repeat the procedure for a different sample we get different shapes. One thing to notice is that although the cost functions have different shapes, the two maxima are located at the same angles!
  • 126. INDEPENDENCE, UNCORRELATION, DATA CLOUDS & BOOTSTRAP • Let's bootstrap the whole process to get an insight into how it behaves for different resamples: • 500 kurtosis curves calculated from resamples, for projection angles from 0 to 180 degrees. [Figure: bootstrap mean and variance of the kurtosis curve.]
  • 127. INDEPENDENCE, UNCORRELATION & DATA CLOUDS • With the same procedure we can check why ICA doesn't like Gaussian data: the problem comes when trying to calculate the rotation angle for Gaussian data. There is no structure in the data.
  • 128. INDEPENDENCE, UNCORRELATION, DATA CLOUDS & BOOTSTRAP • Bootstrap of the kurtosis curve for the bivariate normal data cloud. [Figure: bootstrap mean and variance of the kurtosis curve.]
  • 129. INDEPENDENCE, UNCORRELATION & DATA CLOUDS • The whole process with exponential (s1) and Gaussian (s2) data: mixing → whitening → rotation. • The sign and order of the sources cannot be recovered!
  • 130. ICA & PCA WITH IMAGES • PCA is not able to recover independence. [Figure: non-Gaussian image sources.]
  • 131. EXAMPLE WITH EEG SIGNALS • Up to now we fixed the mixing matrix and the source signals, and with the forward model (x = As) we got the mixed signals. Now we face the inverse problem: from two EEG signals, the problem is to guess the mixing matrix A. • Let's have a look at the scatterplot.
 %Channels 2 & 3
 cnt = ldcntb( 'c:s2-b1.cnt' );
 s1 = cnt.dat(1:4000,2);
 s2 = cnt.dat(1:4000,3);
 s = [s1 s2]';
  • 132. EXAMPLE WITH EEG SIGNALS • It seems that these two channels are correlated. • The next step is to whiten the data.
  • 133. EXAMPLE WITH EEG SIGNALS • Whitened signals.
  • 134. EXAMPLE WITH EEG SIGNALS • Rotate to get the maximum value of kurtosis.