STATISTICAL METHODS
Probability review/Fundamentals
Enric Cecilla
Brainlab
STATISTICAL METHODS
 Fundamentals
 Basic probability review.
 Random variable.
 Function of a Random variable.
 Sampling distribution.
 Central Limit Theorem.
 Sample mean & Sample Variance
STATISTICAL METHODS
 Inference.
 Introduction
 Confidence interval
 with known variance
 with unknown variance
 Bootstrap
 Introduction
 Examples
STATISTICAL METHODS
 Multivariate methods (PCA).
 Vector basis.
 Orthogonal projection of a vector.
 Eigenvector & eigenvalue.
 Numerical approximation.
 Closed-form solution via covariance matrix.
 SVD decomposition.
 Applications
 Dimensionality reduction.
STATISTICAL METHODS
 Multivariate methods (ICA).
 Independence and uncorrelation
 Whitening
 ICA Example
I - FUNDAMENTALS
WHAT'S A RANDOM EXPERIMENT?
 A task that might lead to different outcomes.
 An elementary event is each of the possible
outcomes of an experiment. Ex:
 Tossing a coin has two elementary events: {head, tail}
 Tossing two coins has four: {hh, ht, th, tt}
 The set of all possible outcomes is called the
sample space. Ex:
 The sample space for rolling a die is:
S = {1,2,3,4,5,6}
 An event is any subset of the sample space.
 We define the event "odd number" in the experiment
of rolling a die:
Sodd = {1,3,5}
WHAT’S A PROBABILITY?
 Every sample space has its related probability
space.
 When we carry out an experiment several times,
the relative frequency of an event is the quotient
between the number of times the event occurs
and the total number of repetitions. Ex:
 We repeat the experiment "tossing a coin" 10 times
and this is the outcome: {t,t,t,h,t,h,h,t,h,t}
 Freq(t) = n(t)/N = 6/10 = 0.6
 Freq(h) = n(h)/N = 4/10 = 0.4
 The probability is the limit of the relative
frequency as N → ∞
PROBABILITY LIMIT IN A TOSSING
COIN EXP. (BERNOULLI EXP.)
 Plot of the relative frequency of the outcome
“head” in a coin toss experiment as a function of
the number of repetitions.
 Freq(head) = n(head)/N
PROBABILITY AXIOMS &
PROPERTIES
 Axioms:
 P(A) ≥ 0
 P(S) = 1
 Given A1,..,An mutually exclusive events:
 P( A1 U A2 U … U An) = P(A1) + … + P(An)
 Properties:
 P(ø) = 0
 P(Aᶜ) = 1 − P(A)
 If A ⊂ B ⇒ P(A) ≤ P(B)
 P(A U B) = P(A) + P(B) – P(A∩B)
RANDOM VARIABLE
 A rv is a variable whose value, a number, is a
function of the outcome of a random experiment.
Ex:
 For a coin toss, the possible events are heads or tails.
The number of heads appearing in one (fair) coin toss
can be described using the following random variable:
X = 1 if the outcome is head, 0 if tail
(head ↦ 1, tail ↦ 0)
RANDOM VARIABLE METAPHOR
 It's like a black box that gives you an unknown number
every time you ask for one.
 These numbers are somehow related! The frequency
of appearance, when the sequence is long enough, is
the probability density/mass function. The longer the
sequence is, the more accurate information we have
about the random variable.
R.V. 1,2,5,1,7,5,1,2,1,5,1,4
RANDOM VARIABLE
 We can classify random variables into two big
sets:
 Discrete random variable: the sample space is a
countable set, e.g. the integers.
 Continuous random variable: the possible values
lie in a range of real numbers.
PROBABILITY MASS/DENSITY
FUNCTION
 All outcomes of the random variable (sample
space) have an associated probability via the
probability mass/density function.
P( X = x )
FUNCTION OF A RANDOM
VARIABLE
 A function of a random variable is another
random variable : Y = g(X)
 Because it is a random variable it has a
probability mass/density function.
 If X is the random variable associated with
rolling a die:
 We could define the random variable Y = f(X) = 2X.
The possible values of this new random variable are
{2,4,6,8,10,12}, with probability 1/6 each. What is
PY(y)?
 Another function we could define is Y = g(X) = 1 if x
is even, 0 if x is odd. What are SY and PY(y)?
(A small sketch of how to tabulate these follows below.)
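A minimal sketch of how the induced PMF of Y = g(X) could be tabulated in MATLAB (the helper names below are only illustrative, and a fair die is assumed):
% Sketch: PMF of Y = g(X) for a fair die, X in {1,...,6} with P(X = x) = 1/6.
Sx    = 1:6;
pmf_x = ones(1,6)/6;
% Y = 2X: support {2,4,...,12}, each value inherits probability 1/6.
g     = @(x) 2*x;
Sy    = unique(g(Sx));
pmf_y = arrayfun(@(y) sum(pmf_x(g(Sx) == y)), Sy);            % collect the mass per value of y
% Y = 1 if x is even, 0 if x is odd: support {0,1}, each with probability 1/2.
h     = @(x) mod(x,2) == 0;
Sh    = unique(double(h(Sx)));
pmf_h = arrayfun(@(y) sum(pmf_x(double(h(Sx)) == y)), Sh);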
EXPECTATION AND VARIANCE OF A
RANDOM VARIABLE
 Given a random variable X, we define the
expectation as the centre of mass of its probability
mass/density function and the variance as the spread
around this centre.
MEAN & VARIANCE OF A RANDOM
VARIABLE
E[X] & VAR[X]
E[X] = μ = Σ x P(X = x)
Var[X] = E[(X − μ)²] = Σ (x − μ)² P(X = x)
 Ex: X ~ Bin(7, 0.4)
function E = expectation( sample_space, pmf )
E = sum( sample_space .* pmf );
end
function V = variance( sample_space, pmf )
E = expectation(sample_space,pmf);
V = sum((sample_space-E).^2 .* pmf);
end
>> n = 7;
>> p = 0.4;
>> S = 0:n;
>> pdf = binopdf(S, n, p);
>> stem(S,pdf);
>> expectation(S,pdf)
ans = 2.8000
>> variance(S,pdf)
ans =1.6800
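The hand-rolled helpers can be cross-checked against the closed-form binomial moments np and np(1 − p); binostat, from the same Statistics Toolbox as binopdf, should return the same 2.8 and 1.68:
>> [m, v] = binostat(7, 0.4)   % m = np = 2.8, v = np(1-p) = 1.68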
SKEWNESS & KURTOSIS
 The skewness is a measure of asymmetry of the
probability distribution.
 It is the third standardized moment around the
mean.
SKEWNESS & KURTOSIS
 Kurtosis is a measure of the normality/peakedness of the
random variable. It is the fourth standardized
moment around the mean: μ₄/σ⁴.
 Sometimes it is defined as μ₄/σ⁴ − 3, a correction
that makes the kurtosis of the normal distribution
equal to zero (excess kurtosis).
 Distributions with 0 excess kurtosis are called
mesokurtic.
 Distributions with positive excess kurtosis are
called leptokurtic.
 Distributions with negative excess kurtosis are
called platykurtic.
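The numerical demos that follow use a helper called kurt, which is not listed in the deck. A minimal sketch consistent with the excess-kurtosis values reported in the CLT demo (this is an assumption; the deck may compute it differently elsewhere):
% Sketch of sample skewness and excess kurtosis computed from raw samples.
function s = skew( x )
    m = mean(x);
    s = mean((x - m).^3) / std(x, 1)^3;       % third standardized moment
end
function k = kurt( x )
    m = mean(x);
    k = mean((x - m).^4) / std(x, 1)^4 - 3;   % fourth standardized moment, excess form
end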
KURTOSIS
SAMPLING DISTRIBUTION
 A sampling distribution is the distribution of a
given statistic (r.v. function) based on a random
sample of size n.
 The sampling distribution depends on the
underlying distribution of the population, the
statistic being considered, and the sample size
used.
SAMPLING DISTRIBUTION
 For example: consider a normal population with
mean μ and variance σ². Assume we take
samples of size n from this population and
calculate the arithmetic mean of each sample
(the sample mean statistic).
 Each sample will have its own average value, and
the distribution of these averages is called the
"sampling distribution of the sample mean".
 This distribution will be normal, N(μ, σ²/n).
 For other statistics and other populations the
formulas are frequently more complicated and
sometimes they don’t even exist in closed-form.
CENTRAL LIMIT THEOREM
 Let Sn be the sum of n i.i.d. random variables:
Sn = X1 + … + Xn (Xi ~ ANY distribution)
E[Sn] = nμ
Var[Sn] = nσ²
 We define Zn as:
Zn = (Sn − nμ)/(σ√n)
 As n grows to infinity, Zn converges to N(0,1).
 As n grows, the kurtosis converges to 0. So,
the more terms in the sum (n), the more normal
the random variable becomes.
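The demonstrations below call a helper sample_mean that is not shown in the deck. A plausible sketch, assuming a uniform[-1,1] default source, an 'exp' flag for the exponential case, and 10000 replications (all assumptions), and returning the raw sample means:
% Sketch: draw many samples of size n and return the sample mean of each one.
function out = sample_mean( n, dist )
    if nargin < 2, dist = 'uni'; end
    R   = 10000;                                 % number of replications (assumed)
    out = zeros(1, R);
    for j = 1:R
        switch dist
            case 'exp',  x = exprnd(1, 1, n);       % exponential source
            otherwise,   x = unifrnd(-1, 1, 1, n);  % uniform source on [-1, 1]
        end
        out(j) = mean(x);
    end
end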
CENTRAL LIMIT THEOREM
CENTRAL LIMIT THEOREM
NUMERICAL DEMONSTRATION
 As N grows the distribution converges to N(0,1).
>> out = sample_mean( 2);
>> kurt(out)
ans =
-0.5745
>> out = sample_mean( 5 );
>> kurt(out)
ans =
-0.1728
>> out = sample_mean( 10 );
>> kurt(out)
ans =
-0.1104
>> out = sample_mean( 20 );
>> kurt(out)
ans =
-0.0802
N = 2 N = 5 N = 20
No matter the source distribution.
NUMERICAL DEMONSTRATION
>> samples = sample_mean(2,'exp');
>> kurt(samples)
ans =
3.1135
>> skewness( samples )
ans =
1.3566
>> samples = sample_mean(5,'exp');
>> kurt(samples)
ans =
1.1265
>> skewness( samples )
ans =
0.8995
>> samples = sample_mean(10,'exp');
>> kurt(samples)
ans =
0.6619
>> skewness( samples )
ans =
0.7450
>> samples = sample_mean(50,'exp');
>> kurt(samples)
ans =
0.1188
>> skewness( samples )
ans =
0.2686
N = 2 N = 5 N = 10 N = 50
SAMPLE MEAN
 Given a sequence of random variables (iid) X1,
…,Xn we define the sample mean as:
Xm = (X1 + …+Xn)/n
 As Xm is a function of random variables, it is
itself a random variable with an associated
probability mass/density function.
 For large n (n > 30) the probability density is
approximately normal because of the CLT.
 E[Xm] = E[Xi] = μ
 Var(Xm) = Var(Xi)/n = σ²/n
 SD(Xm) = σ/√n
 Xm ~ N(μ, σ²/n)
 The standard deviation/uncertainty around the
population mean shrinks as 1/√n (see the small
simulation sketch below).
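A quick simulation illustrating the 1/√n shrinkage; the normal source, σ = 2 and the sample sizes are arbitrary illustrative choices:
% Standard deviation of the sample mean shrinks as 1/sqrt(n).
sigma = 2;  R = 20000;
for n = [5 20 80]
    xm = mean(normrnd(0, sigma, n, R));       % R sample means, each from a size-n sample
    fprintf('n = %3d   sd(Xm) = %.4f   sigma/sqrt(n) = %.4f\n', ...
            n, std(xm), sigma/sqrt(n));
end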
SAMPLE MEAN
SAMPLE VARIANCE
 Xm ~ N(μ, σ²/n) ⇒ this random variable spreads
out around the population mean μ, so it is an
unbiased estimator.
 S² = 1/(n−1) Σ (xᵢ − Xm)²
(Cochran's theorem)
S² ~ σ²/(n−1) · χ²_{n-1}
 Probability
density distribution
of χ²_k = Σ_{i=1..k} Xᵢ², Xᵢ ~ N(0,1)
Given a uniform distribution on [−1, 1], its standard
deviation is σ = 1/√3 and its excess kurtosis is −6/5.
E[S²] = σ² = 1/3
Var(S²) = σ⁴ (2/(n−1) + Kurt/n), where Kurt is the excess kurtosis
SAMPLE VARIANCE
%Uniform on [-1, 1].
samples = unifrnd(-1,1,1,N);
mu = 0;
sigma = sqrt(1/3);
N = 4 N = 20 N = 200
SAMPLE VARIANCE, NUMERICAL
EXAMPLE
>> [smean svar] = sample_mean_var(4,'uni');
>> var(svar)
ans =
0.0410
>> ((1/3)^2)*(2/3 -6/20)
ans =
0.0407
>> [smean svar] = sample_mean_var(5,'uni');
>> var(svar)
ans =
0.0287
>> ((1/3)^2)*(2/4 -6/25)
ans =
0.0289
>> [smean svar] = sample_mean_var(10,'uni');
>> var(svar)
ans =
0.0113
>> ((1/3)^2)*(2/9 -6/50)
ans =
0.0114
[Figure: histograms of the sample variance S² for sample sizes N = 2, 3, 4, 5, 6]
II - INFERENCE
 Statistical inference is the process of making
conclusions from datasets arising from systems
affected by random variation.
 Inference makes propositions about populations,
using data drawn from the population of interest
via some form of sampling.
INFERENCE
population
sample
We’ll have to deal with
this partial
information!
SAMPLING CONSEQUENCES
 The goal behind inference is to determine
whether an observed effect, such as difference
between two means or the correlation between
two variables , could reasonably be attributed to
the randomness in selecting the sample.
 If not, we have evidence that the effect observed
in the sample reflects an effect that is present in
the population.
[Diagram: Population 1 and Population 2, each with a sample drawn from it; we only see the samples]
SAMPLING EXAMPLE
 Imagine we have the following population and we
take a sample of size 3 and compute the sample
mean:
 Remember sampling distribution (theoretical or
empirical)!!
Population: {1,3,6,7,2,0,-3,-5}  (population mean = 1.37)
Sample: {7,6,3}  (sample mean = 5.33)
So, we need something to tell us
whether 5.33 makes sense as a
population mean or not! We need
inference to answer this question.
CONFIDENCE INTERVAL
 It's an interval estimate of a population
parameter, e.g. the population mean.
 Instead of estimating the parameter by a single
value, an interval likely to include the parameter
is given.
 How likely the interval is to contain the
parameter is given by the confidence level.
 Increasing the confidence level widens the
confidence interval.
 The end points of the confidence interval are
called confidence limits; these end points are
random variables, and so is the interval itself.
CONFIDENCE INTERVAL
 The confidence level is usually 0.95 or 0.99. The
intuition behind this number is that if we
repeated the experiment/measurement 100 times,
in 95 of them the interval would contain the true
parameter value.
 Imagine we know the population's standard
deviation σ (from previous studies):
 Z = (Xm − μ)/(σ/√n) ~ N(0,1)
 We can select two symmetric points −z_{α/2} and z_{α/2} for
which P(−z_{α/2} ≤ Z ≤ z_{α/2}) = 1 − α
 P(Xm − z_{α/2} σ/√n ≤ μ ≤ Xm + z_{α/2} σ/√n) = 1 − α
 So the probability of μ belonging to the CI is 1 − α
 Random confidence interval: [Xm ± z_{α/2} σ/√n]
(A minimal sketch of this computation follows below.)
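A sketch of the z-interval computation in MATLAB; the generated sample and the known σ = 2 are illustrative assumptions, and norminv comes from the Statistics Toolbox already used in the deck:
% 95% CI for the mean with known population sigma.
x     = normrnd(10, 2, 50, 1);      % illustrative sample, sigma = 2 assumed known
alpha = 0.05;
z     = norminv(1 - alpha/2);       % z_{alpha/2}, about 1.96 for alpha = 0.05
ci    = mean(x) + [-1 1] * z * 2/sqrt(length(x));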
First, let's have a look at the
theory-oriented sampling
distribution and hence at
theoretical confidence
intervals!
FZ(X)-FZ(0)
CONFIDENCE INTERVAL: 1-FZ(X)
TABLE
There are different
tables to express the
same!!
CONFIDENCE INTERVAL
The horizontal line segments represent 100
realisations of a confidence interval for μ.
Acceptance
region
Critical region
CONFIDENCE INTERVAL WITH
KNOWN VARIANCE
 What does P(μ ∈ [Xm ± z_{α/2} σ/√n]) = 1 − α mean?
 We know that, by repetition, in 100(1 − α)% of the cases μ
will be in the calculated interval.
 In practice we only have one repetition of the
experiment and so a single confidence interval. This is
why we rely on our interval to contain the parameter, and it
will in 100(1 − α)% of the cases!
 The CI length is 2 z_{α/2} σ/√n, and our interest is
to make this interval as short as possible.
 The confidence interval length grows as the confidence
level grows and vice versa; however, we should keep it within
reasonable limits (95%, 99%).
CONFIDENCE INTERVAL WITH
KNOWN VARIANCE
 The estimation will be more accurate the smaller the
population variance is, which means that the
population is more homogeneous.
 The length of the interval decreases as we increase
the sample size.
CONFIDENCE INTERVAL WITH
UNKNOWN VARIANCE
 If we don't know the population variance, the
confidence interval can be estimated with the
sample variance S².
CI: [Xm ± t_{n-1,α/2} S/√n]
(A sketch of this computation follows below.)
Student's t
distribution
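A sketch of the corresponding computation with the sample standard deviation and the t quantile (tinv is a Statistics Toolbox function; the generated sample is illustrative):
% 95% CI for the mean with unknown population variance.
x     = normrnd(10, 2, 15, 1);      % illustrative small sample
n     = length(x);
alpha = 0.05;
t     = tinv(1 - alpha/2, n - 1);   % t_{n-1, alpha/2}
ci    = mean(x) + [-1 1] * t * std(x)/sqrt(n);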
BOOTSTRAP INTRODUCTION
 The revolution in computing is having a dramatic
influence on statistics
 Statistical study of very large and very complex data
sets becomes feasible (fMRI analysis).
 These methods, bootstrap confidence intervals
and permutation tests apply computing power to
relax some of the conditions needed for
traditional inference (normality).
 The main goal is to compute bias, variability and
confidence intervals for which theory doesn’t
provide closed form solutions (formulas).
 Closed form solutions are replaced by brute force
computing.
Now, let’s have a look at the
data-oriented sampling
distribution !!
BOOTSTRAP INTRODUCTION
 It’s a procedure in which computer simulation
through resampling data replaces mathematical
analysis.
 The big idea behind it: the sample is an estimate
of the population (if large enough), so treat the
sample as if it were the population itself.
Population
Sample
estimate
Our real truth about
the population!!
BOOTSTRAP INTRODUCTION
 So let’s treat the data as a proxy for the true
distribution !!
 In Bradley Efron words:
 “Bootstrapping requires very little in the way of
modeling, assumptions, or analysis, and can be
applied in an automatic way to any situation, no
matter how complicated”
 “An important theme is the substitution of raw
computing power for theoretical analysis”
BOOTSTRAP PROCEDURE
 The big idea: Statistical inference is based on the
sampling distributions of sample statistics (Ex:
sample mean). The bootstrap is first of all a way
of finding the sampling distribution from just
one sample. The bootstrap procedure/algorithm:
 Step 1: Resampling: A sampling distribution is based
on many random samples from the population. In
place of many samples from the population, create
many resamples by repeatedly sampling with
replacement from this random sample.
3.12 0.00 1.57 19.67 0.22 2.20
mean = 4.46
1.57 0.22 19.67 0.00 0.22 3.12
mean = 4.13
0.00 2.20 2.20 2.20 19.67 1.57
mean = 4.64
0.22 3.12 1.57 3.12 2.20 0.22
mean = 1.74
BOOTSTRAP PROCEDURE
 Step 2: Bootstrap distribution: The sampling
distribution of a statistic (a function of the data) collects the
values of the statistic from many samples. The
bootstrap distribution of a statistic collects its values
from many resamples. The bootstrap distribution
gives information about the sampling distribution:
Fboot ≈ Fsampling. The sampling distribution is the key object
for answering questions.
 Step 3: Repeat steps 1 and 2 many times.
 This basic procedure can be scripted in a few
lines with any high-level language of
your choice (Matlab, R, C++, …)!
BOOTSTRAP’S PICTURE FOR
SAMPLE MEAN
[Figure: histograms of the population, one sample, several resamples, the bootstrap sample-mean distribution, and the true sampling distribution of the sample mean]
BOOTSTRAP MATLAB SCRIPT (EASY
ONE)
%The bootstrap method calculates the bootstrap distribution
%of the statistic under study.
%
%sample : sample data array.
%statistic_handle : handle to the statistic function.
function boot_samples = bootstrap( sample, statistic_handle )
N = 10000;
boot_samples = zeros(1,N);
for j = 1:N
rsample = resample( sample );
boot_samples(j) = statistic_handle( rsample );
end
plot_samples( boot_samples );
end
%resampling the original sample with replacement
function rsample = resample( sample )
n = length( sample );
random_idx = unidrnd(n,1,n);
rsample = sample( random_idx );
end
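The confidence intervals reported in the following slides are percentile intervals; assuming boot_samples comes from the function above, they could be obtained with prctile (a sketch, not necessarily the deck's exact implementation):
% Percentile bootstrap confidence interval at level (1 - alpha).
alpha = 0.05;
ci = prctile( boot_samples, 100*[alpha/2, 1 - alpha/2] );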
BOOTSTRAP EXAMPLES
 There are some functions in the Matlab Statistics
Toolbox for bootstrapping: bootstrp, bootci.
>> which bootstrp
C:\Program Files\MATLAB\R2007a\toolbox\stats\bootstrp.m
>> which bootci
C:\Program Files\MATLAB\R2007a\toolbox\stats\bootci.m
 Let's do some examples:
 Bootstrapping a correlation coefficient.
 Bootstrapping the standard error of the mean.
 Bootstrapping two sample means.
 Bootstrapping the confidence interval of a
regression coefficient (the slope).
BOOTSTRAP EXAMPLE:
CORRELATION
 There are several datasets available in Matlab to work
with; the Statistics Toolbox includes many:
 acetylene.mat : Chemical reaction data with correlated predictors.
 arrhythmia.mat : Cardiac arrhythmia data from the UCI machine
learning repository.
 cities.mat : Quality of life ratings for U.S. metropolitan areas.
 lawdata.mat : Grade point average and LSAT scores from 15 law
schools.
 …
 We will work with lawdata.mat which has two
variables GPA (Grade Point Average) & LSAT (test
designed to measure skills that are considered
essential for success in law school).
 Are these two variables related in some way in a law
school??
BOOTSTRAP EXAMPLE:
CORRELATION
>> load lawdata
>> plot_hist(lsat,'b',10); plot_hist(gpa,'r',10);
>> mean(lsat)
ans =
600.2667
>> mean(gpa)
ans =
3.0947
[Figure: histograms of LSAT and GPA]
BOOTSTRAP EXAMPLE:
CORRELATION
>> load lawdata
>> plot_hist(lsat,'b',10); plot_hist(gpa,'r',10);
>> scatter(gpa,lsat,'filled');
>> [r p] = corrcoef(lsat,gpa)
r =
1.0000 0.7764
0.7764 1.0000
p =
1.0000 0.0007
0.0007 1.0000
>> bsamples = bootstrap({@(x,y) submat(corrcoef(x,y),1,'1',2,'2'), @stat_resample},95,gpa,lsat);
Confidence interval [0.4584 0.96127]
>> plot_samples( bsamples );
[Figure: histogram and normal QQ plot of the bootstrap correlation samples]
0.0007 < 0.05
0 is not in the CI ⇒ so these two
variables are positively correlated!
[Figure: scatterplot of GPA vs LSAT with the observed correlation marked, plus the marginal histograms]
 To estimate the P-value for a test of significance
we have to estimate the sampling distribution of
the statistic when the null hypothesis is true.
So we have to resample in a manner that is
consistent with H0.
BOOTSTRAP EXAMPLE:
CORRELATION
>> ci = bootci(10000,{@(x,y) submat(corrcoef(x,y),1,'1',2,'2'),lsat,gpa},'type','per')
ci =
0.4594
0.9606
With Matlab built-in functions (bootci) and the
same method (percentile) the result is almost the
same.
BOOTSTRAP EXAMPLE :
CORRELATION
 To resample in a manner that is consistent with
the null hypothesis we will merge groups &
randomly resample with replacement.
Control Treatment
3.12 0.00 1.57 19.67 20.22 18.20
Mc = 1.563 Mt= 19.36
Merge
3.12 0.00 1.57 19.67 20.22 18.20
Mc = 10.46
Resample consistent with H0
Control Treatment
20.22 18.20 0.00 1.57 3.12 19.67
Mc = 12.80 Mt = 8.12
Resample consistent with H0
Control Treatment
3.12 19.67 20.22 18.20 0.00 1.57
Mc = 14.33 Mt = 6.59
Resample consistent with H0
Control Treatment
3.12 20.22 1.57 18.20 0.00 19.67
Mc = 8.30 Mt = 12.62
To resample in a way that is consistent
with the null hypothesis:
imitate many repetitions of the
random assignment of subjects to
“treatment” and “control” groups.
BOOTSTRAP EXAMPLE :
CORRELATION
>> h0bsamples=bootstrap({@(x,y) submat(corrcoef(x,y),1,'1',2,'2'),
@stat_resample_asH0},95,gpa,lsat); >> plot_hist( h0bsamples, 'b' );
>> hold on
>> plot_hist( bsamples, 'r' );
[Figure: bootstrap distributions of the correlation under H0 and under H1, with the observed statistic marked]
Observed statistic
Sampling distrib | H0
Sampling distrib | H1
Now we know what the
distribution of our
statistic looks like in
both cases.
BOOTSTRAP EXAMPLE :
CORRELATION
 The desired P-value is then estimated as:
 P-value = #{t* > t*obs}/B , B = #iterations
 If P-value < α ⇒ Reject H0.
 If P-value > α ⇒ Accept H0.
 A simple Matlab script to get this value:
%P-value of the statistic given an H0 sampling distribution.
function [H0, p_value] = pvalue( bsamples, observed_stat, alpha )
p_value = sum( bsamples > observed_stat )/length(bsamples);
H0 = p_value > alpha;
end
>> [H0 p_value] = pvalue( h0bsamples, 0.776, 0.05 )
H0 =
0
p_value =
6.0000e-004
>> [r p] = corrcoef(lsat,gpa)
r =
1.0000 0.7764
0.7764 1.0000
So, we reject H0 and can say that
these two variables are statistically
correlated at a significance level of
0.05!
BOOTSTRAP EXAMPLE : STANDARD
ERROR OF THE MEAN (SEM)
 We will start from a known situation in which we
know from theory the value of the standard error
of the mean, and we will estimate the same value
by bootstrapping from a sample.
 Suppose we have a population following the model
~N(25, 4.5)
 We take a sample of size 5000 from this population.
[Figure: histogram of the 5000 sampled values]
>> samples = normrnd(25,4.5,5000,1);
>> plot_hist( samples );
>> mean(samples)
ans =
25.0568
>> std(samples)
ans =
4.4253
Sample mean estimate !
BOOTSTRAP EXAMPLE : SEM
 From theory we know that: SEM = σ/√n =
4.5/sqrt(5000) = 0.0636
 As we can see, the true value (25) belongs to the
confidence interval, as it should!
>> bsamples = bootstrap({@mean, @stat_resample},95,samples);
Confidence interval [24.9338 25.18].
>> plot_norm_samples( bsamples );
[Figure: histogram and normal QQ plot of the bootstrap sample-mean distribution]
BOOTSTRAP EXAMPLE : SEM
 Let's estimate the standard error of
the mean by calculating the sample standard
deviation of the bootstrap sampling distribution
of the sample mean statistic. OK, wait and
repeat that again in your head …
 It's not so different from the theoretical value:
0.0636! To sum up:
>> std( bsamples )
ans =
0.0631
>> 4.5/sqrt(5000)
ans =
0.0636
>> std(samples)/sqrt(5000)
ans =
0.0626
>> std( bsamples )
ans =
0.0631
Theoretical value
Traditional statistics
value.
Bootstrap value.
BOOTSTRAP EXAMPLE : TWO
SAMPLE MEANS
 Imagine that we have two different samples: The
hypothesis test that we have to answer is
whether these two samples are taken from the
same distribution or not.
 H0 : F = G
 H1 : F ≠ G
[Diagram: under H0 both samples come from a single population (F = G); under H1 Sample1 comes from population F and Sample2 from population G]
BOOTSTRAP EXAMPLE : TWO
SAMPLE MEANS
 If the distribution is parametric we can
formulate the test in terms of the parameters: F = F(x; θ1),
G = F(x; θ2) (e.g. θ is the expectation)
 H0 : θ1 = θ2
 H1 : θ1 ≠ θ2
[Figure: histograms of Sample1 and Sample2 in a non-effect scenario (overlapping) and in an effect scenario (shifted)]
BOOTSTRAP EXAMPLE : TWO
SAMPLE MEANS
>> csamples = normrnd( 25, 3, 50, 1);
>> tsamples = normrnd( 26, 3, 100, 1);
>> plot_hist( csamples );
>> plot_hist( tsamples, 'r' );
>> mean(tsamples) - mean(csamples)
ans =
1.0648
>> h0bsamples = bootstrap({@(x,y)(mean(x)-mean(y)), @stat_resample_asH0},tsamples, csamples);
>> bsamples = bootstrap({@(x,y)(mean(x)-mean(y)), @stat_resample},tsamples, csamples);
>> [h0 p_value] = pvalue( h0bsamples, mean(tsamples) - mean(csamples), 0.05 )
h0 =
0
p_value =
0.0261
>> [fiho,xiho] = ksdensity( h0bsamples, 'npoints',500 );
>> plot( xiho, fiho );
Is this difference significant?
[Figure: histograms of the two samples, and the bootstrap distribution of the sample mean difference under H0 with the observed difference 1.0648 marked]
The smaller the p-value, the
stronger the evidence that H0 is
false. In some texts the p-value is
called the Achieved Significance
Level (ASL): P( Tboot > tobs | H0)
As 0.026 < 0.05, we reject H0
BOOTSTRAP EXAMPLE :
REGRESSION COEFFICIENT
CONFIDENCE INTERVAL
 We will try to answer the following question:
 How many points on the LSAT test should someone
improve to add 1 point to the GPA value?
 How can we answer this question?
 With the slope coefficient of a regression fit!
 So we can write GPA as a linear function of LSAT
values: x = LSAT, y = GPA
 y = ax + b; we want y(x+c) − y(x) = 1 ⇒ a(x+c) + b − ax − b = 1 ⇒
 ac = 1 ⇒ c = 1/a ⇒ so the inverse of the slope of the
regression line is our number!
BOOTSTRAP EXAMPLE :
REGRESSION COEFFICIENT
CONFIDENCE INTERVAL
>> load lawdata;
>> x = [ones(size(lsat)) lsat];
>> y = gpa;
>> b = regress(y,x)
b =
0.3794
0.0045
>> yfit = x*b;
>> scatter(x(:,2),y);
>> hold on;
>> plot( x(:,2), yfit,'r');
>> plot( x(:,2), yfit,'or');
Design matrix
[Figure: scatterplot of GPA vs LSAT with the fitted regression line]
>> resid = y - yfit;
>> plot_hist( resid, 'b', 5 );
[Figure: histogram of the regression residuals]
Residuals; we will assume
white noise. We will
bootstrap these values into
the regression model!
BOOTSTRAP EXAMPLE :
REGRESSION COEFFICIENT
CONFIDENCE INTERVAL
>> bsamples = bootstrap({@(rsamples) submat(regress(yfit+rsamples,x),1,'2'), @stat_resample},95,resid);
Confidence interval [0.0026841 0.0064508].
>> mean(bsamples)
ans =
0.0045
>> 1/0.0045
ans =
222.2222
III – MULTIVARIATE METHODS
PCA
INTRODUCTION
 PCA is a mathematical method to find
interesting directions in data clouds.
 These interesting directions are called principal
components.
 Later on we will give an interpretation of
"interesting". For now, they are just somehow interesting.
PCA INTRODUCTION
 Given a data table with variables in columns and
observations in rows, the data cloud is the set of
points resulting from reading each row as a
vector.
[Figure: 3-D scatterplot of a data cloud]
VECTORS
 Do you remember what a vector is?
 Do you remember what the euclidean norm of a
vector is?
 Do you remember cartesian and polar
coordinates of a vector?
 Do you remember what the scalar product of
two vectors is?
VECTOR SPACE
 A vector space is the set of all vectors spanned
by the vector space basis.
 This means that any vector x in the vector
space can be expressed as a linear combination
of the basis elements (basis vectors), where the mk are
scalars and the uk are the basis elements (vectors).
 x = m1u1 + … + mnun
 To span means that you can generate any vector by
adding the basis vectors and multiplying them by
scalars.
VECTOR PROJECTION
 Imagine we have vectors p and v and want to
project p onto v:
 pv is the nearest point to p along the v direction.
 ⟨v, p⟩ = |v| |p| cos(θ)
 |pv| = |p| cos(θ)
 ⟨v, p⟩ = |v| |pv|  ⇒  |pv| = ⟨v, p⟩/|v|
 pv = |pv| v/|v|  ⇒  pv = (⟨v, p⟩/|v|) v/|v| = (⟨v, p⟩/⟨v, v⟩) v
VECTOR PROJECTION
[Figure: 3-D plot of p0, p1 and the projection p0_p1]
>> o = [0 0 0];
>> p0 = [2,1,2];
>> p1 = [3 -1 1];
>> vectarrow(o,p0);
>> hold on;
>> vectarrow(o,p1);
>> p0_p1 = dot(p0,p1)/dot(p1,p1) * p1;
>> vectarrow(o,p0_p1);
p0
p1
p0_p1
VECTOR BASIS
 Suppose we have a vector x expressed in two
different basis B1 & B2
 B1 = { u1 ,u2 ,u3 ,…, un } , B2 = { v1 ,v2 ,v3 ,…, vn }
 xB1 = (m1, m2, …, mn)
 xB2 = (n1, n2, …, nn)
 x = m1u1 + …+ mnun
 x = n1v1 + …+ nnvn
 We can write ui vectors in B1 as linear
combination of B2 vectors:
 u1 = a11 v1 + a12 v2 + ... + a1n vn
 u2 = a21 v1 + a22 v2 + ... + a2n vn
 …
VECTOR BASIS
 So:
 u1 in B2: (a11, a12, a13, …, a1n)
 un in B2: (an1, an2, an3, …, ann)
 We can substitute these new vectors in x:
 x = m1(a11 v1 + a12 v2 + ... + a1n vn) + …+ mn(an1 v1 + an2 v2
+ ... + ann vn)
 x = (m1a11 + m2a21 + ... + mnan1) v1 + …+ (m1a1n + m2a2n + ...
+ mnann) vn
 By inspection we can see that:
 n1 = m1a11 + m2a21 + ... + mnan1
 n2 = m1a12 + m2a22 + ... + mnan2
 …
VECTOR BASIS
 In matrix form:
 xB2 = A xB1
 But how do we calculate these aij numbers?
 By solving a linear system of equations.
 If the basis vectors are orthogonal, by projecting ui onto
the vj vector:
 aij = ⟨ui, vj⟩/⟨vj, vj⟩, where in 2-D ⟨ui, vj⟩ = uix vjx + uiy vjy
VECTOR BASIS: EASY EXAMPLE 1
 We have a basis B = {(2,1), (1,4)} and a vector
x = (5,6) in canonical coordinates. We want to express
x as a vector in the basis B.
 x = (5,6)  ⇒  xB = (a,b)
 (5,6) = a(2,1) + b(1,4)
5 = 2a + b   ⇒ (multiply by −4)  −20 = −8a − 4b
6 = a + 4b                         6 = a + 4b
Adding: −14 = −7a  ⇒  a = 2, b = 1
So, xB = (2,1)
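The same change of basis can be done numerically by solving a linear system with the basis vectors as columns (a small sketch):
% Coordinates of x in basis B = {(2,1), (1,4)}: solve B * xB = x.
B  = [2 1; 1 4];        % basis vectors as columns
x  = [5; 6];
xB = B \ x              % returns [2; 1], matching the hand computation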
VECTOR BASIS: EASY EXAMPLE 2
 We have a vector x = (2,3)
 B1 : { (1,0) , (0,1) }
 x = 2(1,0) + 3(0,1)
 B2 : { (-1,0) , (0,2) }
 x = -2 (-1,0) + 3/2 (0,2)  xB2 = (-2,3/2)
 B3 : { (2,1), (-1,2) }
 x = 7/5 (2,1) + 4/5 (-1,2)  xB3 = (7/5,4/5)
B1 B2 B3
EIGENVECTORS & EIGENVALUES
 Suppose A= , then A =
 So Av is a reflected vector around y axis.
-1 0
0 1
x1
x2
-x1
x2
x1
x2
-x1
x2 v =
EIGENVECTORS & EIGENVALUES
 We observe that:
A [x1; 0] = −1 · [x1; 0]
A [0; x2] = 1 · [0; x2]
 Thus, vectors on the coordinate axes are
mapped to vectors on the coordinate axes. For
those vectors there exists a scalar λ such that:
A [x1; x2] = λ [x1; x2]
The direction of the vector
remains invariant under the
transformation A!
EIGENVECTORS & EIGENVALUES
 Example:
Suppose A = [2 3; 2 1].
Then [3; 2] is an eigenvector corresponding to the
eigenvalue λ = 4:
[2 3; 2 1] [3; 2] = [12; 8] = 4 [3; 2]
What about [2 3; 2 1] [1; 3] = [11; 5]?
Is it an eigenvector?
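The claim can be checked numerically with eig (a quick sketch):
A = [2 3; 2 1];
[V, D] = eig(A)          % columns of V are eigenvectors, diag(D) the eigenvalues (4 and -1)
A * [3; 2] ./ [3; 2]     % both components give 4, so [3;2] is an eigenvector
A * [1; 3] ./ [1; 3]     % components differ (11 vs 5/3), so [1;3] is not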
EIGENVECTORS & EIGENVALUES
 To sum up with a picture:
Which one is the
eigenvector?
Eigenvectors are vectors
whose direction is
invariant under the
transformation matrix!
OBJECTIVE FUNCTION.
 A function associated with an optimization
problem which determines how good a solution is.
 PCA as an optimization problem:
 We want to get the direction of the projection that
maximizes the variance of the projected values, which
means getting the direction along which most of the
variance lies.
 w1 = arg max_w { var(w xᵀ) }, where w is a projection row
vector and x is the data matrix with each row as a
point. In 2-D this is the same as maximizing over the angle:
θ1 = arg max_θ { var( (cos θ, sin θ) xᵀ ) }
 w1 is the first principal component of our data cloud.
 This figure shows the projection of the data cloud
onto the w direction. The histogram of the projected
values is plotted in blue.
 Remember:
xᵀ: data cloud.
w: projection direction.
w xᵀ: projected cloud.
So, which is the best w?
 arg max_w { var(w xᵀ) }
DATA CLOUD PROJECTION
w
 Let's do the same with several projection vectors.
 We get different distributions for the projected data.
 In red you can see the
direction of maximum
variance:
w* = arg max_w { var(w xᵀ) }
 For the second component w2
we would look for the maximum of
the same function but
with a restriction: it must be orthogonal to w1.
DATA CLOUD PROJECTION
w*
PCA: ALGORITHM
 Remove mean from data.
 Calculate covariance matrix.
 SVD decomposition: A way to get eigenvalues &
eigenvectors of the covariance matrix.
 Select eigenvectors sorted by eigenvalue.
 Project to this new space to reduce
dimensionality.
 Project back to original space.
 Add mean to data.
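A compact sketch of these steps on a data matrix X with one observation per row; X (n-by-d) and k (the number of components to keep) are assumed inputs, and eig on the covariance matrix is used here, as the denoising function later in the deck also does:
% PCA dimensionality-reduction sketch: X is n-by-d, observations in rows.
mu   = mean(X, 1);
Xc   = X - repmat(mu, size(X,1), 1);           % remove the mean
C    = cov(Xc);                                % d-by-d covariance matrix
[V, D]     = eig(C);
[vals, idx] = sort(diag(D), 'descend');        % sort eigenvectors by eigenvalue
W    = V(:, idx(1:k));                         % keep the k leading directions
Xred = Xc * W;                                 % project into the reduced space
Xrec = Xred * W' + repmat(mu, size(X,1), 1);   % project back and restore the mean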
PCA GEOMETRY &
DIMENSIONALITY REDUCTION
PCA & EEG
 We will illustrate PCA by denoising EEG data. The
algorithm shown here could be improved, but
it's a naive approximation meant to help understand PCA.
 Each epoch/sweep can be seen as an N dimension
vector where N is the number of timepoints.
[Figure: a single EEG epoch plotted over time]
PCA & EEG
 All sweeps for my condition build a data cloud in
N-space.
[Figure: several EEG sweeps; together they form a data cloud in N-dimensional space]
PCA & EEG
 As now we’re dealing with vectors and not with
functions any more we can apply a PCA method
on them.
[Figure: a raw trial and its PCA-denoised version]
>> [trials timeline] = gettrials( 'c:\s2-b1.cnt', 2, 5, 0.30 );
>> plot(trials(15,:));
>> [dtrials F] = denoise_pca( trials );
>> plot(dtrials(15,:),'r');
PCA & EEG
 We can plot also some principal components of
this set of trials:
[Figure: three principal components of the set of trials]
PCA & EEG
function [depochs F] = denoise_pca( epochs )
dim = size(epochs);
%number of samples (epochs)
N = dim(1);
%number of variables (timepoints)
M = dim(2);
C = cov( epochs );
[V D] = eig( C );
cpower = diag(D);
%Reduce the dimensionality: keep the components holding 85% of the variance
tpower = sum(cpower) * 0.85;
cpower = cpower(end:-1:1);
cpower = cumsum(cpower);
numPC = sum( cpower <= tpower )
F = V(:,[M:-1:M-numPC+1]);
depochs = (F * F' * epochs')';
end
III – MULTIVARIATE METHODS
ICA
INTRODUCTION
ICA: If we listen to someone speaking about ICA,
what is it all about?
 It’s a mathematical model.
 It’s a set of algorithms (fastICA, infomax).
 ICA is a method to decompose a set of signals
into a set of statistical independent
components (time or space).
Mixing
matrix
x1 = a11 s1 + a12 s2
x2 = a21 s1 + a22 s2
INTRODUCTION
 In matrix form:
 X = A S
 It seems complex because we only know X, so
we'll need to make some assumptions about S.
Mixed signals
Source
signals
Mixing
matrix
INTRODUCTION : ICA IN EEG
WORDS
 From some EEG signals (xi) the problem is to
guess the mixing matrix A and components (si).
 The model assumes no delay and no distortion
through the environment.
[Figure: two EEG channels plotted over time]
INTRODUCTION
 To make blind source separation work we need
to be aware of some hints, restrictions and
pitfalls:
 The source signals need to be non-Gaussian! We'll
see why. At most one source can be Gaussian, because the
sum of two Gaussian random variables is another
Gaussian random variable.
 The source strength cannot be estimated because of the
ambiguity of the model, so we normalize to unit
variance:
 x1 = a11 s1 + a12 s2, so we can always write:
 x1 = (a11/λ) (λ s1) + a12 s2
 with numbers:
 10 = 2*3 + 1*4 ; 10 = 1*6 + 1*4 ; 10 = 3*2 + 1*4
INTRODUCTION
 The implicit hypothesis when we apply ICA is
that the source signals are mixed linearly:
x = As
Mixing matrix A:
A = [1 -2; -1 1]
Source signals s:
s = [4 -3 2; 9 4 1]
Mixed signals x:
x = As = [-14 -11 0; 5 7 -1]
 In the real world we will try to estimate the
unknown mixing matrix A knowing only x. (A quick
check of this forward model follows below.)
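The small numerical forward model above can be reproduced directly:
A = [1 -2; -1 1];
s = [4 -3 2; 9 4 1];
x = A * s               % gives [-14 -11 0; 5 7 -1], as stated above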
[Figure: the two source signals and the two mixed signals]
INDEPENDENCE &
UNCORRELATION
 But, what is independence ? & uncorrelation?
 Independence is the property that independent
events have in a probability framework. But what
are independent events? To answer this
question another mathematical concept must be
introduced: conditional probability.
 The conditional probability of an event is the probability
of that event given that another event has been observed:
p(A|B) = p(A,B)/p(B)
A and B are independent when p(A|B) = p(A), i.e. p(A,B) = p(A)p(B).
INDEPENDENCE &
UNCORRELATION
 Let's have a look at the following data cloud and
remember the following expressions, which hold when X, Y are
independent:
P(x|y) = P(x)
P(x,y) = P(x)P(y)
 This means that the probability distribution or
"shape" of x does not depend on the given y.
 The joint probability factorizes into the
marginal probabilities.
CONDITIONAL PROBABILITY
 Given two events we can define the
probability of B given that A has been observed,
written p(B|A).
 p(B|A) = p(A,B)/p(A)
 That is, the conditional
probability is the joint probability of
both events divided by the
probability of event A.
 In general p(B|A) ≠ p(A|B)
S
A
B
CONDITIONAL PROBABILITY
 Let's do a simple example with a die.
 S = {1,2,3,4,5,6} ; A = {1,4} ; B = {2,4,6} (even number)
 Every time we roll the die and the outcome 1 or 4
is observed, we say that the event A has
occurred.
 Every time we roll the die and the outcome 2, 4 or 6
is observed, we say that the event B has occurred.
 Assuming a fair die, p(i) = 1/6
 p(A) = 1/3
 p(B) = 1/2
 p(B|A) = p(B,A)/p(A) = (1/6)/(1/3) = 1/2
 p(A|B) = p(B,A)/p(B) = (1/6)/(1/2) = 1/3
INDEPENDENCE WITH 4 NUMBERS
 Let's do an exercise on statistical independence
with just 4 numbers!
 Given the following contingency table:
 Are x, y independent? We have to check that
p(x,y) = p(x)p(y) for all x, y.
 The very first check fails: p(x1,y1) ≠ p(x1)p(y1),
0.1 ≠ 0.12
(A tiny numeric check of the factorization follows below.)
P(x,y)    y1     y2     P(x)
x1        0.1    0.1    0.2
x2        0.5    0.3    0.8
P(y)      0.6    0.4
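A tiny sketch checking the factorization for this table:
Pxy   = [0.1 0.1; 0.5 0.3];               % joint probabilities p(x_i, y_j)
Px    = sum(Pxy, 2);                      % marginal p(x) = [0.2; 0.8]
Py    = sum(Pxy, 1);                      % marginal p(y) = [0.6 0.4]
indep = max(max(abs(Pxy - Px*Py))) < 1e-12   % false: p(x,y) ~= p(x)p(y)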
INDEPENDENCE WITH 4 NUMBERS
 Another way to think about it:
 p(y|x=x1) = {1/2, 1/2}
 p(y|x=x2) = {5/8, 3/8}
 p(y) = {6/10, 4/10}
 As we can see, p(y|x1) ≠ p(y|x2) ≠ p(y), so x and y are not
independent, because the probability distribution of y depends
on x.
 Any questions?
INDEPENDENCE, UNCORRELATION
& DATA CLOUDS
 Simple example of independent source signals
and the results of the mixing process:
 Imagine an independent random data cloud (in R²)
with uniform distributions for both variables.
[Figure: scatterplot of the two uniform variables with their marginal histograms]
%Create two uniformly distributed random samples
>> s = 99*rand(2,1000);
>> plot(s(1,:),s(2,:), '.');
INDEPENDENCE, UNCORRELATION
& DATA CLOUDS
 Bivariate scatterplot from two normal
independent distributions.
[Figure: scatterplot of the two independent normal variables with their marginal histograms]
%Create two normally distributed random samples
>> s = 99*randn(2,1000);
>> plot(s(1,:),s(2,:), '.');
Can you see any direction
along which the fitting
error gets minimized? All
possible directions get the
same error variance.
INDEPENDENCE, UNCORRELATION
& DATA CLOUDS
 Bivariate scatterplot from two exponential
independent distributions.
 All curves along y axis are scaled versions of the
same distribution ! : p(x,y) = p(y)p(x)
x
y
[Figure: scatterplot of the two independent exponential variables with their marginal histograms]
%Exponential indep. random samples
>> s = 99* exprnd(1,2,1000);
>> plot(s(1,:),s(2,:), '.');
x
y
INDEPENDENCE, UNCORRELATION
& DATA CLOUDS
 In the last figures, can you see any unique direction
along which you could draw a line?
This is because all of them are
independent ⇒ uncorrelated.
 The first signal in each plot has nothing to do
with the second one.
[Figure: the three independent scatterplots (uniform, normal, exponential) shown side by side]
INDEPENDENCE, UNCORRELATION
& DATA CLOUDS
 Let’s see a dependent joint probability
distribution scatterplot:
npoints = 1000;
s1 = 99*rand(1,npoints);
s2 = zeros(1,npoints);
%s2 values are dependent on s1 values!
for i=1:npoints
if(s1(i) > 45)
s2(i) = 99*normrnd(0,1);
else
s2(i) = 99*exprnd(1);
end
end
s = [s1; s2];
[Figure: scatterplot of the dependent pair (s1, s2) with marginal histograms]
>> corrcoef(s(1,:), s(2,:) )
ans =
1.0000 -0.3666
-0.3666 1.0000
Although it seems to fit a
linear model, the truth
behind the scatterplot is
that it does not.
INDEPENDENCE, UNCORRELATION
& DATA CLOUDS
 So, p(x,y) ≠ p(x)p(y), because the distribution of y
depends on x and vice versa.
 We get two different
shapes, exponential
and Gaussian, with
these two cuts:
p(y | x = 20) ~ Exponential
p(y | x = 80) ~ Normal
INDEPENDENCE, UNCORRELATION
& DATA CLOUDS
 Mix these two independent signals with a mixing
matrix, subtract the mean, and plot.
%Create and mix two uniformly distributed random samples
>> s = 99*rand(2,1000);
>> A = [0.54 -0.84;0.12 -0.27];
>> x = A*s;
>> x_m = repmat( mean(x,2), 1, npoints );
>> x = x - x_m;
[Figure: scatterplot of the mixed (and centred) data with marginal histograms]
Can you see any
interesting direction in
this data cloud?
INDEPENDENCE, UNCORRELATION
& DATA CLOUDS
 Some points to notice:
 A direction of maximum variance can now be seen
in the mixed data.
 The marginal histograms have changed: they are more
Gaussian! Three letters should come to your mind: CLT.
 We can see the mixing row vectors as the edges of the
mixed data cloud.
 The data have lost uncorrelatedness and independence.
[Figure: the original independent cloud and the mixed cloud side by side]
INDEPENDENCE, UNCORRELATION
& DATA CLOUDS
 CLT review: Imagine we have two dice and we
sum the outcome of them, we get the following
distribution:
>> dicesums = perm(1:6,1:6);
>> dicesums = reshape( dicesums, 1,prod(size(dicesums)));
>> [n c] = hist( dicesums,2:12 );
>> n = n ./sum(n);
>> stem(2:12,n);
[Figure: PMF of a single die next to the PMF of the sum of two dice]
The sum becomes more
Gaussian!
INDEPENDENCE, UNCORRELATION
& DATA CLOUDS
 What would happen if the rows of the mixing matrix
were orthogonal?
 As we can see, the data remain uncorrelated but
independence is lost.
[Figure: the independent cloud and the cloud mixed with an orthogonal matrix]
>> corrcoef(x')
ans =
1.0000 -0.0129
-0.0129 1.0000
INDEPENDENCE, UNCORRELATION
& DATA CLOUDS
 What we will do now is whiten, or sphere,
our data. The goals of this process are:
 Uncorrelate the data.
 Scale the variances of all variables to unit
variance.
[Figure: the same cloud at three stages: independent; correlated after mixing; uncorrelated with unit variance after whitening]
WHITENING OPERATION
 Whitening is the mathematical operation by
which data become uncorrelated with unit variance.
 This operation is also called sphering.
 So, after the transformation the covariance
matrix becomes the identity matrix.
WHITENING OPERATION
 We can see the whitening operation as a linear
transformation T that makes the data uncorrelated:
x′ = Tx,  E[x′x′ᵀ] = I  ⇒  E[T x xᵀ Tᵀ] = T E[x xᵀ] Tᵀ = I
⇒  T = E[x xᵀ]^(−1/2), where E[x xᵀ] = cov(x) when the data
are zero mean.
 The whitening operation is thus defined as:
x′ = E[x xᵀ]^(−1/2) x
WHITENING OPERATION
 Whitening operation on a 10 channel EEG data.
 Covariance matrix before and after whitening.
%Covariance images
>> addpath functionsprob
>> cnt = ldcntb('c:\s2-b1.cnt');
>> C = cov(cnt.dat);
>> imagesc( C );
>> wx = whiten(cnt.dat');
>> C = cov( wx' );
>> imagesc( C );
[Figure: 10×10 covariance matrix of the EEG channels before and after whitening]
function wx = whiten( x )
npoints = length(x);
x_m = repmat( mean(x,2), 1, npoints );
x = x - x_m;
C = cov(x');
wx = inv(sqrtm(C))*x;
end
INDEPENDENCE, UNCORRELATION
& DATA CLOUDS
 After the sphering/whitening procedure applied
to the mixed signals we are halfway to
getting the independent components. We just need
to rotate the data cloud. But rotate up to what
point? To the maximum of a certain cost
function; in the ICA case, non-Gaussianity!
 Minimizing Gaussianity = maximizing non-Gaussianity.
 The marginal probabilities will move away from the
normal distribution. We are walking backwards along
the CLT.
rotation
 Now, we will project the whitened data cloud in
many different directions and will calculate the
kurtosis of projected points.
[Figure: kurtosis of the projected points as a function of the projection angle]
INDEPENDENCE, UNCORRELATION
& DATA CLOUDS
%Kurtosis for different projection angles.
>> wx = whiten( x );
>> plot(wx(1,:),wx(2,:), '.');
>> f = costfunction( @kurt, wx );
>> plot(f)
[Figure: whitened data cloud and the kurtosis cost function, with maxima at roughly 45° and 135°]
DATA PROJECTION REVIEW
 Histogram based on the data points projected at
angle 0.
 So, we can calculate any statistic on this
distribution.
function cf = costfunction( f, data )
%1 degree in radians.
alpha = pi/180;
%Number of datapoints
npoints = length(data);
%Rotation matrix
R = [cos(alpha) -sin(alpha); sin(alpha) cos(alpha)];
cf = zeros(1,180);
%Initial projection vector
pvector = [1;0];
for i=1:180
%Projection vector gets rotated
pvector = R*pvector;
%Project data onto projection vector
pdata = dot( repmat(pvector,1,npoints) , data);
%Calculate statistic and store value
cf(i) = f( pdata );
end
end
INDEPENDENCE, UNCORRELATION
& DATA CLOUDS
[Figure: whitened data cloud with the histogram of the points projected at angle 0]
INDEPENDENCE, UNCORRELATION
& DATA CLOUDS
 If we repeat the procedure for a different sample
we get different shapes. One thing to notice is
that although the cost functions have different
shapes, the two maxima are located at the
same angles!
[Figure: kurtosis cost functions and data clouds for several different samples; the two maxima stay at the same angles]
 Let's bootstrap the whole process to get
insight into how it behaves for different resamples:
 500 kurtosis functions calculated from resamples, for
projection angles from 0 to 180 degrees.
INDEPENDENCE, UNCORRELATION,
DATA CLOUDS & BOOTSTRAP
[Figure: data cloud, plus the bootstrap mean and variance of the kurtosis curve as functions of the projection angle]
INDEPENDENCE, UNCORRELATION
& DATA CLOUDS
 With the same procedure we can check why
ICA doesn't like Gaussian data; the problem
arises when trying to calculate the rotation angle
for Gaussian data:
[Figure: Gaussian data clouds and their kurtosis curves, which show no clear maxima]
There’s no
structure in the
data
INDEPENDENCE,
UNCORRELATION , DATA CLOUDS &
BOOTSTRAP
 Bootstrap of the bivariate normal data cloud for
kurtosis:
[Figure: bivariate normal cloud, plus the bootstrap mean and variance of the kurtosis curve, which are essentially flat across angles]
INDEPENDENCE, UNCORRELATION
& DATA CLOUDS
 Whole process with exponential (s1) and
gaussian (s2) data:
[Figure: the whole process on exponential (s1) and Gaussian (s2) sources: mixing, whitening and rotation]
Cannot recover
sign and order!
ICA & PCA WITH IMAGES
 PCA is not able to recover independence:
NonGaussian
EXAMPLE WITH EEG SIGNALS
 Up to now we fixed the mixing matrix and source
signals. With the forward model (x = As) we got
the mixed signals. Now we face the inverse
problem: from two EEG signals the problem is to
guess the mixing matrix A.
 Let's have a look at the scatterplot.
[Figure: two raw EEG channels plotted over time]
%Channels 2 & 3
cnt = ldcntb( 'c:\s2-b1.cnt' );
s1 = cnt.dat(1:4000,2);
s2 = cnt.dat(1:4000,3);
s = [s1 s2]';
EXAMPLE WITH EEG SIGNALS
 It seems that these two channels are correlated.
 The next step would be to whiten the data.
[Figure: scatterplot of the two EEG channels with marginal histograms]
EXAMPLE WITH EEG SIGNALS
 Whitened signals
[Figure: scatterplot of the whitened EEG signals]
EXAMPLE WITH EEG SIGNALS
 Rotate to get the maximum value of kurtosis.
[Figure: scatterplot of the whitened signals after rotating to the maximum-kurtosis direction]
 
Statistik 1 5 distribusi probabilitas diskrit
Statistik 1 5 distribusi probabilitas diskritStatistik 1 5 distribusi probabilitas diskrit
Statistik 1 5 distribusi probabilitas diskritSelvin Hadi
 
Discrete Probability Distributions
Discrete Probability DistributionsDiscrete Probability Distributions
Discrete Probability Distributionsmandalina landy
 
Chapter 4 part2- Random Variables
Chapter 4 part2- Random VariablesChapter 4 part2- Random Variables
Chapter 4 part2- Random Variablesnszakir
 
Communication Theory - Random Process.pdf
Communication Theory - Random Process.pdfCommunication Theory - Random Process.pdf
Communication Theory - Random Process.pdfRajaSekaran923497
 
Point Estimate, Confidence Interval, Hypotesis tests
Point Estimate, Confidence Interval, Hypotesis testsPoint Estimate, Confidence Interval, Hypotesis tests
Point Estimate, Confidence Interval, Hypotesis testsUniversity of Salerno
 
Problem_Session_Notes
Problem_Session_NotesProblem_Session_Notes
Problem_Session_NotesLu Mao
 
Econometrics 2.pptx
Econometrics 2.pptxEconometrics 2.pptx
Econometrics 2.pptxfuad80
 
Discrete probability
Discrete probabilityDiscrete probability
Discrete probabilityRanjan Kumar
 
SAMPLING MEAN DEFINITION The term sampling mean .docx
SAMPLING MEAN DEFINITION The term sampling mean .docxSAMPLING MEAN DEFINITION The term sampling mean .docx
SAMPLING MEAN DEFINITION The term sampling mean .docxanhlodge
 
Probability distribution for Dummies
Probability distribution for DummiesProbability distribution for Dummies
Probability distribution for DummiesBalaji P
 

Similaire à Statistical Methods (20)

U unit7 ssb
U unit7 ssbU unit7 ssb
U unit7 ssb
 
S t a t i s t i c s
S t a t i s t i c sS t a t i s t i c s
S t a t i s t i c s
 
2 Review of Statistics. 2 Review of Statistics.
2 Review of Statistics. 2 Review of Statistics.2 Review of Statistics. 2 Review of Statistics.
2 Review of Statistics. 2 Review of Statistics.
 
Statistik 1 5 distribusi probabilitas diskrit
Statistik 1 5 distribusi probabilitas diskritStatistik 1 5 distribusi probabilitas diskrit
Statistik 1 5 distribusi probabilitas diskrit
 
Probability
ProbabilityProbability
Probability
 
Discrete Probability Distributions
Discrete Probability DistributionsDiscrete Probability Distributions
Discrete Probability Distributions
 
Chapter 4 part2- Random Variables
Chapter 4 part2- Random VariablesChapter 4 part2- Random Variables
Chapter 4 part2- Random Variables
 
Communication Theory - Random Process.pdf
Communication Theory - Random Process.pdfCommunication Theory - Random Process.pdf
Communication Theory - Random Process.pdf
 
Talk 3
Talk 3Talk 3
Talk 3
 
Basics of Statistics
Basics of StatisticsBasics of Statistics
Basics of Statistics
 
Point Estimate, Confidence Interval, Hypotesis tests
Point Estimate, Confidence Interval, Hypotesis testsPoint Estimate, Confidence Interval, Hypotesis tests
Point Estimate, Confidence Interval, Hypotesis tests
 
PTSP PPT.pdf
PTSP PPT.pdfPTSP PPT.pdf
PTSP PPT.pdf
 
Problem_Session_Notes
Problem_Session_NotesProblem_Session_Notes
Problem_Session_Notes
 
Makalah ukuran penyebaran
Makalah ukuran penyebaranMakalah ukuran penyebaran
Makalah ukuran penyebaran
 
Probability[1]
Probability[1]Probability[1]
Probability[1]
 
Econometrics 2.pptx
Econometrics 2.pptxEconometrics 2.pptx
Econometrics 2.pptx
 
Inorganic CHEMISTRY
Inorganic CHEMISTRYInorganic CHEMISTRY
Inorganic CHEMISTRY
 
Discrete probability
Discrete probabilityDiscrete probability
Discrete probability
 
SAMPLING MEAN DEFINITION The term sampling mean .docx
SAMPLING MEAN DEFINITION The term sampling mean .docxSAMPLING MEAN DEFINITION The term sampling mean .docx
SAMPLING MEAN DEFINITION The term sampling mean .docx
 
Probability distribution for Dummies
Probability distribution for DummiesProbability distribution for Dummies
Probability distribution for Dummies
 

Statistical Methods

  • 15. FUNCTION OF A RANDOM VARIABLE • A function of a random variable is another random variable: Y = g(X) • Because it is a random variable, it has a probability mass/density function. • If X is the random variable associated with rolling a die: • We could define the random variable Y = f(X) = 2X. The possible values of this new random variable are {2,4,6,8,10,12}, each with probability 1/6. What is PY(y)? • Another function we could define is Y = g(X) = {1 if x is even, 0 if x is odd}. What are SY and PY(y)?
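 A minimal Matlab sketch of this idea, assuming a fair die (this snippet is illustrative and not part of the original slides):
 % Fair die (assumption): sample space and pmf
 S  = 1:6;
 pX = ones(1,6)/6;
 % Y = f(X) = 2X: values 2,4,...,12, each inheriting probability 1/6
 Sy1 = 2*S;
 pY1 = pX;
 % Y = g(X) = 1 if X is even, 0 if X is odd
 Sy2 = [0 1];
 pY2 = [sum(pX(mod(S,2)==1)) sum(pX(mod(S,2)==0))]   % [0.5 0.5]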
  • 16. EXPECTATION AND VARIANCE OF A RANDOM VARIABLE • Given a random variable X, we define the expectation as the centre of mass of its probability mass/density function and the variance as the spread around this centre.
  • 17. MEAN & VARIANCE OF A RANDOM VARIABLE: E[X] & VAR[X]
 E[X] = μ = Σ x P(X = x)
 Var[X] = E[(X − μ)²] = Σ (x − μ)² P(X = x)
 Ex: X ~ bin(7, 0.4)
 function E = expectation( sample_space, pmf )
     E = sum( sample_space .* pmf );
 end
 function V = variance( sample_space, pmf )
     E = expectation( sample_space, pmf );
     V = sum( (sample_space - E).^2 .* pmf );
 end
 >> n = 7;
 >> p = 0.4;
 >> S = 0:n;
 >> pdf = binopdf(S, n, p);
 >> stem(S, pdf);
 >> expectation(S, pdf)
 ans = 2.8000
 >> variance(S, pdf)
 ans = 1.6800
  • 18. SKEWNESS & KURTOSIS • The skewness is a measure of the asymmetry of the probability distribution. • It is the third standardized moment about the mean.
  • 19. SKEWNESS & KURTOSIS • Kurtosis is a measure of the peakedness/normality of the random variable. It is the fourth standardized moment about the mean: μ4/σ⁴. • Sometimes it is defined as μ4/σ⁴ − 3, a correction that makes the kurtosis of the normal distribution equal to zero. • Distributions with 0 excess kurtosis are called mesokurtic. • Distributions with positive excess kurtosis are called leptokurtic. • Distributions with negative excess kurtosis are called platykurtic.
  • 21. SAMPLING DISTRIBUTION • A sampling distribution is the distribution of a given statistic (a function of random variables) based on a random sample of size n. • The sampling distribution depends on the underlying distribution of the population, the statistic being considered and the sample size used.
  • 22. SAMPLING DISTRIBUTION • For example: consider a normal population with mean μ and variance σ². Assume we take samples of size n from this population and calculate the arithmetic mean of each sample (the sample mean statistic). • Each sample will have its own average value, and the distribution of these averages is called the "sampling distribution of the sample mean". • This distribution will be normal, N(μ, σ²/n). • For other statistics and other populations the formulas are frequently more complicated, and sometimes a closed form does not even exist.
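 A small simulation sketch of this statement (the values of μ, σ and n are arbitrary illustrative choices, not from the slides):
 mu = 10; sigma = 2; n = 25;                 % assumed population parameters
 nrep = 10000;                               % number of samples drawn
 xm = zeros(1,nrep);
 for k = 1:nrep
     xm(k) = mean( normrnd(mu, sigma, n, 1) );   % sample mean of one sample of size n
 end
 mean(xm)    % close to mu
 var(xm)     % close to sigma^2/n = 0.16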
  • 23. CENTRAL LIMIT THEOREM • Let Sn be the sum of n i.i.d. random variables: Sn = X1 + … + Xn (Xi ~ ANY distribution), E[Sn] = nμ, Var[Sn] = nσ². • We define Zn as: Zn = (Sn − nμ)/(σ√n). • As n grows to infinity, Zn converges to N(0,1). • As n grows the excess kurtosis converges to 0. So, the more terms we add (larger n), the more normal the resulting random variable becomes.
  • 26. NUMERICAL DEMONSTRATION • As N grows the distribution converges to N(0,1).
 >> out = sample_mean( 2 );  >> kurt(out)   ans = -0.5745
 >> out = sample_mean( 5 );  >> kurt(out)   ans = -0.1728
 >> out = sample_mean( 10 ); >> kurt(out)   ans = -0.1104
 >> out = sample_mean( 20 ); >> kurt(out)   ans = -0.0802
 [Histograms shown for N = 2, N = 5, N = 20]
  • 27. NUMERICAL DEMONSTRATION • No matter the source distribution.
 >> samples = sample_mean(2,'exp');  >> kurt(samples) ans = 3.1135  >> skewness( samples ) ans = 1.3566
 >> samples = sample_mean(5,'exp');  >> kurt(samples) ans = 1.1265  >> skewness( samples ) ans = 0.8995
 >> samples = sample_mean(10,'exp'); >> kurt(samples) ans = 0.6619  >> skewness( samples ) ans = 0.7450
 >> samples = sample_mean(50,'exp'); >> kurt(samples) ans = 0.1188  >> skewness( samples ) ans = 0.2686
 [Histograms shown for N = 2, 5, 10, 50]
  • 28. SAMPLE MEAN • Given a sequence of i.i.d. random variables X1, …, Xn we define the sample mean as: Xm = (X1 + … + Xn)/n. • As Xm is a function of random variables, it is itself a random variable with an associated probability mass/density function. • For large n (n > 30) the probability density is approximately normal because of the CLT. • E[Xm] = E[Xi] = μ • Var(Xm) = Var(Xi)/n = σ²/n • SD(Xm) = σ/√n • Xm ~ N(μ, σ²/n)
  • 29. SAMPLE MEAN • The standard deviation (uncertainty) around the population mean is reduced by a factor of 1/√n.
  • 30. SAMPLE VARIANCE • Xm ~ N(μ, σ²/n) • This random variable spreads out around the population mean μ, so it is an unbiased estimator. • S² = 1/(n−1) Σ(xi − Xm)² • (Cochran's theorem) S² ~ σ²/(n−1) · χ²n−1 • χ²k is the distribution of Σi=1..k Xi², with Xi ~ N(0,1).
  • 31. SAMPLE VARIANCE • Given a uniform distribution on [−1, 1], its standard deviation is σ = 1/√3 and its excess kurtosis is −6/5.
 E[S²] = σ² = 1/3
 Var(S²) = σ⁴ (2/(n−1) + Kurt/n), where Kurt is the excess kurtosis μ4/σ⁴ − 3.
 %Uniform on -1 to 1.
 samples = unifrnd(-1,1,1,N);
 mu = 0;
 sigma = sqrt(1/3);
 [Histograms shown for N = 4, N = 20, N = 200]
  • 32. SAMPLE VARIANCE, NUMERICAL EXAMPLE
 >> [smean svar] = sample_mean_var(4,'uni');
 >> var(svar)                 ans = 0.0410
 >> ((1/3)^2)*(2/3 - 6/20)    ans = 0.0407
 >> [smean svar] = sample_mean_var(5,'uni');
 >> var(svar)                 ans = 0.0287
 >> ((1/3)^2)*(2/4 - 6/25)    ans = 0.0289
 >> [smean svar] = sample_mean_var(10,'uni');
 >> var(svar)                 ans = 0.0113
 >> ((1/3)^2)*(2/9 - 6/50)    ans = 0.0114
 [Histograms of the sample variance shown for N = 2 … 6]
  • 34. INFERENCE • Statistical inference is the process of drawing conclusions from datasets arising from systems affected by random variation. • Inference makes propositions about populations, using data drawn from the population of interest via some form of sampling. • We'll have to deal with this partial information!
  • 35. SAMPLING CONSEQUENCES • The goal behind inference is to determine whether an observed effect, such as a difference between two means or the correlation between two variables, could reasonably be attributed to the randomness introduced by selecting the sample. • If not, we have evidence that the effect observed in the sample reflects an effect that is present in the population.
  • 36. SAMPLING EXAMPLE • Imagine we have the following population, take a sample of size 3 and compute the sample mean: • Population: 1, 3, 6, 7, 2, 0, −3, −5 (mean 1.37); sample: 7, 6, 3 (sample mean 5.33). • So we need something to tell us whether 5.33 makes sense as a population mean or not! Answering this question is inference. • Remember the sampling distribution (theoretical or empirical)!
  • 37. CONFIDENCE INTERVAL • It is an interval estimate of a population parameter, e.g. the population mean. • Instead of estimating the parameter by a single value, an interval likely to include the parameter is given. • How likely the interval is to contain the parameter is given by the confidence level. • Increasing the confidence level will widen the confidence interval. • The end points of the confidence interval are referred to as confidence limits; these end points are random variables, so the interval is a random variable too.
  • 38. CONFIDENCE INTERVAL • The confidence level is usually 0.95 or 0.99. The intuition behind this number is that if we repeated the experiment/measurement 100 times, in 95 of them the interval would hold the true parameter value. • First, let's have a look at the theory-oriented sampling distribution and hence at theoretical confidence intervals! • Imagine we know the population's standard deviation σ (from previous studies): • Z = (Xm − μ)/(σ/√n) ~ N(0,1) • We can select two symmetrical points −zα/2 and zα/2 for which P(−zα/2 ≤ Z ≤ zα/2) = 1−α • P(Xm − zα/2 σ/√n ≤ μ ≤ Xm + zα/2 σ/√n) = 1−α • So the probability of μ belonging to the CI is 1−α • Random confidence interval: [Xm ± zα/2 σ/√n]
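 A sketch of this computation in Matlab, assuming σ is known and using norminv from the Statistics Toolbox (the sample values and parameters below are illustrative assumptions):
 alpha = 0.05;                       % 95% confidence level
 sigma = 4.5;                        % assumed known population standard deviation
 x     = normrnd(25, sigma, 50, 1);  % one illustrative sample
 n     = length(x);
 xm    = mean(x);
 z     = norminv(1 - alpha/2);       % z_{alpha/2}, about 1.96
 ci    = [xm - z*sigma/sqrt(n), xm + z*sigma/sqrt(n)]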
  • 40. CONFIDENCE INTERVAL: 1 − FZ(x) TABLE • There are different tables that express the same information!
  • 41. CONFIDENCE INTERVAL • The horizontal line segments represent 100 realisations of a confidence interval for μ. [The figure also marks the acceptance region and the critical region.]
  • 42. CONFIDENCE INTERVAL WITH KNOWN VARIANCE • What does P(μ ∈ [Xm ± zα/2 σ/√n]) = 1−α mean? • We know that, by repetition, in 100(1−α)% of the cases μ will be in the calculated interval. • In practice we only have one repetition of the experiment and so a single confidence interval. This is why we rely on our interval to hold the parameter: it will in 100(1−α)% of the cases! • The CI length is 2 zα/2 σ/√n, and our interest is to make this interval as short as possible. • The confidence interval length grows as the confidence level grows and vice versa; however, we should keep the level within reasonable limits (95%, 99%).
  • 43. CONFIDENCE INTERVAL WITH KNOWN VARIANCE • The estimate is more accurate when the population variance is smaller, which means that the population is more homogeneous. • The length of the interval decreases as we increase the sample size.
  • 44. CONFIDENCE INTERVAL WITH UNKNOWN VARIANCE • If we don't know the population variance, it can be estimated with the sample variance S², and the CI becomes: [Xm ± tn−1,α/2 S/√n], where tn−1,α/2 comes from Student's t distribution.
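 The same computation when σ is unknown, replacing zα/2 by the t quantile (tinv is a Statistics Toolbox function; the sample below is an illustrative assumption):
 alpha = 0.05;
 x  = normrnd(25, 4.5, 20, 1);       % illustrative sample, variance treated as unknown
 n  = length(x);
 xm = mean(x);
 S  = std(x);                        % sample standard deviation
 t  = tinv(1 - alpha/2, n-1);        % t_{n-1, alpha/2}
 ci = [xm - t*S/sqrt(n), xm + t*S/sqrt(n)]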
  • 45. BOOTSTRAP INTRODUCTION • The revolution in computing is having a dramatic influence on statistics. • Statistical study of very large and very complex data sets becomes feasible (fMRI analysis). • These methods, bootstrap confidence intervals and permutation tests, apply computing power to relax some of the conditions needed for traditional inference (normality). • The main goal is to compute bias, variability and confidence intervals for which theory doesn't provide closed-form solutions (formulas). • Closed-form solutions are replaced by brute-force computing. • Now, let's have a look at the data-oriented sampling distribution!
  • 46. BOOTSTRAP INTRODUCTION • It's a procedure in which computer simulation through resampling data replaces mathematical analysis. • The big idea behind it: the sample is an estimate of the population (if large enough), so treat the sample as if it were the population itself. • Population → sample → estimate: our working truth about the population!
  • 47. BOOTSTRAP INTRODUCTION • So let's treat the data as a proxy for the true distribution! • In Bradley Efron's words: • "Bootstrapping requires very little in the way of modeling, assumptions, or analysis, and can be applied in an automatic way to any situation, no matter how complicated." • "An important theme is the substitution of raw computing power for theoretical analysis."
  • 48. BOOTSTRAP PROCEDURE • The big idea: statistical inference is based on the sampling distributions of sample statistics (e.g. the sample mean). The bootstrap is first of all a way of finding the sampling distribution from just one sample. The bootstrap procedure/algorithm: • Step 1: Resampling. A sampling distribution is based on many random samples from the population. In place of many samples from the population, create many resamples by repeatedly sampling with replacement from this one random sample.
 Original sample: 3.12 0.00 1.57 19.67 0.22 2.20   (mean = 4.46)
 Resample 1:      1.57 0.22 19.67 0.00 0.22 3.12   (mean = 4.13)
 Resample 2:      0.00 2.20 2.20 2.20 19.67 1.57   (mean = 4.64)
 Resample 3:      0.22 3.12 1.57 3.12 2.20 0.22    (mean = 1.74)
  • 49. BOOTSTRAP PROCEDURE • Step 2: Bootstrap distribution. The sampling distribution of a statistic (a function of the data) collects the values of the statistic from many samples. The bootstrap distribution of a statistic collects its values from many resamples. The bootstrap distribution gives information about the sampling distribution: Fboot ≈ Fsampling. The sampling distribution is the key object for answering questions. • Step 3: Repeat steps 1 and 2 many times. • This basic procedure can be scripted in a few lines with any high-level language of your taste (Matlab, R, C++, …)!
  • 50. BOOTSTRAP'S PICTURE FOR THE SAMPLE MEAN • [Figure: population → sample → resamples; the bootstrap sample mean distribution approximates the sampling distribution of the sample mean.]
  • 51. BOOTSTRAP MATLAB SCRIPT (EASY ONE)
 %The bootstrap method calculates the bootstrap distribution
 %of the statistic under study.
 %
 %sample           : Sample data array.
 %statistic_handle : handle to the statistic function.
 function boot_samples = bootstrap( sample, statistic_handle )
     N = 10000;
     boot_samples = zeros(1,N);
     for j = 1:N
         rsample = resample( sample );
         boot_samples(j) = statistic_handle( rsample );
     end
     plot_samples( boot_samples );
 end
 %resampling the original sample with replacement
 function rsample = resample( sample )
     n = length( sample );
     random_idx = unidrnd(n,1,n);
     rsample = sample( random_idx );
 end
  • 52. BOOTSTRAP EXAMPLES • There are some functions in the Matlab Statistics Toolbox for bootstrapping: bootstrp, bootci.
 >> which bootstrp
 C:\Program Files\MATLAB\R2007a\toolbox\stats\bootstrp.m
 >> which bootci
 C:\Program Files\MATLAB\R2007a\toolbox\stats\bootci.m
 • Let's do some examples: • Bootstrapping a correlation coefficient. • Bootstrapping the standard error of the mean. • Bootstrapping two sample means. • Bootstrapping the confidence interval of a regression coefficient (the slope).
  • 53. BOOTSTRAP EXAMPLE: CORRELATION • There are several datasets available in Matlab to work with; the Statistics Toolbox includes many: • acetylene.mat: chemical reaction data with correlated predictors. • arrhythmia.mat: cardiac arrhythmia data from the UCI machine learning repository. • cities.mat: quality of life ratings for U.S. metropolitan areas. • lawdata.mat: grade point average and LSAT scores from 15 law schools. • … • We will work with lawdata.mat, which has two variables: GPA (Grade Point Average) and LSAT (a test designed to measure skills considered essential for success in law school). • Are these two variables related in some way in a law school?
  • 54. BOOTSTRAP EXAMPLE: CORRELATION
 >> load lawdata
 >> plot_hist(lsat,'b',10); plot_hist(gpa,'r',10);
 >> mean(lsat)   ans = 600.2667
 >> mean(gpa)    ans = 3.0947
 [Histograms of LSAT and GPA]
  • 55. BOOTSTRAP EXAMPLE: CORRELATION
 >> load lawdata
 >> plot_hist(lsat,'b',10); plot_hist(gpa,'r',10);
 >> scatter(gpa,lsat,'filled');
 >> [r p] = corrcoef(lsat,gpa)
 r = 1.0000  0.7764
     0.7764  1.0000
 p = 1.0000  0.0007
     0.0007  1.0000
 >> bsamples = bootstrap({@(x,y) submat(corrcoef(x,y),1,'1',2,'2'), @stat_resample},95,gpa,lsat);
 Confidence interval [0.4584 0.96127]
 >> plot_samples( bsamples );
 • 0.0007 < 0.05 and 0 is not in the CI, so these two variables are positively correlated!
 [Scatterplot, bootstrap histogram and QQ plot of the bootstrap samples versus the standard normal]
  • 56. BOOTSTRAP EXAMPLE: CORRELATION • To estimate the P-value for a test of significance we have to estimate the sampling distribution of the statistic when the null hypothesis is true. So we have to resample in a manner that is consistent with H0.
 >> ci = bootci(10000,{@(x,y) submat(corrcoef(x,y),1,'1',2,'2'),lsat,gpa},'type','per')
 ci = 0.4594
      0.9606
 • With Matlab built-in functions (bootci) and the same method (percentile) the result is almost the same.
  • 57. BOOTSTRAP EXAMPLE: CORRELATION • To resample in a manner that is consistent with the null hypothesis we merge the groups and randomly resample with replacement, imitating many repetitions of the random assignment of subjects to "treatment" and "control" groups.
 Control: 3.12 0.00 1.57 (Mc = 1.563)   Treatment: 19.67 20.22 18.20 (Mt = 19.36)
 Merged:  3.12 0.00 1.57 19.67 20.22 18.20 (mean = 10.46)
 Resample consistent with H0:  Control: 20.22 18.20 0.00 (Mc = 12.80)  Treatment: 1.57 3.12 19.67 (Mt = 8.12)
 Resample consistent with H0:  Control: 3.12 19.67 20.22 (Mc = 14.33)  Treatment: 18.20 0.00 1.57 (Mt = 6.59)
 Resample consistent with H0:  Control: 3.12 20.22 1.57 (Mc = 8.30)    Treatment: 18.20 0.00 19.67 (Mt = 12.62)
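 A possible sketch of one H0-consistent resample; the stat_resample_asH0 helper used in the next slide is not shown in the deck, so this is only one way it could be implemented:
 control   = [3.12 0.00 1.57];
 treatment = [19.67 20.22 18.20];
 merged    = [control treatment];
 n = length(merged);
 idx   = unidrnd(n, 1, n);       % resample the merged data with replacement
 rsamp = merged(idx);
 Mc = mean(rsamp(1:3));          % "control" mean under H0
 Mt = mean(rsamp(4:6));          % "treatment" mean under H0
 Mt - Mc                         % one H0-consistent value of the statistic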
  • 58. BOOTSTRAP EXAMPLE: CORRELATION
 >> h0bsamples = bootstrap({@(x,y) submat(corrcoef(x,y),1,'1',2,'2'), @stat_resample_asH0},95,gpa,lsat);
 >> plot_hist( h0bsamples, 'b' );
 >> hold on
 >> plot_hist( bsamples, 'r' );
 • Now we know what the distribution of our statistic looks like in both cases: the sampling distribution under H0 and under H1, with the observed statistic marked.
  • 59. BOOTSTRAP EXAMPLE: CORRELATION • The desired P-value is then estimated as: • P-value = #{t* > t*obs}/B, where B = number of iterations. • If P-value < α, reject H0. • If P-value > α, accept H0. • A simple Matlab script to get this value:
 %P-value of the statistic given a H0 sampling distribution.
 function [H0 p_value] = pvalue( bsamples, observed_stat, alpha )
     p_value = sum( bsamples > observed_stat )/length(bsamples);
     H0 = p_value > alpha;
 end
 >> [H0 p_value] = pvalue( h0bsamples, 0.776, 0.05 )
 H0 = 0
 p_value = 6.0000e-004
 >> [r p] = corrcoef(lsat,gpa)
 r = 1.0000  0.7764
     0.7764  1.0000
 • So we reject H0 and can say that these two variables are statistically correlated at the 0.05 significance level!
  • 60. BOOTSTRAP EXAMPLE: STANDARD ERROR OF THE MEAN (SEM) • We will start from a known situation in which we know the value of the standard error of the mean from theory, and we will estimate the same value by bootstrapping from a sample. • Suppose we have a population following the model ~N(25, 4.5). • We take 5000 samples from this population.
 >> samples = normrnd(25,4.5,5000,1);
 >> plot_hist( samples );
 >> mean(samples)   ans = 25.0568   (sample mean estimate!)
 >> std(samples)    ans = 4.4253
  • 61. BOOTSTRAP EXAMPLE: SEM • From theory we know that SEM = σ/√n = 4.5/sqrt(5000) = 0.0636.
 >> bsamples = bootstrap({@mean, @stat_resample},95,samples);
 Confidence interval [24.9338 25.18].
 >> plot_norm_samples( bsamples );
 • As we can see, the true value (25) belongs to the confidence interval, as it should!
  • 62. BOOTSTRAP EXAMPLE: SEM • Let's estimate the standard error of the mean by calculating the sample standard deviation of the bootstrap sampling distribution of the sample mean statistic. OK, wait and repeat that in your head …
 >> std( bsamples )   ans = 0.0631
 • It is not far from the theoretical value, 0.0636! To sum up:
 Theoretical value:             >> 4.5/sqrt(5000)           ans = 0.0636
 Traditional statistics value:  >> std(samples)/sqrt(5000)  ans = 0.0626
 Bootstrap value:               >> std( bsamples )          ans = 0.0631
  • 63. BOOTSTRAP EXAMPLE: TWO SAMPLE MEANS • Imagine that we have two different samples. The hypothesis test we have to answer is whether these two samples are drawn from the same distribution or not. • H0: F = G • H1: F ≠ G • [Figure: one population F = G producing both samples vs. two populations F and G, one per sample.]
  • 64. BOOTSTRAP EXAMPLE: TWO SAMPLE MEANS • If the distribution is parametric we can formulate the test in terms of the parameters: F = F(x;θ1), G = F(x;θ2) (e.g. θ is the expectation). • H0: θ1 = θ2 • H1: θ1 ≠ θ2 • [Figure: non-effect scenario vs. effect scenario for Sample1 and Sample2.]
  • 65. BOOTSTRAP EXAMPLE: TWO SAMPLE MEANS
 >> csamples = normrnd( 25, 3, 50, 1);
 >> tsamples = normrnd( 26, 3, 100, 1);
 >> plot_hist( csamples );
 >> plot_hist( tsamples, 'r' );
 >> mean(tsamples) - mean(csamples)   ans = 1.0648   (Is this difference significant?)
 >> h0bsamples = bootstrap({@(x,y)(mean(x)-mean(y)), @stat_resample_asH0},tsamples, csamples);
 >> bsamples = bootstrap({@(x,y)(mean(x)-mean(y)), @stat_resample},tsamples, csamples);
 >> [h0 p_value] = pvalue( h0bsamples, mean(tsamples) - mean(csamples), 0.05 )
 h0 = 0
 p_value = 0.0261
 >> [fiho,xiho] = ksdensity( h0bsamples, 'npoints',500 );
 >> plot( xiho, fiho );
 • The smaller the p-value, the stronger the evidence that H0 is false. In some texts the p-value is called the Achieved Significance Level (ASL): P( Tboot > tobs | H0 ). As 0.026 < 0.05, we reject H0.
  • 66. BOOTSTRAP EXAMPLE: REGRESSION COEFFICIENT CONFIDENCE INTERVAL • We will try to answer the following question: by how many points should someone improve their LSAT score to add 1 point to the GPA value? • How can we answer this question? With the slope coefficient of a regression fit! • So we can write GPA as a linear function of LSAT values, with x: LSAT, y: GPA. • y = ax + b; requiring y(x+c) − y(x) = 1 gives a(x+c) + b − ax − b = ac = 1, so c = 1/a. • So the inverse of the slope of the regression line is our number!
  • 67. BOOTSTRAP EXAMPLE: REGRESSION COEFFICIENT CONFIDENCE INTERVAL
 >> load lawdata;
 >> x = [ones(size(lsat)) lsat];   %design matrix
 >> y = gpa;
 >> b = regress(y,x)
 b = 0.3794
     0.0045
 >> yfit = x*b;
 >> scatter(x(:,2),y);
 >> hold on;
 >> plot( x(:,2), yfit,'r');
 >> plot( x(:,2), yfit,'or');
 >> resid = y - yfit;
 >> plot_hist( resid, 'b', 5 );
 • Residuals: we will assume white noise. We will bootstrap these values into the regression model!
  • 68. BOOTSTRAP EXAMPLE: REGRESSION COEFFICIENT CONFIDENCE INTERVAL
 >> bsamples = bootstrap({@(rsamples) submat(regress(yfit+rsamples,x),1,'2'), @stat_resample},95,resid);
 Confidence interval [0.0026841 0.0064508].
 >> mean(bsamples)   ans = 0.0045
 >> 1/0.0045         ans = 222.2222
  • 69. III – MULTIVARIATE METHODS PCA
  • 70. INTRODUCTION • PCA is a mathematical method to find interesting directions in data clouds. • These interesting directions are called principal components. • Later on we will give an interpretation of "interesting"; for now they are just somehow interesting.
  • 71. PCA INTRODUCTION • Given a data table with variables in columns and observations in rows, the data cloud is the set of points resulting from reading each row as a vector.
  • 72. VECTORS • Do you remember what a vector is? • Do you remember what the Euclidean norm of a vector is? • Do you remember the Cartesian and polar coordinates of a vector? • Do you remember what the scalar product between two vectors is?
  • 73. VECTOR SPACE • A vector space is the whole set of vectors spanned by the vector space basis. • This means that any vector x in the vector space can be expressed as a linear combination of basis elements (basis vectors), where the mk are scalars and the uk are the basis elements (vectors): • x = m1u1 + … + mnun • To span means that any vector can be generated by scaling the basis vectors and adding them.
  • 74. VECTOR PROJECTION • Imagine we have vectors p and v and want to project p onto v: • pv is the nearest point to p along the v direction. • <v,p> = |v| |p| cos(θ) • |pv| = |p| cos(θ) • <v,p> = |v| |pv|, so |pv| = <v,p>/|v| • pv = |pv| v/|v| • pv = (<v,p>/|v|) (v/|v|) = (<v,p>/<v,v>) v
  • 75. VECTOR PROJECTION
 >> o = [0 0 0];
 >> p0 = [2,1,2];
 >> p1 = [3 -1 1];
 >> vectarrow(o,p0);
 >> hold on;
 >> vectarrow(o,p1);
 >> p0_p1 = dot(p0,p1)/dot(p1,p1) * p1;   %projection of p0 onto p1
 >> vectarrow(o,p0_p1);
  • 76. VECTOR BASIS  Suppose we have a vector x expressed in two different basis B1 & B2  B1 = { u1 ,u2 ,u3 ,…, un } , B2 = { v1 ,v2 ,v3 ,…, vn }  xB1 = (m1, m2, …, mn)  xB2 = (n1, n2, …, nn)  x = m1u1 + …+ mnun  x = n1v1 + …+ nnvn  We can write ui vectors in B1 as linear combination of B2 vectors:  u1 = a11 v1 + a12 v2 + ... + a1n vn  u2 = a21 v1 + a22 v2 + ... + a2n vn  …
  • 77. VECTOR BASIS  So:  u1B2 = (a11, a12, a13, …, a1n)  unB2 = (an1, an2, an3, …, ann)  We can substitute these new vectors in x:  x = m1(a11 v1 + a12 v2 + ... + a1n vn) + …+ mn(an1 v1 + an2 v2 + ... + ann vn)  x = (m1a11 + m2a21 + ... + mnan1) v1 + …+ (m1a1n + m2a2n + ... + mnann) vn  By inspection we can see that:  n1 = m1a11 + m2a21 + ... + mnan1  n2 = m1a12 + m2a22 + ... + mnan2  …
  • 78. VECTOR BASIS • In matrix form: xB2 = A xB1 • But how do we calculate these aij numbers? By solving a linear system of equations. • If the basis vectors vj are orthogonal, projecting ui onto vj gives: aij = <ui, vj>/<vj, vj>, which for orthonormal vj reduces to <ui, vj> = uixvjx + uiyvjy (in 2-D).
  • 79. VECTOR BASIS: EASY EXAMPLE 1 • We have a basis B = {(2,1), (1,4)} and a vector x = (5,6) in canonical coordinates. We want to express x in the basis B. • x = (5,6), xB = (a,b) • (5,6) = a(2,1) + b(1,4)
 5 = 2a + b   (times −4:  −20 = −8a − 4b)
 6 = a + 4b
 Adding: −14 = −7a, so a = 2 and b = 1.
 So xB = (2,1).
  • 80. VECTOR BASIS: EASY EXAMPLE 2 • We have a vector x = (2,3). • B1: {(1,0), (0,1)}: x = 2(1,0) + 3(0,1) • B2: {(−1,0), (0,2)}: x = −2(−1,0) + 3/2 (0,2), so xB2 = (−2, 3/2) • B3: {(2,1), (−1,2)}: x = 7/5 (2,1) + 4/5 (−1,2), so xB3 = (7/5, 4/5)
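 The same change of basis can be checked numerically by solving the linear system; a minimal sketch using the basis from Easy Example 1:
 B  = [2 1; 1 4];    % columns are the basis vectors (2,1) and (1,4)
 x  = [5; 6];        % vector in canonical coordinates
 xB = B \ x          % coordinates of x in basis B -> [2; 1]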
  • 81. EIGENVECTORS & EIGENVALUES • Suppose A = [−1 0; 0 1] and v = (x1, x2). Then Av = (−x1, x2). • So Av is v reflected about the y axis.
  • 82. EIGENVECTORS & EIGENVALUES • We observe that: A(x1, 0) = −1·(x1, 0) and A(0, x2) = 1·(0, x2). • Thus, vectors on the coordinate axes are mapped to vectors on the coordinate axes. For those vectors there exists a scalar λ such that: Av = λv. • The direction of the vector remains invariant under the transformation A!
  • 83. EIGENVECTORS & EIGENVALUES • Example: suppose A = [2 3; 2 1]. Then (3, 2) is an eigenvector corresponding to eigenvalue λ = 4: • A(3, 2) = (2·3 + 3·2, 2·3 + 1·2) = (12, 8) = 4·(3, 2). • And (1, 3)? A(1, 3) = (11, 5). Is it an eigenvector?
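 This example can be verified with Matlab's eig (a quick illustrative check, not part of the original slides):
 A = [2 3; 2 1];
 [V, D] = eig(A);     % columns of V are eigenvectors, diag(D) the eigenvalues
 diag(D)              % eigenvalues: -1 and 4
 A*[3; 2] ./ [3; 2]   % both ratios equal 4, so (3,2) is an eigenvector
 A*[1; 3] ./ [1; 3]   % ratios differ (11 vs 5/3), so (1,3) is not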
  • 84. EIGENVECTORS & EIGENVALUES • To sum up with a picture: which one is the eigenvector? • Eigenvectors are vectors whose direction is invariant under the transformation matrix!
  • 85. OBJECTIVE FUNCTION • A function associated with an optimization problem which determines how good a solution is. • PCA as an optimization problem: we want the projection direction that maximizes the variance of the projected values, i.e. the direction along which most of the variance lies. • w1 = arg maxw { var(wxt) }, where w is a projection row vector and x is the data matrix with each row as a point. This is the same as maximizing over the angle: θ1 = arg maxθ { var( (cos θ, sin θ) xt ) } • w1 is the first principal component of our data cloud.
  • 86. DATA CLOUD PROJECTION • This figure shows the projection of the data cloud onto the direction w. The histogram of the projected values is plotted in blue. • Remember: xt is the data cloud, w the projection direction, wxt the projected cloud. So, which is the best w? • arg maxw { var(wxt) }
  • 87. DATA CLOUD PROJECTION • Let's do the same with several projection vectors. • We get different distributions for the projected data. • In red you can see the direction of maximum variance: w* = arg maxw { var(wxt) } • For the second component w2 we would maximize the same function subject to the restriction that w2 be orthogonal to w1.
  • 88. PCA: ALGORITHM • Remove the mean from the data. • Calculate the covariance matrix. • SVD decomposition: a way to get the eigenvalues & eigenvectors of the covariance matrix. • Select eigenvectors sorted by eigenvalue. • Project onto this new space to reduce dimensionality. • Project back to the original space. • Add the mean back to the data.
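 A minimal sketch of these steps on a 2-D data cloud, keeping only the first principal component; this mirrors the list above (using eig of the covariance matrix rather than SVD) and is not the exact denoise_pca function shown later:
 X  = randn(500,2) * [2 0; 1 0.3];          % illustrative correlated data cloud, rows = points
 mu = mean(X, 1);
 Xc = X - repmat(mu, size(X,1), 1);         % remove the mean
 C      = cov(Xc);                          % covariance matrix
 [V, D] = eig(C);                           % eigenvectors and eigenvalues
 [maxval, imax] = max(diag(D));             % index of the largest eigenvalue
 w      = V(:, imax);                       % first principal component
 scores = Xc * w;                           % project onto the reduced (1-D) space
 Xhat   = scores * w' + repmat(mu, size(X,1), 1);   % project back and add the mean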
  • 90. PCA & EEG • We will illustrate PCA by denoising EEG data. The algorithm shown here could be improved; it is a naive approximation meant to help understand PCA. • Each epoch/sweep can be seen as an N-dimensional vector, where N is the number of timepoints.
  • 91. PCA & EEG • All sweeps for my condition build a data cloud in N-space. [Figure: several sweeps forming the data cloud.]
  • 92. PCA & EEG • As we are now dealing with vectors and not with functions any more, we can apply a PCA method to them.
 >> [trials timeline] = gettrials( 'c:s2-b1.cnt', 2, 5, 0.30 );
 >> plot(trials(15,:));
 >> [dtrials F] = denoise_pca( trials );
 >> plot(dtrials(15,:),'r');
  • 93. PCA & EEG • We can also plot some principal components of this set of trials. [Figure: three principal components.]
  • 94. PCA & EEG
 function [depochs F] = denoise_pca( epochs )
     dim = size(epochs);
     %number of samples
     N = dim(1);
     %number of variables
     M = dim(2);
     C = cov( epochs );
     [V D] = eig( C );
     cpower = diag(D);
     %Reduce the dimensionality to numPC components (keep 85% of the power)
     tpower = sum(cpower) * 0.85;
     cpower = cpower(end:-1:1);
     cpower = cumsum(cpower);
     numPC = sum( cpower <= tpower )
     F = V(:,[M:-1:M-numPC+1]);
     depochs = (F * F' * epochs')';
 end
  • 95. III – MULTIVARIATE METHODS ICA
  • 96. INTRODUCTION • ICA: if we listen to someone speaking about ICA, what's it all about? • It's a mathematical model. • It's a set of algorithms (fastICA, infomax). • ICA is a method to decompose a set of signals into a set of statistically independent components (in time or space). • Mixing: x1 = a11 s1 + a12 s2, x2 = a21 s1 + a22 s2, with mixing matrix [a11 a12; a21 a22].
  • 97. INTRODUCTION • In matrix form: X = A S, where X holds the mixed signals, S the source signals and A is the mixing matrix. • It seems complex because we only know X, so we'll need to make some assumptions about S.
  • 98. INTRODUCTION: ICA IN EEG WORDS • From some EEG signals (xi), the problem is to guess the mixing matrix A and the components (si). • The model assumes no delay and no distortion through the environment.
  • 99. INTRODUCTION • To make blind source separation work we need to be aware of some hints, restrictions and pitfalls: • The source signals need to be non-Gaussian! We'll see why. At most one source can be Gaussian, because the sum of two Gaussian random variables is another Gaussian random variable. • The source strength cannot be estimated because of the ambiguity of the model, so sources are normalized to unit variance: • x1 = a11 s1 + a12 s2, but we can always write: x1 = (a11/λ)(λ s1) + a12 s2 • With numbers: 10 = 2·3 + 1·4; 10 = 1·6 + 1·4; 10 = 3·2 + 1·4
  • 100. INTRODUCTION • The implicit hypothesis when we apply ICA is that the source signals are mixed linearly: x = As.
 Mixing matrix A:   A = [1 -2; -1 1]
 Source signals s:  s = [4 -3 2; 9 4 1]
 Mixed signals x:   x = As = [-14 -11 0; 5 7 -1]
 • In the real world we will try to estimate the unknown mixing matrix A knowing only x.
  • 101. INDEPENDENCE & UNCORRELATION • But what is independence? And uncorrelation? • Independence is the property that independent events have in a probability framework. But… what are independent events? To answer this question another mathematical concept must be introduced: conditional probability. • The conditional probability of an event is the probability of that event given that another event has been observed: p(A|B) = p(A,B)/p(B). • A and B are independent when p(A|B) = p(A), i.e. when p(A,B) = p(A)p(B).
  • 102. INDEPENDENCE & UNCORRELATION • Let's have a look at the following data cloud and remember the expressions that hold when X and Y are independent: P(x|y) = P(x), P(x,y) = P(x)P(y). • This means that the probability distribution, or "shape", of x does not depend on the given y. • The joint probability factorizes into the marginal probabilities.
  • 103. CONDITIONAL PROBABILITY • Given two events, we can define the probability of B given that A has been observed, written p(B|A). • p(B|A) = p(A,B)/p(A) • That is, the conditional probability is the joint probability of both events divided by the probability of event A. • In general p(B|A) ≠ p(A|B).
  • 104. CONDITIONAL PROBABILITY • Let's do a simple example with a die. • S = {1,2,3,4,5,6}; A = {1,4}; B = {2,4,6} (even number) • Every time we roll the die and the outcome 1 or 4 is observed, we say that event A has occurred. • Every time we roll the die and the outcome 2, 4 or 6 is observed, we say that event B has occurred. • Assuming a fair die, p(i) = 1/6: • p(A) = 1/3 • p(B) = 1/2 • p(B|A) = p(B,A)/p(A) = (1/6)/(1/3) = 1/2 • p(A|B) = p(B,A)/p(B) = (1/6)/(1/2) = 1/3
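 A quick simulation check of these numbers (a sketch, using unidrnd from the Statistics Toolbox; not part of the original slides):
 N = 100000;
 rolls = unidrnd(6, 1, N);          % fair die rolls
 A = ismember(rolls, [1 4]);
 B = (mod(rolls, 2) == 0);          % even outcome
 pA = mean(A)                       % close to 1/3
 pB = mean(B)                       % close to 1/2
 pB_given_A = sum(A & B) / sum(A)   % close to 1/2
 pA_given_B = sum(A & B) / sum(B)   % close to 1/3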
  • 105. INDEPENDENCE WITH 4 NUMBERS • Let's do an exercise on statistical independence with just 4 numbers! • Given the following contingency table:
 P(x,y)   y1    y2   | P(x)
 x1       0.1   0.1  | 0.2
 x2       0.5   0.3  | 0.8
 P(y)     0.6   0.4  |
 • Are x and y independent? We have to check that p(x,y) = p(x)p(y) for all x, y. • The very first check fails: p(x1,y1) ≠ p(x1)p(y1), since 0.1 ≠ 0.12.
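 The same check written out in Matlab (a small sketch; the joint table is the one above):
 Pxy = [0.1 0.1; 0.5 0.3];   % joint table p(x,y), rows x1,x2, columns y1,y2
 Px  = sum(Pxy, 2);          % marginal p(x) = [0.2; 0.8]
 Py  = sum(Pxy, 1);          % marginal p(y) = [0.6 0.4]
 Px * Py                     % product of marginals: [0.12 0.08; 0.48 0.32]
 Pxy                         % joint: 0.1 ~= 0.12, so x and y are not independent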
  • 106. INDEPENDENCE WITH 4 NUMBERS • Another way to think about it: • p(y|x=x1) = {1/2, 1/2} • p(y|x=x2) = {5/8, 3/8} • p(y) = {6/10, 4/10} • As we can see, p(y|x1) ≠ p(y|x2) ≠ p(y), so x and y are not independent, because the probability distribution of y depends on x. • Any questions?
  • 107. INDEPENDENCE, UNCORRELATION & DATA CLOUDS • A simple example of independent source signals and the result of the mixing process: • Imagine an independent random data cloud (in R²) with a uniform distribution for both variables.
 %Create two random uniformly distributed samples
 >> s = 99*rand(2,1000);
 >> plot(s(1,:),s(2,:), '.');
  • 108. INDEPENDENCE, UNCORRELATION & DATA CLOUDS • Bivariate scatterplot from two independent normal distributions.
 %Create two random normally distributed samples
 >> s = 99*randn(2,1000);
 >> plot(s(1,:),s(2,:), '.');
 • Can you see any direction along which the fitting error gets minimized? All possible directions give the same error variance.
  • 109. INDEPENDENCE, UNCORRELATION & DATA CLOUDS • Bivariate scatterplot from two independent exponential distributions.
 %Exponential independent random samples
 >> s = 99*exprnd(1,2,1000);
 >> plot(s(1,:),s(2,:), '.');
 • All cuts along the y axis are scaled versions of the same distribution: p(x,y) = p(x)p(y).
  • 110. INDEPENDENCE, UNCORRELATION & DATA CLOUDS • In the last figures, can you see any unique direction along which you could draw a line? No, because all of them are independent, and therefore uncorrelated. • The first signal in each plot has nothing to do with the second one.
  • 111. INDEPENDENCE, UNCORRELATION & DATA CLOUDS • Let's see the scatterplot of a dependent joint probability distribution:
 s1 = 99*rand(1,npoints);
 %s2 values are dependent on s1 values!
 for i=1:npoints
     if(s1(i) > 45)
         s2(i) = 99*normrnd(0,1);
     else
         s2(i) = 99*exprnd(1);
     end
 end
 s = [s1; s2];
 >> corrcoef(s(1,:), s(2,:) )
 ans = 1.0000  -0.3666
      -0.3666   1.0000
 • Although it may seem to fit a linear model, the truth behind the scatterplot is that it does not.
  • 112. INDEPENDENCE, UNCORRELATION & DATA CLOUDS • So p(x,y) ≠ p(x)p(y), because the distribution of y depends on x and vice versa. • We get two different shapes, exponential and Gaussian, with these two cuts: P(x=20, y) ~ Exponential, P(x=80, y) ~ Normal.
  • 113. INDEPENDENCE, UNCORRELATION & DATA CLOUDS • Mix these two independent signals with a mixing matrix, subtract the mean and plot.
 %Create two mixed random uniformly distributed samples
 >> s = 99*rand(2,1000);
 >> A = [0.54 -0.84;0.12 -0.27];
 >> x = A*s;
 >> x_m = repmat( mean(x,2), 1, npoints );
 >> x = x - x_m;
 • Can you see any interesting direction in this data cloud?
  • 114. INDEPENDENCE, UNCORRELATION & DATA CLOUDS • Some points to notice: • A maximum-variance direction can be seen in the mixed data. • The marginal histograms have changed: they are more Gaussian! Three letters should come to mind: CLT. • We can see the mixing row vectors as the edges of the mixed data cloud. • The data lost uncorrelation and independence.
  • 115. INDEPENDENCE, UNCORRELATION & DATA CLOUDS • CLT review: imagine we have two dice and we sum their outcomes; we get the following distribution:
 >> dicesums = perm(1:6,1:6);
 >> dicesums = reshape( dicesums, 1,prod(size(dicesums)));
 >> [n c] = hist( dicesums,2:12 );
 >> n = n ./sum(n);
 >> stem(2:12,n);
 • The sum becomes more Gaussian!
  • 116. INDEPENDENCE, UNCORRELATION & DATA CLOUDS • What would happen if the rows of the mixing matrix were orthogonal? • As we can see, the data remains uncorrelated but independence is lost.
 >> corrcoef(x')
 ans = 1.0000  -0.0129
      -0.0129   1.0000
  • 117. INDEPENDENCE, UNCORRELATION & DATA CLOUDS • What we will do now is whiten, or sphere, our data. The goals of this process are: • Uncorrelate the data. • Scale the variances of all variables to unit variance. • [Figure: independent → mixing → correlated → whitening → uncorrelated & unit variance.]
  • 118. WHITENING OPERATION • Whitening is the mathematical operation by which the data becomes uncorrelated with unit variance. • This operation is also called sphering. • So, after the transformation the covariance matrix becomes the identity matrix.
  • 119. WHITENING OPERATION • We can see the whitening operation as a linear transformation T that makes the data uncorrelated: • x' = Tx, with E[x'x't] = I • E[Txxt Tt] = T E[xxt] Tt = I, so T = E[xxt]−1/2, where E[xxt] = cov(x) when the data is zero mean. • The whitening operation is defined to be: x' = E[xxt]−1/2 x
  • 120. WHITENING OPERATION • Whitening operation on 10-channel EEG data. • Covariance matrix before and after whitening.
 %Covariance images
 >> addpath functionsprob
 >> cnt = ldcntb('c:s2-b1.cnt');
 >> C = cov(cnt.dat);
 >> imagesc( C );
 >> wx = whiten(cnt.dat');
 >> C = cov( wx' );
 >> imagesc( C );
 function wx = whiten( x )
     npoints = length(x);
     x_m = repmat( mean(x,2), 1, npoints );
     x = x - x_m;
     C = cov(x');
     wx = inv(sqrtm(C))*x;
 end
  • 121. INDEPENDENCE, UNCORRELATION & DATA CLOUDS • After the sphering/whitening procedure applied to the mixed signals we are halfway to getting the independent components. We just need to rotate the data cloud. But… rotate up to what point? Until we reach the maximum of a certain cost function; in ICA's case, non-Gaussianity! • Minimizing Gaussianity = maximizing non-Gaussianity. • The marginal probabilities will move away from the normal distribution. We are walking the CLT backwards.
  • 122. INDEPENDENCE, UNCORRELATION & DATA CLOUDS • Now we will project the whitened data cloud onto many different directions and calculate the kurtosis of the projected points.
 %Kurtosis for different projection angles.
 >> wx = whiten( x );
 >> plot(wx(1,:),wx(2,:), '.');
 >> f = costfunction( @kurt, wx );
 >> plot(f)
 • [Figure: the kurtosis curve peaks at projection angles of about 45 and 135 degrees.]
  • 123. DATA PROJECTION REVIEW • Histogram based on the projected data points for projection angle 0. • So, we can calculate any statistic on this distribution.
  • 124. INDEPENDENCE, UNCORRELATION & DATA CLOUDS
 function cf = costfunction( f, data )
     %1 degree in radians.
     alpha = pi/180;
     %Number of datapoints
     npoints = length(data);
     %Rotation matrix
     R = [cos(alpha) -sin(alpha); sin(alpha) cos(alpha)];
     cf = zeros(1,180);
     %Initial projection vector
     pvector = [1;0];
     for i=1:180
         %Projection vector gets rotated
         pvector = R*pvector;
         %Project data onto projection vector
         pdata = dot( repmat(pvector,1,npoints) , data);
         %Calculate statistic and store value
         cf(i) = f( pdata );
     end
 end
  • 125. INDEPENDENCE, UNCORRELATION & DATA CLOUDS • If we repeat the procedure for a different sample we get different shapes. One thing to notice is that although the cost functions have different shapes, the two maxima are located at the same angles!
  • 126. INDEPENDENCE, UNCORRELATION, DATA CLOUDS & BOOTSTRAP • Let's bootstrap the whole process to get an insight into how it behaves for different resamples: • 500 kurtosis curves calculated from resamples, for projection angles from 0 to 180 degrees. [Figure: bootstrap mean and variance of the kurtosis curve.]
  • 127. INDEPENDENCE, UNCORRELATION & DATA CLOUDS • With the same procedure we can check why ICA doesn't like Gaussian data: the problem comes when trying to calculate the rotation angle for Gaussian data. There is no structure in the data.
  • 128. INDEPENDENCE, UNCORRELATION, DATA CLOUDS & BOOTSTRAP • Bootstrap of the kurtosis curve for the bivariate normal data cloud. [Figure: bootstrap mean and variance of the kurtosis curve.]
  • 129. INDEPENDENCE, UNCORRELATION & DATA CLOUDS • The whole process with exponential (s1) and Gaussian (s2) data: mixing → whitening → rotation. • The sign and order of the sources cannot be recovered!
  • 130. ICA & PCA WITH IMAGES • PCA is not able to recover independence. [Figure: non-Gaussian image sources.]
  • 131. EXAMPLE WITH EEG SIGNALS • Up to now we fixed the mixing matrix and the source signals, and with the forward model (x = As) we got the mixed signals. Now we face the inverse problem: from two EEG signals, the problem is to guess the mixing matrix A. • Let's have a look at the scatterplot.
 %Channels 2 & 3
 cnt = ldcntb( 'c:s2-b1.cnt' );
 s1 = cnt.dat(1:4000,2);
 s2 = cnt.dat(1:4000,3);
 s = [s1 s2]';
  • 132. EXAMPLE WITH EEG SIGNALS • It seems that these two channels are correlated. • The next step is to whiten the data.
  • 133. EXAMPLE WITH EEG SIGNALS • Whitened signals.
  • 134. EXAMPLE WITH EEG SIGNALS • Rotate to get the maximum value of kurtosis.