A short overview of distance and statistical distance, which are at the core of multivariate analysis. Here you will find some simple concepts about distances and statistical distance.
2. Md. Menhazul Abedin
M.Sc. Student
Dept. of Statistics
Rajshahi University
Mob: 01751385142
Email: menhaz70@gmail.com
3. Objectives
• To know the meaning of statistical distance, and its relation to and difference from the general (Euclidean) distance
4. Content
• Definition of Euclidean distance
• Concept & intuition of statistical distance
• Definition of statistical distance
• Necessity of statistical distance
• Concept of Mahalanobis distance (population & sample)
• Distribution of Mahalanobis distance
• Mahalanobis distance in R
• Acknowledgement
9. We see two specific points in each picture.
Our problem is to determine the distance between the two points.
But how?
Assume that the pictures lie in a two-dimensional space and that the points are joined by a straight line.
10. Let the 1st point be (x1, y1) and the 2nd point be (x2, y2). Then the distance is
D = √((x1 − x2)² + (y1 − y2)²)
What happens when the dimension is three?
12. For points (x1, x2, x3) and (y1, y2, y3) the distance is given by
√((x1 − y1)² + (x2 − y2)² + (x3 − y3)²)
13. For n dimensions it can be written as the following expression, named the Euclidean distance:
P = (x1, x2, …, xp), Q = (y1, y2, …, yp)
d(P, Q) = √((x1 − y1)² + (x2 − y2)² + … + (xp − yp)²)
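As a quick check of the formula above, the Euclidean distance can be computed in R either directly or with the built-in dist() function (a small sketch; the points P and Q are made-up examples):

```r
## Euclidean distance between two points (illustrative values)
P <- c(1, 2, 3)
Q <- c(4, 6, 3)

d.manual  <- sqrt(sum((P - Q)^2))           ## direct use of the formula
d.builtin <- as.numeric(dist(rbind(P, Q)))  ## built-in dist() on the stacked points

d.manual   ## 5
d.builtin  ## 5
```

Both give √(3² + 4² + 0²) = 5, so the manual formula and dist() agree.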
14. 12/12/2016
Properties of Euclidean Distance and Mathematical Distance
• The usual human concept of distance is Euclidean distance.
• Each coordinate contributes equally to the distance.
For P = (x1, x2, …, xp) and Q = (y1, y2, …, yp):
d(P, Q) = √((x1 − y1)² + (x2 − y2)² + … + (xp − yp)²)
Mathematicians, generalizing its three properties, define distance on any set:
1) d(P,Q) = d(Q,P),
2) d(P,Q) = 0 if and only if P = Q, and
3) d(P,Q) ≤ d(P,R) + d(R,Q) for all R.
17. • The Manhattan distance is the simple sum of the horizontal and vertical components, whereas the diagonal (straight-line) distance is computed by applying the Pythagorean theorem.
19. • Manhattan distance: 12 units
• Diagonal or straight-line (Euclidean) distance: √(6² + 6²) = 6√2 ≈ 8.49 units
We observe that the Euclidean distance is less than the Manhattan distance.
23. Relationship between Manhattan & Euclidean distance
• It now seems that the distance from A to C is 7 blocks, while the distance from A to B is 6 blocks.
• Unless we choose to go off-road, B is now closer to A than C.
• Taxicab distance is sometimes equal to Euclidean distance, but otherwise it is greater than Euclidean distance.
Euclidean distance ≤ Taxicab distance
Is it always true? Does it hold in n dimensions?
27. For high dimensions
• The inequality also holds in the high-dimensional case:
Σ (xi − yi)² ≤ Σ (xi − yi)² + 2 Σ_{i<j} |xi − yi||xj − yj| = (Σ |xi − yi|)²
which implies
√(Σ (xi − yi)²) ≤ Σ |xi − yi|
i.e. d_E ≤ d_M.
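The inequality d_E ≤ d_M can be verified numerically in R on random vectors (a sketch; the dimension and the points are arbitrary choices):

```r
set.seed(1)
p <- 10
x <- rnorm(p)
y <- rnorm(p)

d.E <- sqrt(sum((x - y)^2))  ## Euclidean distance
d.M <- sum(abs(x - y))       ## Manhattan (taxicab) distance

d.E <= d.M                   ## TRUE
```

Equality occurs only when at most one coordinate differs; otherwise the Euclidean distance is strictly smaller.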
28. Statistical Distance
• Weight coordinates subject to a great deal of variability less heavily than those that are not highly variable.
(Figure: two points at the same Euclidean distance from the origin; which one is nearer to the data set?)
29. • Here, variability along the x1 axis > variability along the x2 axis.
Is the same distance from the origin meaningful?
Ans: No.
But how do we take the different variability into account?
Ans: Give different weights to the axes.
30. Statistical Distance for Uncorrelated Data
P = (x1, x2), O = (0, 0)
Standardization: x1* = x1/√s11, x2* = x2/√s22
d(O, P) = √(x1*² + x2*²) = √(x1²/s11 + x2²/s22)
The 1/sii act as weights.
31. All points that have coordinates (x1, x2) and are a constant squared distance c² from the origin must satisfy
x1²/s11 + x2²/s22 = c²
But how to choose c? That is a problem.
Choose c so that 95% of the observations fall in this area.
If s11 > s22, then 1/s11 < 1/s22, so the ellipse extends farther along the x1 axis.
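The weighted distance above can be sketched in R for uncorrelated data with unequal variances; the data and the point P are made up for illustration:

```r
set.seed(2)
x1 <- rnorm(200, sd = 4)   ## coordinate with large variability
x2 <- rnorm(200, sd = 1)   ## coordinate with small variability
s11 <- var(x1)             ## sample variance of x1
s22 <- var(x2)             ## sample variance of x2

P <- c(4, 1)               ## an arbitrary point
d.stat <- sqrt(P[1]^2 / s11 + P[2]^2 / s22)  ## statistical distance from the origin
d.eucl <- sqrt(sum(P^2))                     ## unweighted Euclidean distance
```

Because s11 is large, the first coordinate is down-weighted and d.stat comes out smaller than d.eucl; every point on the ellipse x1²/s11 + x2²/s22 = c² has the same statistical distance c from the origin.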
33. • This expression can be generalized as the statistical distance from an arbitrary point P = (x1, x2) to any fixed point Q = (y1, y2):
d(P, Q) = √((x1 − y1)²/s11 + (x2 − y2)²/s22)
For p dimensions:
d(P, Q) = √((x1 − y1)²/s11 + (x2 − y2)²/s22 + … + (xp − yp)²/spp)
34. Remarks:
1) The distance from P to the origin O is obtained by setting all yi = 0.
2) If all the sii are equal, the Euclidean distance formula is appropriate.
36. • How do you measure the statistical distance of the above data set?
• Ans: First make it uncorrelated.
• But why, and how?
• Ans: Rotate the axes keeping the origin fixed.
41. Choice of θ
• Which θ will you choose?
• How will you do it?
• Data matrix → centered data matrix → covariance matrix of the data → eigenvectors
• θ = angle between the 1st eigenvector and [1,0], or the angle between the 2nd eigenvector and [0,1]
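The chain above (data matrix → centered matrix → covariance → eigenvectors → angle) can be sketched in R; the toy data here are made up for illustration:

```r
set.seed(3)
x <- rnorm(100)
y <- 0.8 * x + rnorm(100, sd = 0.4)                 ## correlated toy data
data  <- cbind(x, y)
cdata <- scale(data, center = TRUE, scale = FALSE)  ## centered data matrix

V  <- eigen(cov(cdata))$vectors   ## columns are the eigenvectors
e1 <- V[, 1]                      ## 1st eigenvector (major-axis direction)

theta <- atan2(e1[2], e1[1])      ## angle between e1 and [1, 0]
theta * 180 / pi                  ## the same angle in degrees
```

Rotating the centered data by this angle (i.e. multiplying by V) aligns the axes with the directions of greatest and least variability, making the coordinates uncorrelated.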
42. Why is θ the angle between the 1st eigenvector and [1,0], or between the 2nd eigenvector and [0,1]?
Ans: Let B be a (p × p) positive definite matrix with eigenvalues λ1 ≥ λ2 ≥ λ3 ≥ … ≥ λp > 0 and associated normalized eigenvectors e1, e2, …, ep. Then
max_{x≠0} x′Bx / x′x = λ1, attained when x = e1
min_{x≠0} x′Bx / x′x = λp, attained when x = ep
44. Choice of θ
#### Exercise 16, page 309: heights in inches (x) & weights in pounds (y). An Introduction to Statistics and Probability, M. Nurul Islam ####
x=c(60,60,60,60,62,62,62,64,64,64,66,66,66,66,68,68,68,70,70,70);x
y=c(115,120,130,125,130,140,120,135,130,145,135,170,140,155,150,160,175,180,160,175);y
############
data=cbind(x,y)                            ## data matrix
cdata=scale(data,center=TRUE,scale=FALSE)  ## centered data matrix
V=eigen(cov(cdata))$vectors;V              ## eigenvectors of the covariance matrix
as.matrix(cdata)%*%V                       ## rotated (uncorrelated) data
plot(x,y)
49. • ############ comparison of both methods ############
comparison=tdata - as.matrix(cbind(xx,yy));comparison
round(comparison,4)
50. ########### using package: md from original data ###########
md=mahalanobis(data,colMeans(data),cov(data),inverted=F);md   ## md = Mahalanobis distance
######## Mahalanobis distance from transformed data ########
tmd=mahalanobis(tdata,colMeans(tdata),cov(tdata),inverted=F);tmd
###### comparison ######
md-tmd
51. Mahalanobis distance: manually
mu=colMeans(tdata);mu
incov=solve(cov(tdata));incov
md1=t(tdata[1,]-mu)%*%incov%*%(tdata[1,]-mu);md1
md2=t(tdata[2,]-mu)%*%incov%*%(tdata[2,]-mu);md2
md3=t(tdata[3,]-mu)%*%incov%*%(tdata[3,]-mu);md3
............. …………. ………..
md20=t(tdata[20,]-mu)%*%incov%*%(tdata[20,]-mu);md20
The md values from the package and from the manual computation are equal.
59. • The above distances are completely determined by the coefficients (weights) a_ik; i, k = 1, 2, 3, …, p. These can be arranged in a rectangular array; this array (matrix) must be symmetric and positive definite.
60. Why positive definite?
Let A be a positive definite matrix. Then A = C′C, so
x′Ax = x′C′Cx = (Cx)′(Cx) = y′y ≥ 0, where y = Cx.
It obeys all the distance properties, so x′Ax defines a squared distance.
Different choices of A give different distances.
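The factorization A = C′C can be checked numerically in R with chol(); this is a small sketch using a made-up positive definite matrix:

```r
A <- matrix(c(2, 1,
              1, 2), nrow = 2)    ## symmetric positive definite matrix
C <- chol(A)                      ## upper-triangular C with A = C'C
x <- c(3, -1)

q1 <- drop(t(x) %*% A %*% x)      ## quadratic form x'Ax
y  <- C %*% x
q2 <- sum(y^2)                    ## (Cx)'(Cx) = y'y, same value

q1; q2                            ## both equal 14, and always >= 0
```

Since x′Ax equals an ordinary squared Euclidean length y′y of the transformed vector y = Cx, it inherits the distance properties listed above.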
61. • Why a positive definite matrix?
• Ans: Spectral decomposition: the spectral decomposition of a k×k symmetric matrix A is given by
A = Σ_{i=1}^{k} λi ei ei′
• where (λi, ei); i = 1, 2, …, k are the pairs of eigenvalues and eigenvectors, with λ1 ≥ λ2 ≥ λ3 ≥ …. If A is positive definite, each λi > 0 and A is invertible.
66. • Consider the Euclidean distances from the point Q to the point P and to the origin O.
• Obviously d(Q,P) > d(Q,O).
• But P appears to be more like the points in the cluster than does the origin.
• If we take into account the variability of the points in the cluster and measure distance by statistical distance, then Q will be closer to P than to O.
67. Mahalanobis distance
• The Mahalanobis distance is a descriptive statistic that provides a relative measure of a data point's distance from a common point. It is a unitless measure introduced by P. C. Mahalanobis in 1936.
68. Intuition of Mahalanobis Distance
• Recall the equation
d(O,P) = √(x′Ax)  =>  d²(O,P) = x′Ax
where x = (x1, x2)′ and A = [a11 a12; a21 a22].
71. Mahalanobis Distance
• Mahalanobis used the inverse of the covariance matrix, Σ⁻¹, instead of A.
• Thus d²(O,P) = x′Σ⁻¹x ……….(1)
• And he used μ (the center of gravity) instead of y:
d²(P,μ) = (x − μ)′Σ⁻¹(x − μ) ……….(2)
72. Mahalanobis Distance
• The above equations are nothing but the Mahalanobis distance.
• For example, suppose we took a single observation from a bivariate population with variable X and variable Y, and that our two variables had the following characteristics.
73. • For a single observation with X = 410 and Y = 400, the Mahalanobis distance for that single value is computed as:
75. • Therefore, our single observation would have a distance of 1.825 standardized units from the mean (the mean is at X = 500, Y = 500).
• If we took many such observations, graphed them, and colored them according to their Mahalanobis values, we would see the elliptical Mahalanobis regions come out.
76. โข The points are actually distributed along two
primary axes:
77.
78. If we calculate Mahalanobis distances for each
of these points and shade them according to
their distance value, we see clear elliptical
patterns emerge:
79.
80. • We can also draw actual ellipses at regions of constant Mahalanobis values:
(Figure: ellipses containing 68%, 95%, and 99.7% of the observations)
81. • Which ellipse do you choose?
Ans: Use the 68-95-99.7 rule:
1) about two-thirds (68%) of the points should be within 1 unit of the origin (along each axis),
2) about 95% should be within 2 units,
3) about 99.7% should be within 3 units.
83. Sample Mahalanobis Distance
• The sample Mahalanobis distance is obtained by replacing Σ by S and μ by X̄,
• i.e. (X − X̄)′S⁻¹(X − X̄)
84. For a sample:
(X − X̄)′S⁻¹(X − X̄) ≤ χ²_p(α)
Distribution of Mahalanobis distance
85. Distribution of Mahalanobis distance
Let X1, X2, X3, …, Xn be independent observations from any population with mean μ and finite (nonsingular) covariance Σ. Then
• √n (X̄ − μ) is approximately N_p(0, Σ), and
• n (X̄ − μ)′S⁻¹(X̄ − μ) is approximately χ²_p for n − p large.
This is nothing but the central limit theorem.
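The χ² approximation can be checked by simulation in R: for multivariate normal data, the squared distances D² = (X − X̄)′S⁻¹(X − X̄) follow approximately a χ²_p distribution (a sketch with simulated data; n, p, and the seed are arbitrary):

```r
set.seed(4)
n <- 1000; p <- 3
X <- matrix(rnorm(n * p), nrow = n)        ## sample from N_p(0, I)

d2 <- mahalanobis(X, colMeans(X), cov(X))  ## squared Mahalanobis distances

## about 95% of the d2 values should fall below the chi-square 95% quantile
mean(d2 < qchisq(0.95, df = p))
```

The observed fraction should be close to 0.95; this is also the basis for flagging multivariate outliers by comparing D² with χ²_p quantiles.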
86. Mahalanobis distance in R
• ########### Mahalanobis Distance ###########
• x=rnorm(100);x
• dm=matrix(x,nrow=20,ncol=5,byrow=F);dm  ## dm = data matrix
• cm=colMeans(dm);cm                      ## cm = column means
• cov=cov(dm);cov                         ## cov = covariance matrix
• incov=solve(cov);incov                  ## incov = inverse of covariance matrix
87. Mahalanobis distance in R
• ####### MAHALANOBIS DISTANCE: MANUALLY #######
• @@@ Mahalanobis distance of first observation @@@
• ob1=dm[1,];ob1                ## first observation
• mv1=ob1-cm;mv1                ## deviation of first observation from the center of gravity
• md1=t(mv1)%*%incov%*%mv1;md1  ## Mahalanobis distance of first observation from the center of gravity
88. Mahalanobis distance in R
• @@@ Mahalanobis distance of second observation @@@
• ob2=dm[2,];ob2                ## second observation
• mv2=ob2-cm;mv2                ## deviation of second observation from the center of gravity
• md2=t(mv2)%*%incov%*%mv2;md2  ## Mahalanobis distance of second observation from the center of gravity
................ ………………… …..…………………
89. Mahalanobis distance in R
…………....... ………………… ………………
@@@ Mahalanobis distance of 20th observation @@@
• ob20=dm[20,];ob20                 ## 20th observation
• mv20=ob20-cm;mv20                 ## deviation of 20th observation from the center of gravity
• md20=t(mv20)%*%incov%*%mv20;md20  ## Mahalanobis distance of 20th observation from the center of gravity
90. Mahalanobis distance in R
####### MAHALANOBIS DISTANCE: PACKAGE #######
• md=mahalanobis(dm,cm,cov,inverted=F);md  ## md = Mahalanobis distance
• md=mahalanobis(dm,cm,cov);md
91. Another example
โข x <- matrix(rnorm(100*3), ncol = 3)
โข Sx <- cov(x)
โข D2 <- mahalanobis(x, colMeans(x), Sx)