2. Overview:
• What is Principal Component Analysis
• Computing the compnents in PCA
• Dimensionality Reduction using PCA
• A 2D example in PCA
• Applications of PCA in computer vision
• Importance of PCA in analysing data in higher dimensions
• Questions
3. Feature reduction vs feature selection
• Feature reduction
– All original features are used
– The transformed features are linear combinations
of the original features.
• Feature selection
– Only a subset of the original features are used.
4. Why feature reduction?
• Most machine learning and data mining
techniques may not be effective for high-
dimensional data
– Curse of Dimensionality
– Query accuracy and efficiency degrade rapidly as the
dimension increases.
• The intrinsic dimension may be small.
– For example, the number of genes responsible for a
certain type of disease may be small.
5. Why feature reduction?
• Visualization: projection of high-dimensional data
onto 2D or 3D.
• Data compression: efficient storage and retrieval.
• Noise removal: positive effect on query accuracy.
6. Applications of feature reduction
• Face recognition
• Handwritten digit recognition
• Text mining
• Image retrieval
• Microarray data analysis
• Protein classification
7. • We study phenomena that can not be directly observed
– ego, personality, intelligence in psychology
– Underlying factors that govern the observed data
• We want to identify and operate with underlying latent
factors rather than the observed data
– E.g. topics in news articles
– Transcription factors in genomics
• We want to discover and exploit hidden relationships
– “beautiful car” and “gorgeous automobile” are closely related
– So are “driver” and “automobile”
– But does your search engine know this?
– Reduces noise and error in results
Why Factor or Component Analysis?
8. • We have too many observations and dimensions
– To reason about or obtain insights from
– To visualize
– Too much noise in the data
– Need to “reduce” them to a smaller set of factors
– Better representation of data without losing much information
– Can build more effective data analyses on the reduced-
dimensional space: classification, clustering, pattern recognition
• Combinations of observed variables may be more
effective bases for insights, even if physical meaning is
obscure
Why Factor or Component Analysis?
9. • Discover a new set of factors/dimensions/axes against
which to represent, describe or evaluate the data
– For more effective reasoning, insights, or better visualization
– Reduce noise in the data
– Typically a smaller set of factors: dimension reduction
– Better representation of data without losing much information
– Can build more effective data analyses on the reduced-
dimensional space: classification, clustering, pattern recognition
• Factors are combinations of observed variables
– May be more effective bases for insights, even if physical
meaning is obscure
– Observed data are described in terms of these factors rather than
in terms of original variables/dimensions
Factor or Component Analysis
10. Basic Concept
Areas of variance in data are where items can be best discriminated and key
underlying phenomena observed
Areas of greatest “signal” in the data
If two items or dimensions are highly correlated or dependent
They are likely to represent highly related phenomena
If they tell us about the same underlying variance in the data, combining them to
form a single measure is reasonable
Parsimony
Reduction in Error
So we want to combine related variables, and focus on uncorrelated or
independent ones, especially those along which the observations have high
variance
We want a smaller set of variables that explain most of the variance in the
original data, in more compact and insightful form
11. Basic Concept
What if the dependences and correlations are not so strong or
direct?
And suppose you have 3 variables, or 4, or 5, or 10000?
Look for the phenomena underlying the observed covariance/co-
dependence in a set of variables
Once again, phenomena that are uncorrelated or independent, and
especially those along which the data show high variance
These phenomena are called “factors” or “principal components”
or “independent components,” depending on the methods used
Factor analysis: based on variance/covariance/correlation
Independent Component Analysis: based on independence
12. Principal Component Analysis
Most common form of factor analysis
The new variables/dimensions
Are linear combinations of the original ones
Are uncorrelated with one another
Orthogonal in original dimension space
Capture as much of the original variance in the data as
possible
Are called Principal Components
13. Principal Component Analysis
•Most common form of factor analysis
•The new variables/dimensions
–Are linear combinations of the original ones
–Are uncorrelated with one another
•Orthogonal in original dimension space
–Capture as much of the original variance in the data as possible
–Are called Principal Components
14. What are the new axes?
• Orthogonal directions of greatest variance in data
• Projections along PC1 discriminate the data most along any one axis
15. Principal Components
•First principal component is the direction of greatest
variability (covariance) in the data
• Second is the next orthogonal (uncorrelated) direction of greatest variability
– So first remove all the variability along the first
component, and then find the next direction of
greatest variability
• And so on …
16. Principal Components Analysis (PCA)
• Principle
– Linear projection method to reduce the number of parameters
– Transfer a set of correlated variables into a new set of uncorrelated
variables
– Map the data into a space of lower dimensionality
– Form of unsupervised learning
• Properties
–It can be viewed as a rotation of the existing axes to new positions in the
space defined by original variables
–New axes are orthogonal and represent the directions with maximum
variability
17. Computing the Components
• Data points are vectors in a multidimensional space
• Projection of vector x onto an axis (dimension) u is u.x
•Direction of greatest variability is that in which the average square of the
projection is greatest
– I.e. u such that E((u.x)2) over all x is maximized
–(we subtract the mean along each dimension, and center the original axis
system at the centroid of all data points, for simplicity)
– This direction of u is the direction of the first Principal Component
18. Computing the Components
• E((u.x)2) = E ((u.x) (u.x)T) = E (u.x.x T.uT)
•The matrix S = x.xT contains the correlations (similarities) of the original axes based
on how the data values project onto them
• So we are looking for w that maximizes uSuT, subject to u being unit-length
• It is maximized when w is the principal eigenvector of the matrix S, in which case
–uCuT = uλuT = λ if u is unit-length, where λ is the principal eigenvalue of the
correlation matrix C
– The eigenvalue denotes the amount of variability captured along that dimension
19. Why the Eigenvectors?
Maximise uTxxTu s.t uTu = 1
Construct Langrangian uTxxTu – λuTu
Vector of partial derivatives set to zero
xxTu – λu = (xxT – λI) u = 0
As u ≠ 0 then u must be an eigenvector of xxT with eigenvalue λ
20. Computing the Components
• Similarly for the next axis, etc.
•So, the new axes are the eigenvectors of the matrix of correlations of the
original variables, which captures the similarities of the original variables
based on how data samples project to them
• Geometrically: centering followed by rotation
• – Linear transformation
21. PCs, Variance and Least-Squares
• The first PC retains the greatest amount of variation in the sample
• The kth PC retains the kth greatest fraction of the variation in the sample
•The kth largest eigenvalue of the correlation matrix C is the variance in the
sample along the kth PC
•The least-squares view: PCs are a series of linear least
squares fits to a sample, each orthogonal to all previous ones
22. How Many PCs?
•For n original dimensions, correlation matrix is
nxn, and has up to n eigenvectors. So n PCs.
•Where does dimensionality reduction come
from?
23. Dimensionality Reduction
Can ignore the components of lesser significance.
You do lose some information, but if the eigenvalues are small, you don’t lose
much
– n dimensions in original data
– calculate n eigenvectors and eigenvalues
– choose only the first p eigenvectors, based on their eigenvalues
– final data set has only p dimensions
25. PCA Example –STEP 1
• Subtract the mean
from each of the data dimensions. All the x values
have x subtracted and y values have y subtracted
from them. This produces a data set whose mean is
zero.
Subtracting the mean makes variance and covariance
calculation easier by simplifying their equations. The
variance and co-variance values are not affected by
the mean value.
26. PCA Example –STEP 1
DATA:
x y
2.5 2.4
0.5 0.7
2.2 2.9
1.9 2.2
3.1 3.0
2.3 2.7
2 1.6
1 1.1
1.5 1.6
1.1 0.9
ZERO MEAN DATA:
x y
.69 .49
-1.31 -1.21
.39 .99
.09 .29
1.29 1.09
.49 .79
.19 -.31
-.81 -.81
-.31 -.31
-.71 -1.01
http://kybele.psych.cornell.edu/~edelman/Psych-465-Spring-2003/PCA-tutorial.pdf
27. PCA Example –STEP 1
http://kybele.psych.cornell.edu/~edelman/Psych-465-Spring-2003/PCA-tutorial.pdf
28. PCA Example –STEP 2
• Calculate the covariance matrix
cov = .616555556 .615444444
.615444444 .716555556
• since the non-diagonal elements in this covariance
matrix are positive, we should expect that both the x
and y variable increase together.
29. PCA Example –STEP 3
• Calculate the eigenvectors and eigenvalues of
the covariance matrix
eigenvalues = .0490833989
1.28402771
eigenvectors = -.735178656 -.677873399
.677873399 -.735178656
30. PCA Example –STEP 3
http://kybele.psych.cornell.edu/~edelman/Psych-465-Spring-2003/PCA-tutorial.pdf
•eigenvectors are plotted as
diagonal dotted lines on the
plot.
•Note they are
perpendicular to each other.
•Note one of the
eigenvectors goes through
the middle of the points, like
drawing a line of best fit.
•The second eigenvector
gives us the other, less
important, pattern in the
data, that all the points
follow the main line, but are
off to the side of the main
line by some amount.
31. PCA Example –STEP 4
• Reduce dimensionality and form feature vector
the eigenvector with the highest eigenvalue is the principle
component of the data set.
In our example, the eigenvector with the larges eigenvalue
was the one that pointed down the middle of the data.
Once eigenvectors are found from the covariance matrix, the
next step is to order them by eigenvalue, highest to lowest.
This gives you the components in order of significance.
32. PCA Example –STEP 4
Now, if you like, you can decide to ignore the components of
lesser significance.
You do lose some information, but if the eigenvalues are
small, you don’t lose much
• n dimensions in your data
• calculate n eigenvectors and eigenvalues
• choose only the first p eigenvectors
• final data set has only p dimensions.
33. PCA Example –STEP 4
• Feature Vector
FeatureVector = (eig1 eig2 eig3 … eign)
We can either form a feature vector with both of the
eigenvectors:
-.677873399
-.735178656
-.735178656
.677873399
or, we can choose to leave out the smaller, less
significant component and only have a single
column:
- .677873399
- .735178656
34. PCA Example –STEP 5
• Deriving the new data
FinalData = RowFeatureVector x RowZeroMeanData
RowFeatureVector is the matrix with the eigenvectors in the
columns transposed so that the eigenvectors are now in the
rows, with the most significant eigenvector at the top
RowZeroMeanData is the mean-adjusted data
transposed, ie. the data items are in each column,
with each row holding a separate dimension.
35. PCA Example –STEP 5
FinalData transpose: dimensions
along columns
x y
-.827970186 -.175115307
1.77758033 .142857227
-.992197494 .384374989
-.274210416 .130417207
-1.67580142 -.209498461
-.912949103 .175282444
.0991094375 -.349824698
1.14457216 .0464172582
.438046137 .0177646297
1.22382056 -.162675287
36. PCA Example –STEP 5
http://kybele.psych.cornell.edu/~edelman/Psych-465-Spring-2003/PCA-tutorial.pdf
37. Reconstruction of original Data
• If we reduced the dimensionality, obviously, when
reconstructing the data we would lose those
dimensions we chose to discard. In our example let
us assume that we considered only the x dimension…
38. Reconstruction of original Data
http://kybele.psych.cornell.edu/~edelman/Psych-465-Spring-2003/PCA-tutorial.pdf
x
-.827970186
1.77758033
-.992197494
-.274210416
-1.67580142
-.912949103
.0991094375
1.14457216
.438046137
1.22382056
39. Applications in computer vision-
PCA to find patterns-
• 20 face images: NxN size
• One image represented as follows-
• Putting all 20 images in 1 big matrix as follows-
• Performing PCA to find patterns in the face images
• Idenifying faces by measuring differences along the new axes (PCs)
40. PCA for image compression:
• Compile a dataset of 20 images
• Build the covariance matrix of 20 dimensions
• Compute the eigenvectors and eigenvalues
• Based on the eigenvalues, 5 dimensions can be left out, those with the least
eigenvalues.
• 1/4th of the space is saved.
41. Importance of PCA
• In data of high dimensions, where graphical representation is difficult, PCA is a
powerful tool for analysing data and finding patterns in it.
• Data compression is possible using PCA
• The most efficient expression of data is by the use of perpendicular components,
as done in PCA.
42. Questions:
• What do the eigenvectors of the covariance matrix while computing the principal
components give us?
• At what point in the PCA process can we decide to compress the data?
• Why are the principal components orthogonal?
• How many different covariance values can you calculate for an n-dimensional data
set?