2. Me:
Political Science PhD, Data Scientist, Teacher, Do-Gooder.
Check me out on Twitter: @ruchowdh, or on
my website: rummanchowdhury.com (psst, I post
cool jobs there)
What’s Metis?
Metis accelerates the careers of data scientists by
providing full-time immersive bootcamps, evening
part-time professional development courses, online
training, and corporate programs.
Who is Rumman? What’s a Metis?
3. What is PCA?
Why do we need dimensionality reduction?
Intuition behind Principal Components Analysis
Coding example
18. Thousand dimensions:
I specified you with such high
resolution, with so much
detail, that you don’t look
like anybody else anymore.
You’re unique.
Curse of Dimensionality
19. Classification, clustering, and other analysis methods
become exponentially more difficult as the number of
dimensions increases.
[Scatter plot: Height vs. Cigarettes per day]
Curse of Dimensionality
20. Classification, clustering, and other analysis methods
become exponentially more difficult as the number of
dimensions increases.
To understand how to divide that huge space, we need
much more data (usually far more than we have, or can
get).
[Scatter plot: Height vs. Cigarettes per day]
Curse of Dimensionality
21. Lots of features with lots of data is best. But what if
you don't have the luxury of ginormous amounts of
data?
Not all features provide the same amount of
information. We can reduce the dimensions
(compress the data) without necessarily losing too
much information.
[Scatter plot: Height vs. Cigarettes per day]
Dimensionality Reduction
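The curse-of-dimensionality point above can be seen numerically. A quick sketch (synthetic data, seed fixed; the function name is my own) showing that pairwise distances between random points concentrate as dimensions grow, so "near" and "far" become harder to tell apart:

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_spread(n_points, n_dims):
    X = rng.standard_normal((n_points, n_dims))
    # all pairwise Euclidean distances
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d = d[np.triu_indices(n_points, k=1)]
    # relative spread: (max - min) / min — shrinks as dims grow
    return (d.max() - d.min()) / d.min()

print(distance_spread(200, 2))     # low dims: large relative spread
print(distance_spread(200, 1000))  # high dims: distances concentrate
```

With everything roughly equidistant, distance-based methods (clustering, k-NN) lose their signal, which is exactly why we reach for dimensionality reduction.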
22. Feature Extraction
Do I have to choose the
dimensions among existing
features?
[Scatter plot: Height vs. Cigarettes per day]
24. Why do we need dimensionality reduction?
- To better perform analyses
- …without sacrificing the information we
get from our features
- To better visualize our data
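Both motivations above fit in a few lines. A hedged sketch (not the talk's own code) using scikit-learn's `PCA` to project a synthetic 4-D dataset down to two plottable dimensions while keeping most of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# synthetic data: 4 features driven by 2 latent factors, plus a little noise
latent = rng.standard_normal((150, 2))
X = latent @ rng.standard_normal((2, 4)) + 0.1 * rng.standard_normal((150, 4))

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)   # now drawable as an ordinary scatter plot

print(X_2d.shape)                           # (150, 2)
print(pca.explained_variance_ratio_.sum())  # most of the variance survives
```

Because the data here is (by construction) nearly rank-2, two components retain almost all the information; real datasets won't be this tidy.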
37. Singular Value Decomposition
The eigenvectors and eigenvalues of a covariance (or
correlation) matrix represent the "core" of a PCA:
The eigenvectors (principal components) determine
the directions of the new feature space, and the
eigenvalues determine their magnitude.
In other words, the eigenvalues explain the
variance of the data along the new feature axes.
PCA Math
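The slide's claim can be demonstrated from scratch in numpy (synthetic data; variable names are my own): the eigenvectors of the covariance matrix are the principal directions, and the eigenvalues are the variances along them.

```python
import numpy as np

rng = np.random.default_rng(1)
# synthetic data whose columns have very different variances
X = rng.standard_normal((500, 3)) @ np.diag([3.0, 1.0, 0.2])
Xc = X - X.mean(axis=0)                  # center the data first

cov = np.cov(Xc, rowvar=False)           # 3x3 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: for symmetric matrices

# sort from largest to smallest eigenvalue
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# project onto the top-2 principal components
X_reduced = Xc @ eigvecs[:, :2]
print(eigvals)           # variances along the new axes, descending
print(X_reduced.shape)   # (500, 2)
```

Note `np.linalg.eigh` returns eigenvalues in ascending order, hence the re-sort; dropping the smallest-eigenvalue directions is what "compressing without losing much information" means concretely.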
38. Correlation or Covariance Matrix?
Use the correlation matrix to calculate the principal components
if variables are measured on different scales and you want to
standardize them, or if the variances differ widely between
variables. Otherwise, either the covariance or the correlation
matrix will do.
Matrix Selection
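One way to see why the choice matters: the correlation matrix is just the covariance matrix of standardized (z-scored) variables, so "PCA on the correlation matrix" and "PCA on standardized data" are the same thing. A small numpy check (synthetic features with wildly different scales, invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
X = np.column_stack([
    rng.normal(170, 10, 300),    # large-scale feature (e.g., height in cm)
    rng.normal(0.5, 0.05, 300),  # tiny-scale feature (e.g., a 0-1 rate)
])

# standardize: zero mean, unit (sample) standard deviation per column
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# covariance of standardized data == correlation of raw data
np.testing.assert_allclose(np.cov(Z, rowvar=False),
                           np.corrcoef(X, rowvar=False), atol=1e-10)
```

Without standardizing, the large-scale feature would dominate the covariance matrix and therefore the first principal component, purely because of its units.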
39. Kaiser Method
Retain any components with eigenvalues
greater than 1
Scree Test
Bar plot that shows the variance explained by each
component. Ideally you will see a clear drop-off
(elbow).
Percent Variance Explained
Sum the variance explained by each component
and stop once the cumulative total reaches a chosen
threshold (e.g., 90%).
How do I know how many dimensions to
reduce by?
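The three rules above can be sketched on a made-up eigenvalue spectrum (the numbers are purely illustrative):

```python
import numpy as np

# hypothetical eigenvalues from a correlation-matrix PCA, descending
eigvals = np.array([4.6, 2.2, 0.9, 0.6, 0.4, 0.3])

# Kaiser: keep components with eigenvalue > 1
kaiser_k = int(np.sum(eigvals > 1))              # -> 2

# Percent variance explained: keep enough to pass a threshold, e.g. 90%
cum_ratio = np.cumsum(eigvals) / eigvals.sum()
pct_k = int(np.searchsorted(cum_ratio, 0.90) + 1)

# Scree test: plot eigvals against component index and look for the elbow
print(kaiser_k, pct_k, np.round(cum_ratio, 2))
```

The rules can disagree (here Kaiser keeps 2 components while a 90% threshold keeps more), which is why the choice ultimately comes back to your end goal.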
40. What is the intuition behind PCA?
- We are attempting to resolve the curse of
dimensionality
- by shifting our perspective
- and keeping the eigenvectors that explain the
highest amount of variance.
- We select those components based on our end
goal, or by particular methods (Kaiser, Scree, %
Variance).