• Principal component analysis (PCA) is a statistical technique that is useful
for the compression and classification of data.
• The purpose is to reduce the dimensionality of a data set (sample) by
finding a new set of variables that is smaller than the original set.
• The new data set retains most of the information, in the form of the
variation present in the sample, which is captured by the correlations
between the original variables in the large data set.
• The new variables, called principal components (PCs), are uncorrelated,
and are arranged by the fraction of the total information each retains.
• PCA does not select individual features. The original features of the
dataset are converted to principal components, which are linear
combinations of the existing features, chosen by how much of the
variance in the data each one captures.
• The first principal component is the direction of highest variance.
The second principal component is the direction of second-highest
variance that is uncorrelated with the first, and so on.
• Traditionally, principal component analysis is performed on a square
symmetric matrix, such as the covariance matrix or the correlation
matrix of the variables.
• PCA reduces the attribute space from a larger number of variables to a
smaller number of factors and, as such, is a "non-dependent"
procedure (that is, it does not assume that a dependent variable is
specified).
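To illustrate the properties listed above, here is a minimal sketch in Python with NumPy. It assumes a small synthetic data set of two correlated variables (the names `a`, `b`, and the noise level are made up for illustration) and shows that the first principal component weights both variables, that the components are uncorrelated, and that they are ordered by the fraction of variance each retains.

```python
import numpy as np

# Synthetic, illustrative data: two correlated variables (assumed, not from the notes).
rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.5, size=500)
X = np.column_stack([a, b])

# The covariance matrix is the square symmetric matrix PCA works on.
C = np.cov(X, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending
# order, so reverse to put the largest-variance direction first.
eigvals, eigvecs = np.linalg.eigh(C)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Each column of eigvecs holds the weights of one principal component,
# i.e. the coefficients of a linear combination of the original variables.
first_pc_weights = eigvecs[:, 0]

# Fraction of the total variance retained by each component.
explained_fraction = eigvals / eigvals.sum()

# Project the centered data onto the components and check their correlation.
pcs = (X - X.mean(axis=0)) @ eigvecs
pc_correlation = np.corrcoef(pcs, rowvar=False)[0, 1]

print("first PC weights:", first_pc_weights)
print("explained fraction:", explained_fraction)
print("correlation between PC1 and PC2:", pc_correlation)
```

Both entries of `first_pc_weights` are clearly non-zero, which is why a principal component cannot be read as a single original feature.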
• Step 1: Get some data
• Step 2: Subtract the mean
• Step 3: Calculate the covariance matrix
• Step 4: Calculate the eigenvectors and eigenvalues of the covariance
matrix
• Step 5: Choose the components and form a feature vector
• Step 6: Derive the new data set
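The six steps above can be sketched in Python with NumPy as follows. This is one possible reading of the steps, not a reference implementation: the function name `pca_steps`, the parameter `k` (number of components kept), and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def pca_steps(X, k):
    """Apply steps 2-6 to an (n_samples, n_features) array X, keeping k components."""
    # Step 2: subtract the mean of each variable.
    Xc = X - X.mean(axis=0)

    # Step 3: covariance matrix of the variables (n_features x n_features).
    cov = np.cov(Xc, rowvar=False)

    # Step 4: eigenvectors and eigenvalues of the covariance matrix.
    # eigh is intended for symmetric matrices; eigenvalues come back ascending.
    eigvals, eigvecs = np.linalg.eigh(cov)

    # Step 5: order components by decreasing eigenvalue and keep the top k
    # eigenvectors as the columns of the feature vector W.
    order = np.argsort(eigvals)[::-1]
    eigvals = eigvals[order]
    W = eigvecs[:, order][:, :k]

    # Step 6: derive the new data set by projecting the centered data onto W.
    scores = Xc @ W
    return scores, W, eigvals

# Step 1: get some data (synthetic here; the third variable is pure noise).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([
    x,
    2 * x + rng.normal(scale=0.1, size=200),
    rng.normal(size=200),
])

scores, W, eigvals = pca_steps(X, k=2)
print("new data set shape:", scores.shape)
print("eigenvalues (descending):", eigvals)
```

The variance of each column of `scores` equals the corresponding eigenvalue, so the eigenvalues can be used in Step 5 to decide how many components to keep.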
• Advantages:
• 1. Removes correlated features
• 2. Improves algorithm performance
• 3. Reduces overfitting
• 4. Improves visualization
• Disadvantages:
• 1. Independent variables become less interpretable
• 2. Data must be standardized before PCA
• 3. Information loss
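The standardization and information-loss points can be made concrete with a small sketch in Python with NumPy. The data are synthetic and the variable names (`height_m`, `income`) and the helper `explained_fractions` are hypothetical, chosen only to put two independent variables on very different numeric scales.

```python
import numpy as np

# Synthetic, illustrative data: two independent variables on different scales.
rng = np.random.default_rng(2)
n = 300
height_m = rng.normal(1.7, 0.1, size=n)        # values around 1.7
income = rng.normal(50_000, 10_000, size=n)    # values around 50,000
X = np.column_stack([height_m, income])

def explained_fractions(data):
    """Fraction of total variance carried by each principal component."""
    cov = np.cov(data, rowvar=False)
    eigvals = np.linalg.eigvalsh(cov)[::-1]    # descending order
    return eigvals / eigvals.sum()

# Without standardization, the large-scale variable dominates the first PC.
raw = explained_fractions(X)

# Standardize: zero mean and unit variance for every variable.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
standardized = explained_fractions(Z)

# Information lost (as a fraction of variance) if only the first PC is kept.
lost_if_one_kept = 1.0 - standardized[0]

print("explained fractions, raw data:", raw)
print("explained fractions, standardized:", standardized)
print("variance lost when keeping one PC:", lost_if_one_kept)
```

On the raw data the first component is almost entirely the large-scale variable, an artifact of units rather than structure. After standardization the two components share the variance far more evenly, and the discarded fraction gives a direct measure of the information lost by dropping a component.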