Introduction to conventional machine learning techniques
1. Principal Component Analysis for
Novelty Detection
A journal article submitted to and accepted by Pattern Recognition Letters
Jordan McBain, P.Eng.
Markus Timusk, PhD, P.Eng.
2. Condition Monitoring
Maintenance technique
Maintenance undertaken when some indicator of health is
flagged
Advanced technique employed when cost-benefit analysis
justifies the expense of monitoring equipment
Alternative to run-to-failure maintenance and statistically
determined time-based maintenance
Employ pattern recognition to automate diagnosis
Expert system employed to replicate a technician's
maintenance insight
Computer and sensors replace the technician and screwdriver set
atop a vibrating machine – the nature of the vibration is used to
discern state
3. Pattern Recognition
Equality is an insufficient means of classifying real-world
members of a class (noise, variance, etc.)
Pattern recognition
Real-world signals presumed representative of a class are
reduced to n-dimensional feature vectors
Plotted in n-dimensional space
Decision boundary generated with pattern recognition
techniques
Employed as classification rule
Problems
Choice of features
How representative?
Maximize number of features?
Curse of dimensionality
Imbalance of data
4. Principal Component Analysis
One technique used to find “optimal” set of features
Finds the axes of normally distributed data
Select the largest axes and omit the smaller ones to define a
new basis
Project data onto basis to reduce dimensionality of
problem space
Each feature presumed to be normally distributed
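As a minimal sketch of the procedure above (assuming NumPy; the function and variable names here are illustrative, not from the paper):

```python
import numpy as np

def pca_reduce(X, k):
    """Project the d-dimensional rows of X onto the k largest principal axes."""
    m = X.mean(axis=0)                    # fix the basis at the data's mean
    C = np.cov(X - m, rowvar=False)       # d x d sample covariance matrix
    evals, evecs = np.linalg.eigh(C)      # symmetric eigenvalue problem
    order = np.argsort(evals)[::-1]       # sort axes by decreasing variance
    W = evecs[:, order[:k]]               # keep the k largest axes as the new basis
    return (X - m) @ W                    # n x k reduced representation

# Example: 200 samples in 5 dimensions reduced to 2
Z = pca_reduce(np.random.randn(200, 5), 2)
```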
5. N-dimensional scattering of features presumed
independent
Combined probability:
$P(A \cap B) = P(A) \cdot P(B)$
6. Combined density of d independent, normally distributed features:
$$p(\mathbf{x}) = \prod_{i=1}^{d} p(x_i)
= \prod_{i=1}^{d} \frac{1}{\sqrt{2\pi}\,\sigma_i}\,
  e^{-\frac{1}{2}\left(\frac{x_i-\mu_i}{\sigma_i}\right)^2}
= \frac{1}{(2\pi)^{d/2}\prod_{i=1}^{d}\sigma_i}\,
  e^{-\frac{1}{2}\sum_{i=1}^{d}\left(\frac{x_i-\mu_i}{\sigma_i}\right)^2}
= \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}\,
  e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^t\,\Sigma^{-1}\,(\mathbf{x}-\boldsymbol{\mu})}$$
Find principal components
(i.e. axes of hyper-ellipsoidal
distribution)
Select maximum variance
(largest axes)
Eigenvalue problem
Eigenvectors – principal
components
Eigenvalues – size of
axis
7. Novelty Detection
Deals with imbalance of data between classes
Fault detection in machinery
Easy to collect data representative of healthy state
Difficult to collect data representative of faulted states
Costly to break machinery
Operationally unacceptable
Poor database of faults kept
Can never capture them all!
Model healthy data with decision boundary
If test patterns fall outside, classify as a fault!
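A minimal sketch of this healthy-data boundary, assuming a Mahalanobis-distance threshold (the detector class, its names, and the 99% quantile threshold are illustrative assumptions, not the paper's method):

```python
import numpy as np

class MahalanobisNoveltyDetector:
    """Fit a boundary to healthy data; flag test patterns that fall outside."""

    def fit(self, X_healthy, quantile=0.99):
        self.m = X_healthy.mean(axis=0)
        self.Cinv = np.linalg.inv(np.cov(X_healthy, rowvar=False))
        # Boundary is set from healthy data alone -- no fault data needed
        self.threshold = np.quantile(self._dist2(X_healthy), quantile)
        return self

    def _dist2(self, X):
        D = X - self.m
        return np.einsum('ij,jk,ik->i', D, self.Cinv, D)  # squared Mahalanobis distance

    def predict(self, X):
        return self._dist2(X) > self.threshold            # True => novel / fault

detector = MahalanobisNoveltyDetector().fit(np.random.randn(500, 4))
flags = detector.predict(np.random.randn(10, 4) + 5.0)    # far from healthy mean
```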
8. Problem
PCA selects a subspace that best
represents the data
In pattern recognition, we seek to discriminate
between classes
The objectives of most feature reduction techniques are
not optimized for novelty detection
10. Feature Reduction Techniques
Feature Selection vs. Feature Extraction
Selection
Choosing small subsets of features that are adequate to
describe classes
E.g. “Search” (a minimal sketch follows this list)
Examines all subsets of feature combinations to find the one which
maximizes some objective function
May employ classifier error as the objective function
Exponential explosion in the number of subsets
Heuristics available to mitigate this
If computationally feasible, gives the best results
Extraction
Computes a small number of new features from the set of old
features
E.g. PCA
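A minimal sketch of the “Search” selection approach above, assuming a nearest-mean classifier's error as the objective function (the classifier, names, and data handling are placeholders):

```python
import numpy as np
from itertools import combinations

def classifier_error(X, y, cols):
    """Nearest-mean training error on the chosen feature subset (placeholder objective)."""
    Xs = X[:, list(cols)]
    means = {c: Xs[y == c].mean(axis=0) for c in np.unique(y)}
    pred = [min(means, key=lambda c: np.linalg.norm(x - means[c])) for x in Xs]
    return np.mean(np.array(pred) != y)

def exhaustive_search(X, y, k):
    """Examine every k-feature subset; exponential in the number of features."""
    return min(combinations(range(X.shape[1]), k),
               key=lambda cols: classifier_error(X, y, cols))
```

The `combinations` loop is exponential in the number of features, which is exactly the explosion the slide warns about.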
11. Principal Component Analysis
Seeks a subspace in which the data representation
error is minimal
Development
For a set of n vectors in d-dimensional space,
seek the equation of a hyperplane onto which the data may be
projected with minimal representation error
Hyperplane fixed at the data’s mean, m
Hyperplane’s orientation defined by a direction vector, w (the
normal definition of a plane)
Derive error function
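The error function being derived can be reconstructed from the cited source (Duda, 2000): represent each sample by its projection onto the line through the mean m with direction w, and sum the squared representation errors. A hedged reconstruction:

$$J(a_1,\dots,a_n,\mathbf{w}) \;=\; \sum_{k=1}^{n}\bigl\|(\mathbf{m} + a_k\,\mathbf{w}) - \mathbf{x}_k\bigr\|^2,
\qquad a_k = \mathbf{w}^t(\mathbf{x}_k - \mathbf{m})$$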
12. Optimization problem reduces to the well-known eigenvalue
problem
Resultant feature space is linear
May not represent non-linear and changing data well
Kernel PCA and Dynamic PCA
These techniques are suitable only for representing data, not for
discriminating between classes
Source: Duda, 2000
13. Multiple Discriminant Analysis
Seeks to find efficient subspaces for discrimination
rather than representation
Development
Two class problem with d-dimensional set of n-vectors
grouped into D1 and D2
Projected onto some direction vector w to give the scalar
samples y = wᵗx
Consequently grouped into subsets Y1 and Y2
Find the direction vector w such that the distance
between projected sample means m1 and m2 is
maximized
Rationalize the distance against the relative sample size
14. Reduces to the generalized Rayleigh quotient
$$J(\mathbf{w}) = \frac{\mathbf{w}^t S_B\,\mathbf{w}}{\mathbf{w}^t S_W\,\mathbf{w}}$$
Solution is described as “analogous to the well known
Rayleigh quotient:”
$$\mathbf{w} = S_W^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$$
Technique extended for problems with n-classes
Objective to maximize the spread between all classes in the
projected space
Source: Duda, 2000
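A minimal two-class sketch of this computation (NumPy; assumes the within-class scatter matrix is invertible):

```python
import numpy as np

def fisher_direction(D1, D2):
    """Two-class MDA: the direction w that maximizes the separation of the
    projected sample means, rationalized against the within-class scatter."""
    m1, m2 = D1.mean(axis=0), D2.mean(axis=0)
    S_W = (D1 - m1).T @ (D1 - m1) + (D2 - m2).T @ (D2 - m2)  # within-class scatter
    w = np.linalg.solve(S_W, m1 - m2)                        # w = S_W^{-1}(m1 - m2)
    return w / np.linalg.norm(w)

# Project either class onto the discriminant direction: y = X @ w
```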
16. Development
Objective: distinguish between normal and abnormal
classes
KFDA inappropriate (assumes classes cluster into
well-separated groups)
Novelty detection – classes may cluster well, but abnormal
classes are expected to orbit the normal data
Means could overlap
Eliminating previous objective functions
Approach: find the subspace maximizing difference
between average spread of the normal class and
average spread of the abnormal class measured
from the mean of the normal class
17. Mathematically, for an outlier class containing b
elements and a target class containing a elements
with mean m_t
To simplify, introduce the outlier scatter matrix, O, for the
outlier data centered at m_t
Reducing to the objective below (a hedged reconstruction follows)
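A hedged reconstruction of the missing expressions, consistent with slide 18's eigendecomposition of the matrix St − O (the scatter normalizations and notation are assumptions, not the paper's exact formulas):

$$O = \frac{1}{b}\sum_{j=1}^{b}(\mathbf{x}_j - \mathbf{m}_t)(\mathbf{x}_j - \mathbf{m}_t)^t,
\qquad
S_t = \frac{1}{a}\sum_{i=1}^{a}(\mathbf{x}_i - \mathbf{m}_t)(\mathbf{x}_i - \mathbf{m}_t)^t$$

$$J(\mathbf{w}) = \mathbf{w}^t\,(S_t - O)\,\mathbf{w}, \qquad \|\mathbf{w}\| = 1$$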
18. Maximize this objective function
Find the eigenvectors and eigenvalues of the matrix St-O
Select the k largest eigenvalues and use the
corresponding eigenvectors as the new basis
Project data onto new basis
Proceed with classification
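Pulling these steps together, a minimal sketch under the reconstruction above (NumPy; the scatter normalizations are assumptions):

```python
import numpy as np

def novelty_basis(X_target, X_outlier, k):
    """New basis from the k largest eigenvalues of S_t - O,
    both scatters centered at the target (normal-class) mean."""
    m_t = X_target.mean(axis=0)
    Dt, Do = X_target - m_t, X_outlier - m_t
    S_t = Dt.T @ Dt / len(X_target)         # average spread of the normal class
    O = Do.T @ Do / len(X_outlier)          # outlier scatter about the target mean
    evals, evecs = np.linalg.eigh(S_t - O)  # symmetric eigenvalue problem
    W = evecs[:, np.argsort(evals)[::-1][:k]]
    return m_t, W

# Project the data onto the new basis, then proceed with classification:
# m_t, W = novelty_basis(X_target, X_outlier, k=2); Z = (X - m_t) @ W
```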
Limitations
Still dependent on the assumption of a normal data distribution
(as are other PCA techniques)
Assumption: normal data scatter somewhat circularly and
outlier data orbit nicely without intruding
(as with PCA and MDA)
Machinery vibration data are generally not Gaussian (a heuristic
observation)
19. Validation: Artificial Data
Artificial 3-d data set
Normal distribution:
spherical (radius 50) centered at origin
Outlier distribution:
randomly generated spherical distribution (radius 100)
Not permitted to fall within cylinder concentric with the normal
data’s sphere and oriented with length parallel to [1,1,1]
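A hedged sketch of one way to generate such a data set (NumPy; the points are sampled on the sphere surfaces, and the cylinder radius is an assumed value since the slide does not state it):

```python
import numpy as np

rng = np.random.default_rng(0)
axis = np.array([1.0, 1.0, 1.0]) / np.sqrt(3)      # cylinder axis direction

def sphere_points(n, radius):
    """Uniformly distributed points on a sphere of the given radius."""
    v = rng.normal(size=(n, 3))
    return radius * v / np.linalg.norm(v, axis=1, keepdims=True)

normal = sphere_points(1000, 50)                   # normal class, radius 50

# Rejection-sample outliers: keep only points outside the exclusion cylinder
cyl_radius = 60.0                                  # assumed value, not stated on the slide
outliers = sphere_points(5000, 100)
radial = outliers - np.outer(outliers @ axis, axis)  # component perpendicular to axis
outliers = outliers[np.linalg.norm(radial, axis=1) > cyl_radius]
```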
20. Validation: Artificial Data
Results (reduced to 2 dimensions)
Subspace’s normal vector only 7 degrees off from
expected [1,1,1]
22. Apparatus
Spectraquest gear dynamics simulator
3-hp motor
Magnetic particle brake loading
National Instruments PXI data acquisition and control
Accelerometers (sampled at 4 kHz)
24. Feature Extraction
Autoregressive model
A model of a statistical process obtained by regressing the
process on its own previous values
Yields a compact set of coefficients that best represents the
original sampled signal
Order 10
Segmentation
Vibration data segmented into groups based on
intervals with constant number of shaft rotations
Gaussian Window
70% overlap between segments
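A minimal sketch of fitting an order-10 AR model to one segment by least squares (NumPy; the windowing is simplified, with a Hanning window standing in for the Gaussian window mentioned above):

```python
import numpy as np

def ar_coefficients(x, order=10):
    """Least-squares AR fit: x[t] ~ sum_i a[i] * x[t - i]; the a[i] are the features."""
    rows = np.array([x[t - order:t][::-1] for t in range(order, len(x))])
    a, *_ = np.linalg.lstsq(rows, x[order:], rcond=None)
    return a                                # one 10-element feature vector per segment

segment = np.sin(0.3 * np.arange(400)) + 0.1 * np.random.randn(400)
segment = segment * np.hanning(len(segment))  # stand-in for the Gaussian window
features = ar_coefficients(segment, order=10)
```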
30. Motivation and Development
The above violates an assumption of novelty detection:
only limited data are available from fault classes
In the case where we know nothing of the outlier
classes
Work with what we have: normal data
Minimize variance of normal data
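One reading of “minimize variance of normal data” is to retain the directions along which the normal class varies least, i.e. the smallest eigenvalues of the target scatter. A sketch of that interpretation (an assumption, not the paper's stated algorithm):

```python
import numpy as np

def min_variance_basis(X_target, k):
    """Keep the k directions along which the normal data vary least."""
    m_t = X_target.mean(axis=0)
    S_t = np.cov(X_target - m_t, rowvar=False)
    evals, evecs = np.linalg.eigh(S_t)   # eigenvalues in ascending order
    return m_t, evecs[:, :k]             # k smallest-variance directions
```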
33. Conclusions
Reduce a large feature space to a smaller one
Mitigate the curse of dimensionality
Objective function tweaked for novelty detection
Similar to MDA but modified to accommodate case
where normal and outlier means are closely
separated
Results good for artificial and machinery data
Future work
Extend technique with kernels
Difficult problem due to the need for the target mean in the
kernel-induced feature space
Thanks
CEMI
Dr. Mechefske, Queen's University