This document provides an overview of anomaly detection using BigML. It defines anomaly detection and describes how BigML creates anomaly detectors using isolation forests in an unsupervised learning approach. The detector scores instances based on the number of splits needed to isolate them, with lower numbers of splits corresponding to higher anomaly scores. Applications of anomaly detection include detecting fraud, intrusions, and failures by finding rare or abnormal instances in datasets.
2. BigML Education Program 2Ensembles
In This Video
• Definition of anomaly detection
• Creation and interpretation of a BigML anomaly detector
• Generating an anomaly-free dataset
• Scoring instances with the trained anomaly detector
Unsupervised Learning
• Supervised learning
  • One field is the “objective field” (or “target variable”, or “label”) that is to be predicted
  • The algorithm is trying to create a model that makes this prediction accurately
• Unsupervised learning
  • Algorithm is trying to discover some structure in the data
  • Learned structure can often be applied to new data
Applications
• Detecting rare, malicious behavior (fraud, intrusion)
• Alerting service technicians to possible failures
• Filtering of anomalies for “cleaner” supervised learning
• Assessing model competence
Isolation Forests
Figure 2.1: Graphic representation example of a normal data point (left) versus an anomalous data point (right)
When all instances have been isolated, BigML automatically calculates an anomaly score by averaging the number of splits needed to isolate an instance across the trees in the ensemble. A lower number of splits results in a higher score. These averages are then normalized to get a final score between 0% and 100%. This score measures how anomalous an instance is. For example, the red data point on the left in Figure 2.1 took 10 partitions to isolate, while the one on the right took only 4, so the one on the right has a higher anomaly score.
[Figure 2.1 point labels: x_o, easy to isolate; x_i, difficult to isolate]
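The scoring procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not BigML's implementation: the function names, the toy data, and the tree count are all hypothetical, and the normalization used here (2 raised to the negative mean depth divided by an expected path length, from the original isolation forest paper) is an assumption, since the slides do not say how BigML normalizes its averages.

```python
import math
import random

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def expected_path_length(n):
    """Average path length c(n) used to normalize depths (assumed normalization)."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Recursively split on a random field at a random value."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}
    field = rng.randrange(len(points[0]))
    low = min(p[field] for p in points)
    high = max(p[field] for p in points)
    if low == high:
        # Simplification: stop here rather than trying another field.
        return {"size": len(points)}
    split = rng.uniform(low, high)
    left = [p for p in points if p[field] < split]
    right = [p for p in points if p[field] >= split]
    return {
        "field": field,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def splits_to_isolate(tree, point, depth=0):
    """Count the splits on the path from the root to the point's leaf."""
    if "split" not in tree:
        # Leaves that still hold several points get an expected extra depth.
        return depth + expected_path_length(tree["size"])
    branch = "left" if point[tree["field"]] < tree["split"] else "right"
    return splits_to_isolate(tree[branch], point, depth + 1)


def anomaly_score(forest, point, sample_size):
    """Average the split counts across trees, then normalize into (0, 1]."""
    mean_splits = sum(splits_to_isolate(t, point) for t in forest) / len(forest)
    return 2.0 ** (-mean_splits / expected_path_length(sample_size))


# Toy data: a cluster around the origin plus one obvious outlier.
rng = random.Random(0)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
data.append((10.0, 10.0))
max_depth = math.ceil(math.log2(len(data)))
forest = [build_tree(data, 0, max_depth, rng) for _ in range(100)]

outlier_score = anomaly_score(forest, (10.0, 10.0), len(data))
inlier_score = anomaly_score(forest, (0.0, 0.0), len(data))
print(f"outlier: {outlier_score:.0%}, inlier: {inlier_score:.0%}")
```

The outlier is separated from the cluster after only a few random splits, so its mean depth is small and its score is high. A point in the middle of the cluster needs many more splits, which pushes its score down.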
Review
• Anomaly detection is a way of detecting unusual instances in your dataset
• Detecting anomalies has many important real-world use cases
• The BigML interface allows you to easily view and interact with the detected anomalies in your dataset
• You can create a new dataset with your anomaly detector, either by filtering anomalies from the training data, or scoring a new dataset with the trained anomaly detector
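The filtering step in the last point can be illustrated with a short sketch. It assumes each row already has an anomaly score between 0 and 1. The helper `split_by_anomaly_score`, the 0.6 cutoff, and the sample rows are hypothetical and are not part of the BigML API. A suitable threshold depends on the dataset.

```python
def split_by_anomaly_score(rows, scores, threshold=0.6):
    """Separate rows into (clean, anomalous) using a score cutoff in [0, 1]."""
    if len(rows) != len(scores):
        raise ValueError("rows and scores must have the same length")
    clean = [row for row, score in zip(rows, scores) if score < threshold]
    anomalous = [row for row, score in zip(rows, scores) if score >= threshold]
    return clean, anomalous


# Hypothetical rows and scores, for illustration only.
rows = [{"amount": 12.5}, {"amount": 9.9}, {"amount": 4800.0}]
scores = [0.41, 0.38, 0.87]
clean_rows, anomalous_rows = split_by_anomaly_score(rows, scores)
```

Here `clean_rows` plays the role of the anomaly-free dataset, which could then be used to train a supervised model, while `anomalous_rows` holds the instances flagged for review.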