2. Outlier Analysis
Outlier – data objects that are grossly different from or
inconsistent with the remaining set of data
Causes
Measurement / Execution errors
Inherent data variability
Outliers – maybe valuable patterns
Fraud detection
Customized marketing
Medical Analysis
2
3. Outlier Mining
Given n data points and k – expected number of
outliers find the top k dissimilar objects
Define inconsistent data
Residuals in Regression
Difficulties – Multi-dimensional data, non-numeric data
Mine the outliers
Visualization based methods
Not applicable to cyclic plots, high dimensional data and categorical data
Approaches
Statistical Approach
Distance-based approach
Density based outlier approach
Deviation based approach
3
4. Statistical Distribution-based Outlier
detection
Assumes data follows a probability distribution and uses
discordancy test
Discordancy testing
Working hypothesis – H: oi ∈ F i=1,2,..n
Test verifies whether an object oi is significantly different from F
Significance probability SP(vi) = Prob(T>vi)
IF SP is small oi is discordant and working hypothesis is rejected
and alternate hypothesis that oi comes from another distribution
model G is adopted
4
5. Statistical Distribution-based Outlier
detection
Alternative distributions
Inherent alternative distribution
Alternative hypothesis: All objects arise from another distribution G
Mixture alternative distribution
Discordant values are not outliers but contaminants from G H’: oi ∈ (1-
λ) F + λG i=1,2,..n
Slippage alternative distribution
Some Objects are independent observations from a modified version
of F (different parameters)
5
6. Statistical Distribution-based Outlier
detection
Procedures for detecting Outliers
Block procedures
All are outliers or all are consistent
Consecutive Procedures
Inside-out procedure: Least likely object is tested first
If it is an outlier – more extreme values are also considered as outliers
Disadvantages of Statistical Approach
Tests are for single attributes
Data distribution may not be known
6
7. Distance based Outlier Detection
Distance-based outlier
A DB(p, D)-outlier is an object O in a dataset T such that at least
a fraction p of the objects in T lies at a distance greater than D
from O
Object does not have enough neighbours
Avoids excessive computation of Statistical models
If an object is an outlier according to a discordancy test then o is
DB(p, D) outlier for some p and D
7
8. Distance based Outlier Detection
Index based Algorithm
Uses multi-dimensional indexing structures such as k-d trees and R-trees
M – maximum number of objects within dmin neighborhood
Once M+1 neighbours are found o is not an outlier
O(n2
k) apart from index construction
Nested loop algorithm
Avoids index construction
Tries to minimize I/Os
Divides memory buffer space into two halves and data set into several logical
blocks
8
9. Distance based Outlier Detection
Cell based Algorithm
Complexity : O(ck
+n) c- depends on number of cells ; k – dimensionality
Data space is partitioned into cells: dmin / 2√k
Two layers surround each cell
First layer – One cell thick
Second layer - 2√k-1 cells thick
Algorithm processes cells instead of objects
Maintains three counts: cell_count, cell_+_1_layer_count,
cell_+_2_layers_count
An object in a cell is an outlier if cell_+_1_layer_count <= M, if not, no
objects in the cell are outliers
If cell_+_2_layers_count, <= M then all objects in cell – Outliers
If > M some may be outliers
Object by object processing has to be done
9
10. Density based Outlier detection
Previous methods assume data are uniformly
distributed
Data may have different density distributions
Difficulty in choosing dmin
10
11. Density based Outlier detection
Local Outlier – if its outlying relative to its local
neighbourhood particularly wrt the density of the
neighborhood
O2 is a local outlier wrt C2; o1 is also an outlier; none of the objects
in C1 are treated as outliers
Considers degree to which an object is an outlier
Local Outlier factor – degree depends on how isolated the object is
wrt its surroundings
11
12. Density based Outlier detection
The k-distance of an object p is the maximal distance that p gets
from its k-nearest neighbors d(p, o)
there are at least k objects in D that are as close as or closer to p than o;
for k o’ d(p, o’) <= d(p, o)
there are at most k-1 objects that are closer to p than o; for k-1 o” d(p,
o”) < d(p, o)
k-distance neighborhood
contains every object whose distance is not greater than the MinPts (k)-
distance of p
The reachability distance of an object p with respect to object o, is
defined as reach_distMinPts(p, o) = max { MinPts-distance(o), d(p, o) }
12
14. Density based Outlier detection
Local reachability density of p is the inverse of the
average reachability density based on the MinPts-
nearest neighbors of p.
Local outlier factor (LOF) of p captures the degree to
which we call p an outlier.
It is the average of the ratio of the local reachability density of p
and those of p’s MinPts-nearest neighbors.
LOF is higher for outliers
14
15. Deviation based Outlier detection
Identifies outliers by examining the main characteristics
of objects in a group
Objects that “deviate” from this description are
considered outliers
Sequential exception technique
Simulates the way in which humans can distinguish unusual
objects from among a series of supposedly like objects
15
16. Sequential exception technique
Given a data set D a sequence of subsets {D1, D2, ..Dm} is built
such that Dj-1 ⊆ Dj; Dissimilarities are assessed between
subsets in the sequence
Exception Set – Smallest subset of objects whose removal
results in greatest reduction of dissimilarity
Dissimilarity function – 1/n ∑i=1
n
(xi-x’)2
Smoothing factor: Assesses how much the dissimilarity can be
reduced by removing the subset from the original set of objects
Can be repeated to avoid the influence of order
16
Deviation based Outlier detection
17. Deviation based Outlier detection
OLAP Data Cube technique
Uses data cubes to identify regions of anomalies
A cell value in a cube is an exception if it differs
significantly from an expected value
Visualization effects guide user
May drill down
17