Outlier Analysis
• Outlier: a data object that is grossly different from, or inconsistent with, the remaining set of data
• Causes
  • Measurement or execution errors
  • Inherent data variability
• Outliers may themselves be valuable patterns
  • Fraud detection
  • Customized marketing
  • Medical analysis

Outlier Mining
• Given n data points and k, the expected number of outliers, find the top k objects that are most dissimilar from the rest
• Define what counts as inconsistent data
  • Example: residuals in regression
  • Difficulties: multi-dimensional data, non-numeric data
• Mine the outliers
  • Visualization-based methods
    • Not applicable to cyclic plots, high-dimensional data, or categorical data
• Approaches
  • Statistical approach
  • Distance-based approach
  • Density-based approach
  • Deviation-based approach

Statistical Distribution-based Outlier Detection
• Assumes the data follow a probability distribution F and identifies outliers with a discordancy test
• Discordancy testing
  • Working hypothesis H: o_i ∈ F, i = 1, 2, ..., n
  • The test verifies whether an object o_i is significantly different from F
  • Significance probability: SP(v_i) = Prob(T > v_i), where T is the test statistic and v_i is its value for o_i
  • If SP(v_i) is sufficiently small, o_i is discordant: the working hypothesis is rejected and the alternative hypothesis, that o_i comes from another distribution model G, is adopted

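The discordancy test above can be illustrated with a minimal Python sketch. It assumes that F is a normal distribution whose mean and standard deviation are estimated from the other objects, that the test statistic T is the absolute standardized deviation, and that "small" means below a hypothetical significance level `alpha = 0.01`. None of these choices come from the slides, and the sketch ignores the multiple-testing issues that a formal discordancy test addresses.

```python
from statistics import NormalDist, mean, stdev

def significance_probability(values, i):
    """SP(v_i) = Prob(T > v_i) under the working hypothesis H: o_i in F.

    Assumption: F is Normal(mu, sigma) estimated from the other objects,
    and T is the absolute standardized deviation |x - mu| / sigma.
    """
    rest = values[:i] + values[i + 1:]   # estimate F without the tested object
    mu, sigma = mean(rest), stdev(rest)
    v_i = abs(values[i] - mu) / sigma
    # Two-sided tail probability of a standard normal beyond v_i.
    return 2.0 * (1.0 - NormalDist().cdf(v_i))

def discordant(values, alpha=0.01):
    """Indices whose SP falls below alpha (a hypothetical significance level)."""
    return [i for i in range(len(values))
            if significance_probability(values, i) < alpha]

data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 14.5]
print(discordant(data))   # index of the value 14.5
```
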
Statistical Distribution-based Outlier Detection
• Alternative distributions
  • Inherent alternative distribution
    • Alternative hypothesis: all objects arise from another distribution G
  • Mixture alternative distribution
    • Discordant values are not outliers in F but contaminants from another distribution G
    • H': o_i ∈ (1 − λ)F + λG, i = 1, 2, ..., n
  • Slippage alternative distribution
    • Some objects are independent observations from a modified version of F (with different parameters)

Statistical Distribution-based Outlier Detection
• Procedures for detecting outliers
  • Block procedures
    • Either all suspect objects are treated as outliers, or all are accepted as consistent
  • Consecutive procedures
    • Inside-out procedure: the object least likely to be an outlier is tested first
    • If it is found to be an outlier, all more extreme values are also considered outliers; otherwise the next most extreme object is tested
• Disadvantages of the statistical approach
  • Most tests are for single attributes
  • The data distribution may not be known

Distance-based Outlier Detection
• Distance-based outlier
  • A DB(p, D)-outlier is an object o in a data set T such that at least a fraction p of the objects in T lie at a distance greater than D from o
  • Intuition: the object does not have enough neighbours
  • Avoids the excessive computation of fitting statistical models
  • For many discordancy tests, if an object o is an outlier according to the test, then o is also a DB(p, D)-outlier for some suitable p and D

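The DB(p, D) definition translates almost directly into code. The following is a naive sketch that assumes numeric points and Euclidean distance (the slides do not fix a distance function); the function name `is_db_outlier` and the sample data are hypothetical.

```python
from math import dist

def is_db_outlier(o, dataset, p, D):
    """True if at least a fraction p of the objects in dataset lie farther than D from o.

    The object o itself is counted as part of the data set (at distance 0).
    """
    far = sum(1 for x in dataset if dist(o, x) > D)
    return far >= p * len(dataset)

T = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
outliers = [o for o in T if is_db_outlier(o, T, p=0.75, D=1.0)]
print(outliers)   # only the isolated point (5.0, 5.0) qualifies
```

This version compares every object with every other object, which is the cost that the index-based, nested-loop, and cell-based algorithms try to reduce.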
Distance-based Outlier Detection
• Index-based algorithm
  • Uses multi-dimensional indexing structures such as k-d trees and R-trees to search for neighbours within radius dmin
  • M: maximum number of objects allowed within the dmin-neighbourhood of an outlier
  • Once M + 1 neighbours of o are found, o is not an outlier
  • Worst-case complexity $O(k n^2)$, not counting index construction (k: dimensionality, n: number of objects)
• Nested-loop algorithm
  • Avoids index construction
  • Tries to minimize the number of I/Os
  • Divides the memory buffer space into two halves and the data set into several logical blocks

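The early-termination rule (stop as soon as M + 1 neighbours are found) can be sketched as follows. This is an in-memory simplification that assumes Euclidean distance; it does not model the buffer halves and logical blocks that the nested-loop algorithm uses to reduce I/O, and the function name is hypothetical.

```python
from math import dist

def nested_loop_outliers(objects, dmin, M):
    """Objects with at most M other objects inside their dmin-neighbourhood."""
    outliers = []
    for i, o in enumerate(objects):
        neighbours = 0
        for j, other in enumerate(objects):
            if i != j and dist(o, other) <= dmin:
                neighbours += 1
                if neighbours > M:      # M + 1 neighbours found: o is not an outlier
                    break
        else:
            outliers.append(o)          # inner loop ended without exceeding M
    return outliers

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
print(nested_loop_outliers(points, dmin=2.0, M=1))
```
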
Distance-based Outlier Detection
• Cell-based algorithm
  • Complexity: $O(c^k + n)$, where c is a constant that depends on the number of cells and k is the dimensionality
  • The data space is partitioned into cells with side length $\mathit{dmin} / (2\sqrt{k})$
  • Two layers surround each cell
    • First layer: one cell thick
    • Second layer: $\lceil 2\sqrt{k} - 1 \rceil$ cells thick
  • The algorithm processes cells instead of individual objects
  • Maintains three counts per cell: cell_count, cell_+_1_layer_count, cell_+_2_layers_count
  • If cell_+_1_layer_count > M, no object in the cell is an outlier
  • If cell_+_2_layers_count <= M, all objects in the cell are outliers
  • Otherwise (cell_+_1_layer_count <= M < cell_+_2_layers_count), some objects in the cell may be outliers, and object-by-object processing has to be done

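To make the cell geometry concrete, the following sketch computes the cell side length and the thickness of the second layer for a given dmin and dimensionality k, and maps a point to its cell. The function names and the grid origin at 0 are assumptions for illustration only.

```python
from math import sqrt, ceil, floor

def cell_geometry(dmin, k):
    """Side length of a cell and thickness (in cells) of the second layer."""
    side = dmin / (2 * sqrt(k))
    layer2 = ceil(2 * sqrt(k) - 1)
    return side, layer2

def cell_index(point, side):
    """Integer grid coordinates of the cell containing point (grid origin at 0)."""
    return tuple(floor(x / side) for x in point)

side, layer2 = cell_geometry(dmin=2.0, k=2)
# A cell's diagonal is side * sqrt(k) = dmin / 2, so two objects in the same
# cell or in adjacent (layer-1) cells are at most dmin apart.
print(side, layer2, cell_index((3.2, 0.4), side))
```

The choice of side length is what makes the counting rules work: every object in the first layer is guaranteed to be within dmin of any object in the cell, and every object beyond the second layer is guaranteed to be farther than dmin.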
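Density-based Outlier Detection
• Statistical and distance-based methods depend on the overall (global) distribution of the data
• Real data are usually not uniformly distributed: different regions may have different densities
• This makes it difficult to choose a single dmin for the whole data set
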
Density-based Outlier Detection
• Local outlier: an object that is outlying relative to its local neighbourhood, particularly with respect to the density of that neighbourhood
  • o2 is a local outlier with respect to C2; o1 is also an outlier; none of the objects in C1 is treated as an outlier
• Considers the degree to which an object is an outlier, rather than treating it as a binary property
  • Local outlier factor (LOF): the degree depends on how isolated the object is with respect to its surrounding neighbourhood

Density-based Outlier Detection
• The k-distance of an object p, k-distance(p), is the distance d(p, o) between p and an object o ∈ D such that
  • for at least k objects o' ∈ D, d(p, o') <= d(p, o): at least k objects are as close as or closer to p than o
  • for at most k − 1 objects o'' ∈ D, d(p, o'') < d(p, o): at most k − 1 objects are closer to p than o
• k-distance neighbourhood of p
  • Contains every object whose distance from p is not greater than the k-distance of p (with k = MinPts)
• The reachability distance of an object p with respect to object o is defined as reach_dist_MinPts(p, o) = max{MinPts-distance(o), d(p, o)}

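These three definitions can be sketched in a few lines of Python. The sketch assumes Euclidean distance and a data set of distinct points, with p itself excluded from its own neighbourhood; the function names are hypothetical.

```python
from math import dist

def k_distance(p, D, k):
    """Distance from p to its k-th nearest neighbour in D (p itself excluded)."""
    ds = sorted(dist(p, o) for o in D if o != p)
    return ds[k - 1]

def k_neighbourhood(p, D, k):
    """All objects whose distance from p is not greater than k-distance(p)."""
    kd = k_distance(p, D, k)
    return [o for o in D if o != p and dist(p, o) <= kd]

def reach_dist(p, o, D, k):
    """reach_dist_k(p, o) = max{k-distance(o), d(p, o)}."""
    return max(k_distance(o, D, k), dist(p, o))

D = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (4.0, 4.0)]
p, o = (4.0, 4.0), (1.0, 1.0)
print(k_distance(p, D, 2), k_neighbourhood(p, D, 2), reach_dist(p, o, D, 2))
```

Because of ties at the k-distance, the neighbourhood can contain more than k objects, which is why the definition uses "at least k" and "at most k − 1".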
OPTICS
• Complexity: O(n log n) when a spatial index is used

Density-based Outlier Detection
• The local reachability density of p is the inverse of the average reachability distance from p to its MinPts-nearest neighbours
• The local outlier factor (LOF) of p captures the degree to which we call p an outlier
  • It is the average of the ratios of the local reachability density of each of p's MinPts-nearest neighbours to the local reachability density of p
  • LOF is close to 1 for objects deep inside a cluster and higher for outliers

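Putting the definitions together, a brute-force LOF computation might look like the sketch below. It assumes Euclidean distance and distinct points (duplicates would make a reachability sum zero), recomputes neighbourhoods on every call for clarity rather than speed, and uses hypothetical function names.

```python
from math import dist

def knn(p, D, k):
    """k-distance neighbourhood of p (ties included) and the k-distance itself."""
    others = sorted((o for o in D if o != p), key=lambda o: dist(p, o))
    kd = dist(p, others[k - 1])
    return [o for o in others if dist(p, o) <= kd], kd

def lrd(p, D, k):
    """Local reachability density: inverse of the mean reachability distance."""
    neighbours, _ = knn(p, D, k)
    total = sum(max(knn(o, D, k)[1], dist(p, o)) for o in neighbours)
    return len(neighbours) / total

def lof(p, D, k):
    """Mean ratio lrd(o) / lrd(p) over the neighbours o of p."""
    neighbours, _ = knn(p, D, k)
    lrd_p = lrd(p, D, k)
    return sum(lrd(o, D, k) / lrd_p for o in neighbours) / len(neighbours)

D = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (4.0, 4.0)]
for point in D:
    print(point, round(lof(point, D, k=2), 2))
```

In this toy data set the four clustered points get an LOF of about 1, while the isolated point gets a clearly larger value.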
Deviation-based Outlier Detection
• Identifies outliers by examining the main characteristics of the objects in a group
• Objects that “deviate” from this description are considered outliers
• Sequential exception technique
  • Simulates the way in which humans can distinguish unusual objects from among a series of supposedly like objects

Deviation-based Outlier Detection
• Sequential exception technique
  • Given a data set D, a sequence of subsets {D_1, D_2, ..., D_m} is built such that D_{j-1} ⊆ D_j
  • Dissimilarities are assessed between subsets in the sequence
  • Exception set: the smallest subset of objects whose removal results in the greatest reduction of dissimilarity in the remaining set
  • Dissimilarity function (the variance, for numeric data): $\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$, where $\bar{x}$ is the mean of the n values
  • Smoothing factor: assesses how much the dissimilarity can be reduced by removing the subset from the original set of objects
  • The process can be repeated with different orderings to reduce the influence of order on the result

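The dissimilarity function and smoothing factor can be sketched as follows. The slides do not give a formula for the smoothing factor; the sketch assumes the form SF(Dx) = |D − Dx| · (Diss(D) − Diss(D − Dx)), that is, the reduction in dissimilarity scaled by the size of the remaining set. The brute-force search over small subsets is for illustration only and is not the sequential scan that the technique itself uses.

```python
from itertools import combinations

def dissimilarity(xs):
    """Variance: (1/n) * sum((x_i - mean)^2)."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / n

def smoothing_factor(data, subset_idx):
    """Assumed form: |D - Dx| * (Diss(D) - Diss(D - Dx))."""
    rest = [x for i, x in enumerate(data) if i not in subset_idx]
    return len(rest) * (dissimilarity(data) - dissimilarity(rest))

def exception_set(data, max_size=2):
    """Brute-force search over subsets of up to max_size objects (illustration only)."""
    best = max(
        (idx for r in range(1, max_size + 1)
         for idx in combinations(range(len(data)), r)),
        key=lambda idx: smoothing_factor(data, idx),
    )
    return [data[i] for i in best]

print(exception_set([3, 4, 5, 4, 3, 40]))   # the value 40 is the exception set
```
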
Deviation-based Outlier Detection
• OLAP data cube technique
  • Uses data cubes to identify regions of anomalies in large multidimensional data
  • A cell value in the cube is an exception if it differs significantly from its expected value
  • Visual cues guide the user to exceptional cells
  • The user may drill down to investigate them
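
As a rough illustration of "differs significantly from an expected value", the sketch below works on a single two-dimensional slice of a cube. It assumes a simple additive model (expected value = row mean + column mean − grand mean) and a hypothetical threshold of two standard deviations of the residuals. The slides do not specify the statistical model, so this is only one possible reading.

```python
from statistics import mean, pstdev

def cube_exceptions(cells, threshold=2.0):
    """Flag cells of a 2-D cube slice that differ strongly from an expected value.

    Assumption: expected value = row mean + column mean - grand mean, and
    "significantly" means a residual larger in magnitude than `threshold`
    standard deviations of all residuals.
    """
    row_means = [mean(row) for row in cells]
    col_means = [mean(col) for col in zip(*cells)]
    grand = mean(row_means)
    residuals = [[v - (row_means[i] + col_means[j] - grand)
                  for j, v in enumerate(row)]
                 for i, row in enumerate(cells)]
    sigma = pstdev([r for row in residuals for r in row])
    return [(i, j) for i, row in enumerate(residuals)
            for j, r in enumerate(row) if sigma > 0 and abs(r) > threshold * sigma]

sales = [
    [10, 12, 11, 13],
    [20, 22, 21, 23],
    [30, 32, 31, 90],   # last cell is far above what the row/column pattern suggests
    [40, 42, 41, 43],
]
print(cube_exceptions(sales))   # flags the cell at row 2, column 3
```
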

3.7 outlier analysis
