SlideShare une entreprise Scribd logo
1  sur  29
A Brief Presentation on Data Mining
Jason Rodrigues
Data Preprocessing
• Introduction
• Why data proprocessing?
• Data Cleaning
• Data Integration and
Transformation
• Data Reduction
• Discretization and concept
Heirarchy generation
• Takeaways
Agenda
Why Data Preprocessing?
Data in the real world is dirty

incomplete: lacking attribute values, lacking certain attributes
of interest, or containing only aggregate data

noisy: containing errors or outliers

inconsistent: containing discrepancies in codes or names
No quality data, no quality mining results!

Quality decisions must be based on quality data

Data warehouse needs consistent integration of quality data
A multi-dimensional measure of data quality

A well-accepted multi-dimensional view:

accuracy, completeness, consistency, timeliness, believability, value
added, interpretability, accessibility
Broad categories

intrinsic, contextual, representational, and accessibility
Data Preprocessing
Major Tasks of Data Preprocessing
Data cleaning

Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
Data integration

Integration of multiple databases, data cubes, files, or notes
Data trasformation

Normalization (scaling to a specific range)

Aggregation
Data reduction

Obtains reduced representation in volume but produces the same or
similar analytical results

Data discretization: with particular importance, especially for numerical
data

Data aggregation, dimensionality reduction, data compression,
generalization
Data Preprocessing
Major Tasks of Data Preprocessing
Data Cleaning
Data Integration
Databases
Data
Warehouse
Task-relevant Data
Selection
Data Mining
Pattern Evaluation
Data Cleaning
Tasks of Data Cleaning

Fill in missing values

Identify outliers and smooth noisy data

Correct inconsistent data
Data Cleaning
Manage Missing Data

Ignore the tuple: usually done when class label is missing (assuming the
task is classification—not effective in certain cases)

Fill in the missing value manually: tedious + infeasible?

Use a global constant to fill in the missing value: e.g., “unknown”, a
new class?!

Use the attribute mean to fill in the missing value

Use the attribute mean for all samples of the same class to fill in
the missing value: smarter

Use the most probable value to fill in the missing value: inference-
based such as regression, Bayesian formula, decision tree
Data Cleaning
Manage Noisy Data
Binning Method:

first sort data and partition into (equi-depth) bins

then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc
Clustering:

detect and remove outliers
Semi Automated

Computer and Manual Intervention
Regression

Use regression functions
Data Cleaning
Cluster Analysis
Data Cleaning
Regression Analysis
x
y
y = x + 1
X1
Y1
Y1’
•Linear regression (best line to fit
two variables)
•Multiple linear regression (more
than two variables, fit to a
multidimensional surface
Data Cleaning
Inconsistant Data

Manual correction using external
references

Semi-automatic using various tools
− To detect violation of known functional
dependencies and data constraints
− To correct redundant data
Data integration and transformation
Tasks of Data Integration and transformation

Data integration:
− combines data from multiple sources into a coherent
store

Schema integration
− integrate metadata from different sources
− Entity identification problem: identify real world entities
from multiple data sources, e.g., A.cust-id ≡ B.cust-#

Detecting and resolving data value conflicts
− for the same real world entity, attribute values from
different sources are different
− possible reasons: different representations, different
scales, e.g., metric vs. British units, different currency
Manage Data Integration
Data integration and transformation

Redundant data occur often when integrating multiple DBs
− The same attribute may have different names in different databases
− One attribute may be a “derived” attribute in another table, e.g., annual
revenue

Redundant data may be able to be detected by correlational analysis
• Careful integration can help reduce/avoid redundancies and
inconsistencies and improve mining speed and quality
BA
BA
n
BBAA
r
σσ)1(
))((
,
−
−−Σ
=
Manage Data Transformation
Data integration and transformation
 Smoothing: remove noise from data (binning, clustering,
regression)
 Aggregation: summarization, data cube construction
 Generalization: concept hierarchy climbing
 Normalization: scaled to fall within a small, specified range
− min-max normalization
− z-score normalization
− normalization by decimal scaling
 Attribute/feature construction
− New attributes constructed from the given ones
Manage Data Reduction
Data reduction
Data reduction: reduced representation, while still retaining critical
information

Data cube aggregation

Dimensionality reduction

Data compression

Numerosity reduction

Discretization and concept hierarchy generation
Data Cube Aggregation
Data reduction

Multiple levels of aggregation in data cubes
− Further reduce the size of data to deal with

Reference appropriate levels Use the smallest representation capable
to solve the task
Data Compression
Data reduction

String compression
− There are extensive theories and well-tuned algorithms
− Typically lossless
− But only limited manipulation is possible without expansion

Audio/video, image compression
− Typically lossy compression, with progressive refinement
− Sometimes small fragments of signal can be reconstructed
without reconstructing the whole

Time sequence is not audio
− Typically short and vary slowly with time
``
Decision Tree
Data reduction
Similarities and Dissimilarities
Proximity

Proximity is used to refer to Similarity or
Dissimilarity, since proximity between the
object is a function of proximity between
the corresponding attributes of two objects.

Similarity: Numeric measure of the degree
to which the two objects are alike.

Dissimilarity: Numeric measure of the
degree to which the two objects are
different.
Dissimilarities between Data Objects?

Similarity
− Numerical measure of how alike two data
objects are.
− Is higher when objects are more alike.
− Often falls in the range [0,1]

Dissimilarity
− Numerical measure of how different are two data
objects
− Lower when objects are more alike
− Minimum dissimilarity is often 0
− Upper limit varies

Proximity refers to a similarity or dissimilarity
Euclidean Distance

Euclidean Distance
Where n is the number of dimensions (attributes)
and pk and qk are, respectively, the kth
attributes
(components) or data objects p and q.

Standardization is necessary, if scales differ.
∑
=
−=
n
k
kk qpdist
1
2
)(
Euclidean Distance

Euclidean Distance
∑
=
−=
n
k
kk qpdist
1
2
)(
1 2 3 4 5 6 7 8
0
1
2
3
4
5
6
0
5
4
3
Column 2
Minkowski Distance
 r = 1. City block (Manhattan, taxicab, L1
norm)
distance.
− A common example of this is the Hamming distance, which is just
the number of bits that are different between two binary vectors

r = 2. Euclidean distance
 r → ∞. “supremum” (Lmax
norm, L∞
norm) distance.
− This is the maximum difference between any component of the
vectors
− Example: L_infinity of (1, 0, 2) and (6, 0, 3) = ??
− Do not confuse r with n, i.e., all these distances are
defined for all numbers of dimensions.
Minkowski Distance
point x y
p1 0 2
p2 2 0
p3 3 1
p4 5 1
L1 p1 p2 p3 p4
p1 0 4 4 6
p2 4 0 2 4
p3 4 2 0 2
p4 6 4 2 0
L2 p1 p2 p3 p4
p1 0 2.828 3.162 5.099
p2 2.828 0 1.414 3.162
p3 3.162 1.414 0 2
p4 5.099 3.162 2 0
L∞ p1 p2 p3 p4
p1 0 2 3 5
p2 2 0 1 3
p3 3 1 0 2
p4 5 3 2 0
Euclidean Distance Properties
• Distances, such as the Euclidean distance,
have some well known properties.
1. d(x, y) ≥ 0 for all x and y and d(x, y) = 0 only if
x = y. (Positive definiteness)
2. d(x, y) = d(y, x) for all x and q. (Symmetry)
3. d(x, y) ≤ d(x, y) + d(y, z) for all points x, y, and z.
(Triangle Inequality)
where d(x, y) is the distance (dissimilarity) between
points (data objects), x and y.
• A distance that satisfies these properties is a
metric, and a space is called a metric space
Non Metric Dissimilarities – Set Differences

non-metric measures are often robust
(resistant to outliers, errors in objects, etc.)
− the symmetry and mainly the triangular inequality
are often violated

cannot be directly
used with MAMs
a
b
a > b + c
c
a
b
a ≠ b
Non Metric Dissimilarities – Time

various k-median distances
− measure distance between the two (k-th) most
similar portions in objects

COSIMIR
− back-propagation network with single output
neuron serving as a distance, allows training

Dynamic Time Warping distance
− sequence alignment technique
− minimizes the sum of distances between
sequence elements
 fractional Lp distances
− generalization of Minkowski distances (p<1)
− more robust to extreme differences in coordinates
Jaccard Coeffificient

Recall: Jaccard coefficient is a commonly used
measure of overlap of two sets A and B
jaccard(A,B) = |A ∩ B| / |A ∪ B|
jaccard(A,A) = 1
jaccard(A,B) = 0 if A ∩ B = 0

A and B don’t have to be the same size.

JC always assigns a number between 0 and 1.
Takeaways
Why Data Preprocessing?
Data Cleaning
Data Integration and Transformation
Data Reduction
Discretization and concept Heirarchy
generation

Contenu connexe

Tendances

Data preprocessing PPT
Data preprocessing PPTData preprocessing PPT
Data preprocessing PPTANUSUYA T K
 
multi dimensional data model
multi dimensional data modelmulti dimensional data model
multi dimensional data modelmoni sindhu
 
Feature selection
Feature selectionFeature selection
Feature selectionDong Guo
 
1.2 steps and functionalities
1.2 steps and functionalities1.2 steps and functionalities
1.2 steps and functionalitiesKrish_ver2
 
Major issues in data mining
Major issues in data miningMajor issues in data mining
Major issues in data miningSlideshare
 
Data Preprocessing || Data Mining
Data Preprocessing || Data MiningData Preprocessing || Data Mining
Data Preprocessing || Data MiningIffat Firozy
 
Discretization and concept hierarchy(os)
Discretization and concept hierarchy(os)Discretization and concept hierarchy(os)
Discretization and concept hierarchy(os)snegacmr
 
Data Reduction
Data ReductionData Reduction
Data ReductionRajan Shah
 
Data preparation
Data preparationData preparation
Data preparationTony Nguyen
 
04 Classification in Data Mining
04 Classification in Data Mining04 Classification in Data Mining
04 Classification in Data MiningValerii Klymchuk
 
3.7 outlier analysis
3.7 outlier analysis3.7 outlier analysis
3.7 outlier analysisKrish_ver2
 
Classification and Regression
Classification and RegressionClassification and Regression
Classification and RegressionMegha Sharma
 
Data mining :Concepts and Techniques Chapter 2, data
Data mining :Concepts and Techniques Chapter 2, dataData mining :Concepts and Techniques Chapter 2, data
Data mining :Concepts and Techniques Chapter 2, dataSalah Amean
 
Dimensionality Reduction
Dimensionality ReductionDimensionality Reduction
Dimensionality ReductionSaad Elbeleidy
 

Tendances (20)

Data preprocessing PPT
Data preprocessing PPTData preprocessing PPT
Data preprocessing PPT
 
multi dimensional data model
multi dimensional data modelmulti dimensional data model
multi dimensional data model
 
Feature selection
Feature selectionFeature selection
Feature selection
 
1.2 steps and functionalities
1.2 steps and functionalities1.2 steps and functionalities
1.2 steps and functionalities
 
Major issues in data mining
Major issues in data miningMajor issues in data mining
Major issues in data mining
 
Data warehousing
Data warehousingData warehousing
Data warehousing
 
Data Preprocessing || Data Mining
Data Preprocessing || Data MiningData Preprocessing || Data Mining
Data Preprocessing || Data Mining
 
Discretization and concept hierarchy(os)
Discretization and concept hierarchy(os)Discretization and concept hierarchy(os)
Discretization and concept hierarchy(os)
 
Data Reduction
Data ReductionData Reduction
Data Reduction
 
Data mining primitives
Data mining primitivesData mining primitives
Data mining primitives
 
Data preparation
Data preparationData preparation
Data preparation
 
Data PreProcessing
Data PreProcessingData PreProcessing
Data PreProcessing
 
04 Classification in Data Mining
04 Classification in Data Mining04 Classification in Data Mining
04 Classification in Data Mining
 
3.7 outlier analysis
3.7 outlier analysis3.7 outlier analysis
3.7 outlier analysis
 
Data preprocess
Data preprocessData preprocess
Data preprocess
 
Neural network
Neural networkNeural network
Neural network
 
Classification and Regression
Classification and RegressionClassification and Regression
Classification and Regression
 
Data Mining: Outlier analysis
Data Mining: Outlier analysisData Mining: Outlier analysis
Data Mining: Outlier analysis
 
Data mining :Concepts and Techniques Chapter 2, data
Data mining :Concepts and Techniques Chapter 2, dataData mining :Concepts and Techniques Chapter 2, data
Data mining :Concepts and Techniques Chapter 2, data
 
Dimensionality Reduction
Dimensionality ReductionDimensionality Reduction
Dimensionality Reduction
 

En vedette

Chapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; KamberChapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kambererror007
 
Image processing fundamentals
Image processing fundamentalsImage processing fundamentals
Image processing fundamentalsA B Shinde
 
Clinical Decision Support Systems
Clinical Decision Support SystemsClinical Decision Support Systems
Clinical Decision Support Systemspradhasrini
 
Clinical decision support systems
Clinical decision support systemsClinical decision support systems
Clinical decision support systemsBhavitha Pulaparthi
 
Difference between molap, rolap and holap in ssas
Difference between molap, rolap and holap  in ssasDifference between molap, rolap and holap  in ssas
Difference between molap, rolap and holap in ssasUmar Ali
 
Cure, Clustering Algorithm
Cure, Clustering AlgorithmCure, Clustering Algorithm
Cure, Clustering AlgorithmLino Possamai
 
Density Based Clustering
Density Based ClusteringDensity Based Clustering
Density Based ClusteringSSA KPI
 
1.7 data reduction
1.7 data reduction1.7 data reduction
1.7 data reductionKrish_ver2
 
Application of data mining
Application of data miningApplication of data mining
Application of data miningSHIVANI SONI
 

En vedette (13)

Chapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; KamberChapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 7 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
 
Image processing fundamentals
Image processing fundamentalsImage processing fundamentals
Image processing fundamentals
 
Clinical Decision Support Systems
Clinical Decision Support SystemsClinical Decision Support Systems
Clinical Decision Support Systems
 
Clinical decision support systems
Clinical decision support systemsClinical decision support systems
Clinical decision support systems
 
Difference between molap, rolap and holap in ssas
Difference between molap, rolap and holap  in ssasDifference between molap, rolap and holap  in ssas
Difference between molap, rolap and holap in ssas
 
Database aggregation using metadata
Database aggregation using metadataDatabase aggregation using metadata
Database aggregation using metadata
 
Cure, Clustering Algorithm
Cure, Clustering AlgorithmCure, Clustering Algorithm
Cure, Clustering Algorithm
 
Density Based Clustering
Density Based ClusteringDensity Based Clustering
Density Based Clustering
 
1.7 data reduction
1.7 data reduction1.7 data reduction
1.7 data reduction
 
Application of data mining
Application of data miningApplication of data mining
Application of data mining
 
Apriori Algorithm
Apriori AlgorithmApriori Algorithm
Apriori Algorithm
 
OLAP
OLAPOLAP
OLAP
 
Data Mining: Association Rules Basics
Data Mining: Association Rules BasicsData Mining: Association Rules Basics
Data Mining: Association Rules Basics
 

Similaire à Data preprocessing

Data preprocessing 2
Data preprocessing 2Data preprocessing 2
Data preprocessing 2extraganesh
 
1.6.data preprocessing
1.6.data preprocessing1.6.data preprocessing
1.6.data preprocessingKrish_ver2
 
data clean.ppt
data clean.pptdata clean.ppt
data clean.pptchatbot9
 
Data preprocessing ng
Data preprocessing   ngData preprocessing   ng
Data preprocessing ngsaranya12345
 
Preprocessing.ppt
Preprocessing.pptPreprocessing.ppt
Preprocessing.pptcongtran88
 
Datapreprocessingppt
DatapreprocessingpptDatapreprocessingppt
DatapreprocessingpptShree Hari
 
Data preperation
Data preperationData preperation
Data preperationFraboni Ec
 
prvg4sczsginx3ynyqlc-signature-b84f0cf1da1e7d0fde4ecfab2a28f243cfa561f9aa2c9b...
prvg4sczsginx3ynyqlc-signature-b84f0cf1da1e7d0fde4ecfab2a28f243cfa561f9aa2c9b...prvg4sczsginx3ynyqlc-signature-b84f0cf1da1e7d0fde4ecfab2a28f243cfa561f9aa2c9b...
prvg4sczsginx3ynyqlc-signature-b84f0cf1da1e7d0fde4ecfab2a28f243cfa561f9aa2c9b...ImXaib
 
Data preparation
Data preparationData preparation
Data preparationJames Wong
 
Preprocessing.ppt
Preprocessing.pptPreprocessing.ppt
Preprocessing.pptchatbot9
 

Similaire à Data preprocessing (20)

Data preprocessing 2
Data preprocessing 2Data preprocessing 2
Data preprocessing 2
 
1.6.data preprocessing
1.6.data preprocessing1.6.data preprocessing
1.6.data preprocessing
 
data clean.ppt
data clean.pptdata clean.ppt
data clean.ppt
 
Data1
Data1Data1
Data1
 
Data1
Data1Data1
Data1
 
Data preprocessing ng
Data preprocessing   ngData preprocessing   ng
Data preprocessing ng
 
Preprocessing.ppt
Preprocessing.pptPreprocessing.ppt
Preprocessing.ppt
 
Data Mining
Data MiningData Mining
Data Mining
 
Datapreprocessingppt
DatapreprocessingpptDatapreprocessingppt
Datapreprocessingppt
 
Datapreprocessing
DatapreprocessingDatapreprocessing
Datapreprocessing
 
Data preperation
Data preperationData preperation
Data preperation
 
Data preperation
Data preperationData preperation
Data preperation
 
Data preperation
Data preperationData preperation
Data preperation
 
Data preparation
Data preparationData preparation
Data preparation
 
Data preparation
Data preparationData preparation
Data preparation
 
prvg4sczsginx3ynyqlc-signature-b84f0cf1da1e7d0fde4ecfab2a28f243cfa561f9aa2c9b...
prvg4sczsginx3ynyqlc-signature-b84f0cf1da1e7d0fde4ecfab2a28f243cfa561f9aa2c9b...prvg4sczsginx3ynyqlc-signature-b84f0cf1da1e7d0fde4ecfab2a28f243cfa561f9aa2c9b...
prvg4sczsginx3ynyqlc-signature-b84f0cf1da1e7d0fde4ecfab2a28f243cfa561f9aa2c9b...
 
Data preparation
Data preparationData preparation
Data preparation
 
Preprocessing.ppt
Preprocessing.pptPreprocessing.ppt
Preprocessing.ppt
 
Preprocessing.ppt
Preprocessing.pptPreprocessing.ppt
Preprocessing.ppt
 
Preprocessing.ppt
Preprocessing.pptPreprocessing.ppt
Preprocessing.ppt
 

Plus de Jason Rodrigues

Johari WIndow in PPT.pptx
Johari WIndow in PPT.pptxJohari WIndow in PPT.pptx
Johari WIndow in PPT.pptxJason Rodrigues
 
Paris Conference on Applied Psychology
Paris Conference on Applied PsychologyParis Conference on Applied Psychology
Paris Conference on Applied PsychologyJason Rodrigues
 
Design and documentation of software architectures
Design and documentation of software architecturesDesign and documentation of software architectures
Design and documentation of software architecturesJason Rodrigues
 
Its all about data mining
Its all about data miningIts all about data mining
Its all about data miningJason Rodrigues
 
A Sales Approach For Cloud Computing
A Sales Approach For Cloud ComputingA Sales Approach For Cloud Computing
A Sales Approach For Cloud ComputingJason Rodrigues
 

Plus de Jason Rodrigues (9)

Johari WIndow in PPT.pptx
Johari WIndow in PPT.pptxJohari WIndow in PPT.pptx
Johari WIndow in PPT.pptx
 
Startup and incubation
Startup and incubationStartup and incubation
Startup and incubation
 
Paris Conference on Applied Psychology
Paris Conference on Applied PsychologyParis Conference on Applied Psychology
Paris Conference on Applied Psychology
 
Rodrigues
RodriguesRodrigues
Rodrigues
 
Safety Presentation
Safety PresentationSafety Presentation
Safety Presentation
 
Design and documentation of software architectures
Design and documentation of software architecturesDesign and documentation of software architectures
Design and documentation of software architectures
 
Wrap up
Wrap upWrap up
Wrap up
 
Its all about data mining
Its all about data miningIts all about data mining
Its all about data mining
 
A Sales Approach For Cloud Computing
A Sales Approach For Cloud ComputingA Sales Approach For Cloud Computing
A Sales Approach For Cloud Computing
 

Dernier

Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 

Dernier (20)

Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 

Data preprocessing

  • 1. A Brief Presentation on Data Mining Jason Rodrigues Data Preprocessing
  • 2. • Introduction • Why data proprocessing? • Data Cleaning • Data Integration and Transformation • Data Reduction • Discretization and concept Heirarchy generation • Takeaways Agenda
  • 3. Why Data Preprocessing? Data in the real world is dirty  incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data  noisy: containing errors or outliers  inconsistent: containing discrepancies in codes or names No quality data, no quality mining results!  Quality decisions must be based on quality data  Data warehouse needs consistent integration of quality data A multi-dimensional measure of data quality  A well-accepted multi-dimensional view:  accuracy, completeness, consistency, timeliness, believability, value added, interpretability, accessibility Broad categories  intrinsic, contextual, representational, and accessibility
  • 4. Data Preprocessing Major Tasks of Data Preprocessing Data cleaning  Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies Data integration  Integration of multiple databases, data cubes, files, or notes Data trasformation  Normalization (scaling to a specific range)  Aggregation Data reduction  Obtains reduced representation in volume but produces the same or similar analytical results  Data discretization: with particular importance, especially for numerical data  Data aggregation, dimensionality reduction, data compression, generalization
  • 5. Data Preprocessing Major Tasks of Data Preprocessing Data Cleaning Data Integration Databases Data Warehouse Task-relevant Data Selection Data Mining Pattern Evaluation
  • 6. Data Cleaning Tasks of Data Cleaning  Fill in missing values  Identify outliers and smooth noisy data  Correct inconsistent data
  • 7. Data Cleaning Manage Missing Data  Ignore the tuple: usually done when class label is missing (assuming the task is classification—not effective in certain cases)  Fill in the missing value manually: tedious + infeasible?  Use a global constant to fill in the missing value: e.g., “unknown”, a new class?!  Use the attribute mean to fill in the missing value  Use the attribute mean for all samples of the same class to fill in the missing value: smarter  Use the most probable value to fill in the missing value: inference- based such as regression, Bayesian formula, decision tree
  • 8. Data Cleaning Manage Noisy Data Binning Method:  first sort data and partition into (equi-depth) bins  then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc Clustering:  detect and remove outliers Semi Automated  Computer and Manual Intervention Regression  Use regression functions
  • 10. Data Cleaning Regression Analysis x y y = x + 1 X1 Y1 Y1’ •Linear regression (best line to fit two variables) •Multiple linear regression (more than two variables, fit to a multidimensional surface
  • 11. Data Cleaning Inconsistant Data  Manual correction using external references  Semi-automatic using various tools − To detect violation of known functional dependencies and data constraints − To correct redundant data
  • 12. Data integration and transformation Tasks of Data Integration and transformation  Data integration: − combines data from multiple sources into a coherent store  Schema integration − integrate metadata from different sources − Entity identification problem: identify real world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#  Detecting and resolving data value conflicts − for the same real world entity, attribute values from different sources are different − possible reasons: different representations, different scales, e.g., metric vs. British units, different currency
  • 13. Manage Data Integration Data integration and transformation  Redundant data occur often when integrating multiple DBs − The same attribute may have different names in different databases − One attribute may be a “derived” attribute in another table, e.g., annual revenue  Redundant data may be able to be detected by correlational analysis • Careful integration can help reduce/avoid redundancies and inconsistencies and improve mining speed and quality BA BA n BBAA r σσ)1( ))(( , − −−Σ =
  • 14. Manage Data Transformation Data integration and transformation  Smoothing: remove noise from data (binning, clustering, regression)  Aggregation: summarization, data cube construction  Generalization: concept hierarchy climbing  Normalization: scaled to fall within a small, specified range − min-max normalization − z-score normalization − normalization by decimal scaling  Attribute/feature construction − New attributes constructed from the given ones
  • 15. Manage Data Reduction Data reduction Data reduction: reduced representation, while still retaining critical information  Data cube aggregation  Dimensionality reduction  Data compression  Numerosity reduction  Discretization and concept hierarchy generation
  • 16. Data Cube Aggregation Data reduction  Multiple levels of aggregation in data cubes − Further reduce the size of data to deal with  Reference appropriate levels Use the smallest representation capable to solve the task
  • 17. Data Compression Data reduction  String compression − There are extensive theories and well-tuned algorithms − Typically lossless − But only limited manipulation is possible without expansion  Audio/video, image compression − Typically lossy compression, with progressive refinement − Sometimes small fragments of signal can be reconstructed without reconstructing the whole  Time sequence is not audio − Typically short and vary slowly with time ``
  • 19. Similarities and Dissimilarities Proximity  Proximity is used to refer to Similarity or Dissimilarity, since proximity between the object is a function of proximity between the corresponding attributes of two objects.  Similarity: Numeric measure of the degree to which the two objects are alike.  Dissimilarity: Numeric measure of the degree to which the two objects are different.
  • 20. Dissimilarities between Data Objects?  Similarity − Numerical measure of how alike two data objects are. − Is higher when objects are more alike. − Often falls in the range [0,1]  Dissimilarity − Numerical measure of how different are two data objects − Lower when objects are more alike − Minimum dissimilarity is often 0 − Upper limit varies  Proximity refers to a similarity or dissimilarity
  • 21. Euclidean Distance  Euclidean Distance Where n is the number of dimensions (attributes) and pk and qk are, respectively, the kth attributes (components) or data objects p and q.  Standardization is necessary, if scales differ. ∑ = −= n k kk qpdist 1 2 )(
  • 22. Euclidean Distance  Euclidean Distance ∑ = −= n k kk qpdist 1 2 )( 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 0 5 4 3 Column 2
  • 23. Minkowski Distance  r = 1. City block (Manhattan, taxicab, L1 norm) distance. − A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors  r = 2. Euclidean distance  r → ∞. “supremum” (Lmax norm, L∞ norm) distance. − This is the maximum difference between any component of the vectors − Example: L_infinity of (1, 0, 2) and (6, 0, 3) = ?? − Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.
  • 24. Minkowski Distance point x y p1 0 2 p2 2 0 p3 3 1 p4 5 1 L1 p1 p2 p3 p4 p1 0 4 4 6 p2 4 0 2 4 p3 4 2 0 2 p4 6 4 2 0 L2 p1 p2 p3 p4 p1 0 2.828 3.162 5.099 p2 2.828 0 1.414 3.162 p3 3.162 1.414 0 2 p4 5.099 3.162 2 0 L∞ p1 p2 p3 p4 p1 0 2 3 5 p2 2 0 1 3 p3 3 1 0 2 p4 5 3 2 0
  • 25. Euclidean Distance Properties • Distances, such as the Euclidean distance, have some well known properties. 1. d(x, y) ≥ 0 for all x and y and d(x, y) = 0 only if x = y. (Positive definiteness) 2. d(x, y) = d(y, x) for all x and q. (Symmetry) 3. d(x, y) ≤ d(x, y) + d(y, z) for all points x, y, and z. (Triangle Inequality) where d(x, y) is the distance (dissimilarity) between points (data objects), x and y. • A distance that satisfies these properties is a metric, and a space is called a metric space
  • 26. Non Metric Dissimilarities – Set Differences  non-metric measures are often robust (resistant to outliers, errors in objects, etc.) − the symmetry and mainly the triangular inequality are often violated  cannot be directly used with MAMs a b a > b + c c a b a ≠ b
  • 27. Non Metric Dissimilarities – Time  various k-median distances − measure distance between the two (k-th) most similar portions in objects  COSIMIR − back-propagation network with single output neuron serving as a distance, allows training  Dynamic Time Warping distance − sequence alignment technique − minimizes the sum of distances between sequence elements  fractional Lp distances − generalization of Minkowski distances (p<1) − more robust to extreme differences in coordinates
  • 28. Jaccard Coeffificient  Recall: Jaccard coefficient is a commonly used measure of overlap of two sets A and B jaccard(A,B) = |A ∩ B| / |A ∪ B| jaccard(A,A) = 1 jaccard(A,B) = 0 if A ∩ B = 0  A and B don’t have to be the same size.  JC always assigns a number between 0 and 1.
  • 29. Takeaways Why Data Preprocessing? Data Cleaning Data Integration and Transformation Data Reduction Discretization and concept Heirarchy generation