Types of Data & Data Preprocessing
Prof. Navneet Goyal
Department of Computer Science & Information Systems
BITS, Pilani
Data Preprocessing
 Why preprocess the data?
 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy generation
 Summary
Why Preprocess Data?
 Data in the real world is dirty
 Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
 Noisy: containing errors or outliers
 Inconsistent: containing discrepancies in codes or names
 Welcome to the real world!
 No quality data, no quality mining results!
 Quality decisions must be based on quality data
Understanding Your Data
 Descriptive data summarization
 Foundation for data preprocessing
 Central tendency: mean, mode, median
 Data dispersion: quartiles, interquartile range (IQR), variance
 Distributive measure: sum, count, max, min
 Algebraic measure: an algebraic function of one or more distributive measures
 Examples: average, weighted average
Understanding Your Data
 Mean is sensitive to extreme values
 Solution: trimming (discard the extreme low and high values before averaging)
 For skewed data, the median is a better measure (middle value of the ordered set)
 Holistic measure: cannot be computed by partitioning the data
 Example: median
 Computationally more expensive
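To make the contrast concrete, here is a minimal sketch (not from the slides; values and the 10% trim fraction are illustrative) comparing the mean, a trimmed mean, and the median on data containing one extreme value, using only the Python standard library:

```python
# Mean vs. trimmed mean vs. median on data with an outlier.
values = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 340]  # 340 is an outlier

def trimmed_mean(data, frac=0.1):
    """Drop the lowest and highest frac of values, then average the rest."""
    data = sorted(data)
    k = int(len(data) * frac)
    kept = data[k:len(data) - k] if k else data
    return sum(kept) / len(kept)

def median(data):
    data = sorted(data)
    n = len(data)
    mid = n // 2
    return data[mid] if n % 2 else (data[mid - 1] + data[mid]) / 2

print(sum(values) / len(values))  # mean ~45.8: pulled up by the outlier
print(trimmed_mean(values, 0.1))  # 20.6: closer to the bulk of the data
print(median(values))             # 22.5: robust to the extreme value
```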
Understanding Your Data
 Mode: most frequently occurring data value
 Unimodal, bimodal, trimodal, multimodal
 No mode!!
 Dispersion of data: range, quartiles, outliers
 Range = max − min
 The kth percentile of a set of data in numerical order is the value xi having the property that k% of data values lie at or below xi
 Median is the 50th percentile
 Quartiles: Q1 (25th percentile), Q3 (75th percentile)
 Give an idea about the center, spread, & shape of the distribution
 IQR = Q3 − Q1 (all holistic measures)
Understanding Your Data
 Outliers: single out values falling at least 1.5 × IQR above Q3 or below Q1
 Which of the measures discussed so far are one or more of the data values?
 Five-number summary: minimum, Q1, median, Q3, maximum
(minimum and maximum are added since Q1, median, Q3 contain no information about the tails)
 Boxplots
 Variance & std. dev.
 Interpret σ = 0 & σ > 0
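A hedged NumPy sketch of the five-number summary and the 1.5 × IQR outlier rule described above; the data extends the deck's later price example with an extra extreme value (120) purely for illustration:

```python
import numpy as np

data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, 120])

q1, med, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
print("five-number summary:", (data.min(), q1, med, q3, data.max()))

# Values at least 1.5 * IQR above Q3 or below Q1 are singled out as outliers.
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print("outliers:", data[(data < low) | (data > high)])  # 120 falls above Q3 + 1.5*IQR
```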
Major Tasks in Data Preprocessing
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data transformation
 Normalization and aggregation
Major Tasks in Data Preprocessing
 Data reduction (sampling)
 Obtains a reduced representation in volume but produces the same or similar analytical results
 Data discretization
 Part of data reduction but with particular importance, especially for numerical data
Forms of data preprocessing
[Figure taken from Han & Kamber, Data Mining: Concepts and Techniques, 2e]
Data Cleaning
 Data cleaning tasks
 Fill in missing values
 Identify outliers and smooth out noisy data
 Correct inconsistent data
Missing Data
 Data is not always available
 E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
 Missing data may be due to
 equipment malfunction
 data deleted because it was inconsistent with other recorded data
 data not entered due to misunderstanding
 certain data not considered important at the time of entry
 history or changes of the data not registered
 Missing data may need to be inferred.
How to Handle Missing Data?
 Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
 Fill in the missing value manually: tedious + infeasible?
 Use a global constant to fill in the missing value: e.g., “unknown”, a new class?!
 Use the attribute mean to fill in the missing value
 Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter
 Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or decision tree
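A minimal pandas sketch of three of the strategies above (global constant, attribute mean, and per-class attribute mean); the column names `cls` and `income` are made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "cls":    ["a", "a", "b", "b", "b"],
    "income": [30.0, np.nan, 50.0, np.nan, 70.0],
})

df["const"] = df["income"].fillna(-1)                    # global constant / "unknown"
df["mean"] = df["income"].fillna(df["income"].mean())    # attribute mean
df["cls_mean"] = df.groupby("cls")["income"].transform(
    lambda s: s.fillna(s.mean()))                        # mean within the same class
print(df)
```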
Noisy Data
 Noise: random error or variance in a measured variable
 Incorrect attribute values may be due to
 faulty data collection instruments
 data entry problems
 data transmission problems
 technology limitations
 inconsistency in naming conventions
 Other data problems which require data cleaning
 duplicate records
 incomplete data
 inconsistent data
 Smooth out the data to remove noise
Smoothing Techniques
 Binning method:
 first sort data and partition into (equi-depth) bins
 then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc.
 Clustering
 detect and remove outliers
 Combined computer and human inspection
 detect suspicious values and have a human check them
 Regression
 smooth by fitting the data to regression functions
Binning
 Binning methods smooth a sorted data value by consulting its neighborhood, that is, the values around it
 Sorted values are distributed into a number of ‘buckets’ or ‘bins’
 Binning does local smoothing
 Different binning methods are illustrated by an example below
 Also used as a data discretization technique
Simple Discretization Methods: Binning
 Equal-width (distance) partitioning:
 It divides the range into N intervals of equal size: uniform grid
 if A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A)/N (see the sketch below)
 Most straightforward
 But outliers may dominate the presentation
 Skewed data is not handled well.
 Equal-depth (frequency) partitioning:
 It divides the range into N intervals, each containing approximately the same number of samples
 Good data scaling
 Managing categorical attributes can be tricky.
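A small NumPy sketch contrasting the two partitionings: equal-width edges computed as W = (B − A)/N, and equal-depth edges taken at quantiles (the data values are illustrative):

```python
import numpy as np

data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
N = 3

A, B = data.min(), data.max()
width_edges = A + (B - A) / N * np.arange(N + 1)           # [4., 14., 24., 34.]
depth_edges = np.quantile(data, np.linspace(0, 1, N + 1))  # ~equal counts per bin
print(width_edges)
print(depth_edges)
```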
Binning Methods for Data Smoothing
 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
 Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
 Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
 Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
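A sketch reproducing the example above: equi-depth bins of size 4, smoothed by bin means (rounded to integers, as on the slide) and by bin boundaries, where each value is replaced by the closer of its bin's min or max:

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]

by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

def nearest_boundary(v, lo, hi):
    """Replace v by whichever bin boundary (min or max) is closer."""
    return lo if v - lo <= hi - v else hi

by_bounds = [[nearest_boundary(v, b[0], b[-1]) for v in b] for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```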
Regression
[Figure: data points in the (x, y) plane with a fitted regression line y = x + 1; a noisy observation Y1 at X1 is replaced by its smoothed value Y1′ on the line.]
Cluster Analysis
[Figure: data points grouped into clusters; values falling outside every cluster may be treated as outliers.]
Data Smoothing & Reduction
 Many of the methods discussed above for data smoothing are also methods for data reduction involving discretization
 For example, binning reduces the number of distinct values per attribute (a form of data reduction for logic-based data mining methods such as decision tree induction)
Data Integration
 Data integration:
 combines data from multiple sources into a coherent store
 Schema integration
 integrate metadata from different sources
 Entity identification problem: identify real world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
 Detecting and resolving data value conflicts
 for the same real world entity, attribute values from different sources are different
 possible reasons: different representations, different scales, e.g., metric vs. British units
Handling Redundant Data in Data Integration
 Redundant data often occur when integrating multiple databases
 The same attribute may have different names in different databases
 One attribute may be a “derived” attribute in another table, e.g., annual revenue
 Redundant data may be detected by correlation analysis (Pearson’s correlation coefficient, sketched below)
 Correlation does not imply causality
 Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
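A hedged sketch of flagging a possibly redundant attribute pair via Pearson's correlation coefficient; the attribute names and the 0.9 threshold are illustrative, not prescribed by the slides:

```python
import numpy as np

monthly_rev = np.array([10, 12, 9, 14, 11, 13], dtype=float)
annual_rev = monthly_rev * 12 + np.random.default_rng(0).normal(0, 1, 6)  # derived

r = np.corrcoef(monthly_rev, annual_rev)[0, 1]
if abs(r) > 0.9:  # threshold is a judgment call
    print(f"r = {r:.3f}: attributes look redundant; consider dropping one")
```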
Data Transformation
 Smoothing: remove noise from data
 Aggregation: summarization, data cube construction
 Generalization: concept hierarchy climbing
 Normalization: scaled to fall within a small, specified range
 min-max normalization
 z-score normalization
 normalization by decimal scaling
 Attribute/feature construction
 New attributes constructed from the given ones
Data Transformation: Normalization
 min-max normalization
v′ = (v − min_A)/(max_A − min_A) × (new_max_A − new_min_A) + new_min_A
 z-score normalization (zero-mean)
v′ = (v − mean_A)/stand_dev_A
 normalization by decimal scaling
v′ = v/10^j, where j is the smallest integer such that max(|v′|) < 1
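A minimal NumPy sketch of the three normalizations above, with a small illustrative vector:

```python
import numpy as np

def min_max(v, new_min=0.0, new_max=1.0):
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def z_score(v):
    return (v - v.mean()) / v.std()

def decimal_scaling(v):
    j = 0
    while np.max(np.abs(v / 10 ** j)) >= 1:  # smallest j with max(|v'|) < 1
        j += 1
    return v / 10 ** j

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])
print(min_max(v))          # [0.   0.125 0.25  0.5   1.  ]
print(z_score(v))          # zero mean, unit standard deviation
print(decimal_scaling(v))  # divided by 10^4 -> [0.02 0.03 0.04 0.06 0.1]
```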
Data Transformation: Attribute Construction
 New attributes are constructed from the given attributes and added
 Improves accuracy
 Helps in understanding the structure of high-dimensional data
 For example, add an area attribute based on the attributes height & width
 Knowing about relationships among attributes helps in knowledge discovery
Data Reduction Strategies
 Warehouse may store terabytes of data: complex data analysis/mining may take a very long time to run on the complete data set
 Data reduction
 Obtains a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results
 Data reduction strategies
 Data cube aggregation
 Attribute subset selection (feature subset selection)
 Dimensionality reduction
 Numerosity reduction
 Discretization and concept hierarchy generation
Data Cube Aggregation
 Cube at the lowest level of abstraction – base cuboid
 Cube at the highest level of abstraction – apex cuboid
 Cubes are created at various levels of abstraction, depending upon the analysis task – cuboids
 A cube is a lattice of cuboids
 Data volume reduces as we move up from the base to the apex cuboid
 While doing data mining, the smallest available cuboid relevant to the given task should be used
 Cube aggregation gives smaller data without loss of information necessary for the analysis task
Attribute subset selection
 Also called feature subset selection
 Leave out irrelevant attributes and pick only relevant attributes
 A difficult and time-consuming process
 Reduces the data size by removing irrelevant or redundant attributes (dimensions)
 Goal is to select a minimum set of features such that the resulting probability distribution of data classes is as close as possible to the original distribution given the values of all features
 Additional benefit: fewer attributes appear in discovered patterns, making interpretation easier
Attribute subset selection
 How to select a good representative subset?
 For N attributes, there are 2^N possible subsets
 Heuristic methods that explore a reduced search space are generally used (due to the exponential number of choices)
 Greedy algorithms
 Heuristic methods:
 step-wise forward selection (see the sketch below)
 step-wise backward elimination
 combining forward selection and backward elimination
 decision-tree induction
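A hedged sketch of step-wise forward selection: greedily add the attribute that most improves a scoring function until no remaining attribute helps. The `score` function here is a hypothetical stand-in for any real evaluation, e.g. cross-validated accuracy:

```python
def forward_selection(attributes, score):
    selected = []
    best = score(selected)
    while True:
        gains = [(score(selected + [a]), a)
                 for a in attributes if a not in selected]
        if not gains:
            break
        top_score, top_attr = max(gains)
        if top_score <= best:  # no remaining attribute improves the score
            break
        selected.append(top_attr)
        best = top_score
    return selected

# Toy scoring function: pretend A1, A4, A6 are the informative attributes.
informative = {"A1": 0.3, "A4": 0.4, "A6": 0.2}
score = lambda subset: sum(informative.get(a, -0.05) for a in subset)
print(forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], score))
# -> ['A4', 'A1', 'A6']
```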
Example of Decision Tree Induction
Initial attribute set: {A1, A2, A3, A4, A5, A6}
[Figure: decision tree with A4 at the root, branching to tests on A1 and A6; leaves are labeled Class 1 and Class 2.]
Reduced attribute set: {A1, A4, A6}
Wavelet Transforms
[Figure: Haar-2 and Daubechies-4 wavelet families; taken from Han & Kamber, Data Mining: Concepts and Techniques, 2e]
 Discrete wavelet transform (DWT): linear signal processing
 Compressed approximation: store only a small fraction of the strongest wavelet coefficients
 Similar to the discrete Fourier transform (DFT), but better lossy compression, localized in space
 Method:
 The length, L, must be an integer power of 2 (pad with 0s when necessary)
 Each transform has 2 functions: smoothing and difference
 Applies to pairs of data, resulting in two sets of data of length L/2
 Applies the two functions recursively until the desired length is reached
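A minimal sketch of the method above for the Haar wavelet, assuming the common convention of pairwise averages for the smoothing function and halved pairwise differences for the difference function; the averages are recursed on until length 1:

```python
def haar_dwt(x):
    """x must have length 2^k. Returns [coarsest average, detail coefficients...]."""
    assert len(x) & (len(x) - 1) == 0, "pad with 0s to a power of 2 first"
    details = []
    while len(x) > 1:
        avg = [(a + b) / 2 for a, b in zip(x[0::2], x[1::2])]   # smoothing
        diff = [(a - b) / 2 for a, b in zip(x[0::2], x[1::2])]  # difference
        details = diff + details
        x = avg
    return x + details

print(haar_dwt([2, 2, 0, 2, 3, 5, 4, 4]))
# Keeping only the largest-magnitude coefficients gives a compressed approximation.
```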
Principal Component Analysis
 Given N data vectors in k dimensions, find c ≤ k orthogonal vectors that can best be used to represent the data
 The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions)
 Each data vector is a linear combination of the c principal component vectors
 Works for numeric data only
 Used when the number of dimensions is large
Principal Component Analysis
[Figure: data in the (X1, X2) plane with principal axes Y1 and Y2; Y1 points along the direction of greatest variance.]
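A hedged PCA sketch using NumPy's SVD, one common way to implement the reduction described above: center the data, take the top c right singular vectors as principal components, and project onto them. The data here is synthetic, with one nearly redundant dimension:

```python
import numpy as np

def pca_reduce(X, c):
    """Reduce an N x k data matrix X to N x c (c <= k) principal components."""
    Xc = X - X.mean(axis=0)               # PCA requires centered data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:c].T                  # coordinates along the top-c components

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 4] = X[:, 0] + 0.01 * rng.normal(size=100)  # a nearly redundant dimension
print(pca_reduce(X, 2).shape)                    # (100, 2)
```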
Numerosity Reduction
 Can we reduce the data volume by choosing alternative, ‘smaller’ forms of data representation?
 Techniques:
 Parametric methods
 Non-parametric methods
Numerosity Reduction
 Parametric methods
 Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
 Example: log-linear models
 Non-parametric methods
 Do not assume models; store reduced representations of the data
 Major families: histograms, clustering, sampling
Regression and Log-Linear Models
 Linear regression: data are modeled to fit a straight line
 Often uses the least-squares method to fit the line
 Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
 Log-linear model: approximates discrete multidimensional probability distributions
Regression Analysis and Log-Linear Models
 Linear regression: Y = α + β X
 The two parameters, α and β, specify the line and are estimated using the data at hand
 by applying the least-squares criterion to the known values of Y1, Y2, …, X1, X2, …
 Multiple regression: Y = b0 + b1 X1 + b2 X2
 Many nonlinear functions can be transformed into the above.
 Log-linear models:
 The multi-way table of joint probabilities is approximated by a product of lower-order tables.
 Probability: p(a, b, c, d) = α_ab β_ac χ_ad δ_bcd
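A minimal sketch of estimating α and β by least squares for Y = α + β X, using the closed-form solution β = cov(X, Y)/var(X), α = mean(Y) − β·mean(X); the data is synthetic, generated near y = x + 1:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = 1.0 + 1.0 * X + np.random.default_rng(0).normal(0, 0.1, 5)  # near y = x + 1

beta = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
alpha = Y.mean() - beta * X.mean()
print(f"Y ≈ {alpha:.2f} + {beta:.2f} X")  # only these two parameters are stored
```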
Histograms
 A popular data reduction technique
 Divide data into buckets and store the average (or sum) for each bucket
 Can be constructed optimally in one dimension using dynamic programming
 Related to quantization problems
[Figure: histogram of prices with buckets spanning roughly 10,000–90,000 and bucket counts up to 40.]
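A small sketch of histogram-based reduction with NumPy: equal-width buckets over a price attribute, storing only the bucket edges and counts instead of the raw values (the data and the bucket count of 8 are illustrative; the range loosely matches the figure's axis):

```python
import numpy as np

prices = np.random.default_rng(0).uniform(10_000, 90_000, 1_000)
counts, edges = np.histogram(prices, bins=8, range=(10_000, 90_000))

# Store (edges, counts) instead of the 1,000 raw values.
for lo, hi, n in zip(edges[:-1], edges[1:], counts):
    print(f"[{lo:>7.0f}, {hi:>7.0f}): {n}")
```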
Clustering
 Partition the data set into clusters, and store only the cluster representations
 Can be very effective if data is clustered but not if data is “smeared”
 Can have hierarchical clustering stored in multi-dimensional index tree structures
 There are many choices of clustering definitions and clustering algorithms, further detailed in Chapter 8
Sampling
 Allows a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
 Choose a representative subset of the data
 Simple random sampling may have very poor performance in the presence of skew
 Develop adaptive sampling methods
 Stratified sampling:
 Approximate the percentage of each class (or subpopulation of interest) in the overall database
 Used in conjunction with skewed data
 Sampling may not reduce database I/Os (page at a time).
Sampling
[Figure: SRSWOR (simple random sample without replacement) and SRSWR (simple random sample with replacement) drawn from the raw data; a second panel shows a cluster/stratified sample of the raw data.]
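A hedged stdlib sketch of the sampling schemes in the figure: SRSWOR, SRSWR, and a stratified sample that preserves class proportions in skewed data (the 90/10 class split and sample size are illustrative):

```python
import random
from collections import defaultdict

data = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]  # skewed

srswor = random.sample(data, 10)                   # without replacement
srswr = [random.choice(data) for _ in range(10)]   # with replacement

def stratified(data, n):
    """Sample from each class in proportion to its share of the data."""
    strata = defaultdict(list)
    for cls, x in data:
        strata[cls].append((cls, x))
    out = []
    for cls, items in strata.items():
        k = max(1, round(n * len(items) / len(data)))
        out += random.sample(items, k)
    return out

print(stratified(data, 10))  # ~9 from class "a", ~1 from class "b"
```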
