Data preprocessing PPT

Data Preprocessing
MS. T.K. ANUSUYA
ASSISTANT PROFESSOR
DEPARTMENT OF COMPUTER SCIENCE
BON SECOURS COLLEGE FOR WOMEN, THANJAVUR.

Why Data Pre-processing?
 Real-world data is often:
 Noisy: contains errors or outliers
 Missing/incomplete: lacking attribute values, e.g. name=""
 Duplicated: contains duplicate tuples
 Inconsistent, partly because real-world databases are typically huge
 Low-quality data
 leads to low-quality mining results.
 Data comes from different sources
 and so needs extraction, cleaning and transformation
2
Data Pre-processing

Multidimensional Measures of Data Quality
 Accuracy
 Completeness
 Consistency
 Timeliness
 Believability
 Interpretability

Data Pre-processing Techniques
 Data Cleaning
 Data integration
 Data reduction
 Data transformation

Data Pre-processing Techniques
 Data Cleaning
 Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies in "dirty" data
 Data Integration
 Integration of multiple databases, data cubes or files
 Data Transformation
 Normalization and aggregation
 Data Reduction
 Reduce the data size by compression, aggregation, or eliminating redundant features
 Dimensionality reduction: removing irrelevant attributes
 Numerosity reduction: replacing the data with smaller alternative representations, either parametric models (regression, log-linear models) or non-parametric models (e.g. histograms, clusters, sampling and data aggregation)

Data Cleaning
Fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.
• Missing Values
• Ignore the tuple: usually done when the class label is missing
• Fill in the missing value manually: tedious and often infeasible
• Use a global constant (e.g. "unknown") to fill in the missing value; the mining program may mistake it for a new class
• Use a measure of central tendency for the attribute (mean or median)
• Use the attribute mean or median of all samples belonging to the same class as the given tuple
• Use the most probable value to fill in the missing value, inferred by regression, Bayesian inference, or decision trees
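
Two of the strategies above (attribute mean, and mean within the same class) can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the records, class labels and function names are hypothetical, and None stands in for a missing value.

```python
from statistics import mean

# Hypothetical toy records: (class_label, income); None marks a missing value.
rows = [
    ("low_risk", 50.0),
    ("low_risk", None),
    ("low_risk", 70.0),
    ("high_risk", 20.0),
    ("high_risk", None),
    ("high_risk", 40.0),
]

def fill_with_global_mean(rows):
    """Replace each missing value with the mean of all observed values."""
    observed = [v for _, v in rows if v is not None]
    overall = mean(observed)
    return [(c, overall if v is None else v) for c, v in rows]

def fill_with_class_mean(rows):
    """Replace each missing value with the mean of its own class."""
    by_class = {}
    for c, v in rows:
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    return [(c, class_means[c] if v is None else v) for c, v in rows]

global_filled = fill_with_global_mean(rows)
class_filled = fill_with_class_mean(rows)
```

The class-based fill keeps the two groups apart (60.0 for low_risk, 30.0 for high_risk), whereas the global mean assigns 45.0 to both missing entries.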

Data Cleaning
• Noisy Data
• Noise is a random error or variance in a measured variable.
• Binning: sort the data and partition it into bins
• Smooth by bin means, bin medians, or bin boundaries
• Clustering: detect and remove outliers
• Semi-automated: combined computer and human inspection
• Regression: smooth by fitting the data to a regression function
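
Binning can be sketched as below, assuming equal-frequency bins of three values each. The price list and the function names are illustrative and are not taken from the slides.

```python
from statistics import mean

def equal_frequency_bins(values, bin_size):
    """Sort the values and cut them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the min and max
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative data
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_bounds = smooth_by_boundaries(bins)
```

Smoothing by medians follows the same pattern with statistics.median in place of mean.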

Data Integration
 Data Integration
 Merging of data from multiple data stores.
 Reduce and avoid redundancies and inconsistencies
 Improves the accuracy and speed of the mining process.
 Entity identification problem
 Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton
 Redundant attributes may be detected by correlation analysis and covariance analysis

Correlation Analysis (Nominal Data)
 χ² (chi-square) test
 The larger the χ² value, the more likely the variables are related
 The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
 Correlation does not imply causality
 The number of hospitals and the number of car thefts in a city are correlated
 Both are causally linked to a third variable: population

\[
\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}
\]

Chi-square Calculation: Example

                            Play chess    Not play chess    Sum (row)
Like science fiction        250 (90)      200 (360)         450
Not like science fiction    50 (210)      1000 (840)        1050
Sum (col.)                  300           1200              1500

χ² (chi-square) calculation (numbers in parentheses are expected counts, calculated from the data distribution in the two categories):

\[
\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93
\]

It shows that like_science_fiction and play_chess are correlated in the group.
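
The calculation can be checked with a short script. This is a sketch using only the observed counts from the table above; each expected count is derived as (row sum × column sum) / total, and the function name is an assumption made for illustration.

```python
# Observed counts from the contingency table above.
observed = [
    [250, 200],   # like science fiction: plays chess, does not play chess
    [50, 1000],   # does not like science fiction
]

def chi_square(observed):
    """Return the chi-square statistic and the table of expected counts."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    total = sum(row_sums)
    chi2 = 0.0
    expected = []
    for i, row in enumerate(observed):
        exp_row = []
        for j, o in enumerate(row):
            e = row_sums[i] * col_sums[j] / total  # expected count for cell (i, j)
            exp_row.append(e)
            chi2 += (o - e) ** 2 / e
        expected.append(exp_row)
    return chi2, expected

chi2, expected = chi_square(observed)
print(round(chi2, 2), expected)
```

The expected counts come out as 90, 360, 210 and 840, matching the values in parentheses in the table.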










Correlation Analysis (Numerical Data)
Correlation coefficient (also called Pearson's product-moment coefficient):

\[
r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B}
        = \frac{\sum (AB) - n\,\bar{A}\,\bar{B}}{(n-1)\,\sigma_A \sigma_B}
\]

where n is the number of tuples, \(\bar{A}\) and \(\bar{B}\) are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ(AB) is the sum of the AB cross-product.
If rA,B > 0, A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation.
rA,B = 0: uncorrelated (no linear relationship, which is weaker than independence); rA,B < 0: negatively correlated
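
A minimal sketch of the coefficient, using the cross-product form and assuming sample standard deviations (dividing by n - 1) so that they agree with the (n - 1) in the denominator. The input lists are made-up illustrative data.

```python
from math import sqrt

def pearson_r(a, b):
    """Pearson correlation via (sum(AB) - n*mean_A*mean_B) / ((n-1)*sd_A*sd_B)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Assumption: sample standard deviations, i.e. squared deviations / (n - 1).
    sd_a = sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    cross = sum(x * y for x, y in zip(a, b))  # sum of the AB cross-product
    return (cross - n * mean_a * mean_b) / ((n - 1) * sd_a * sd_b)

a = [1.0, 2.0, 3.0, 4.0, 5.0]
r_pos = pearson_r(a, [2.0, 4.0, 6.0, 8.0, 10.0])   # B rises with A
r_neg = pearson_r(a, [10.0, 8.0, 6.0, 4.0, 2.0])   # B falls as A rises
```

A perfectly increasing linear relationship gives a value of 1 and a perfectly decreasing one gives -1, up to floating-point rounding.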








Data Reduction
 A reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the base data.
 Data cube aggregation
 Dimensionality reduction: reducing the number of random variables or attributes under consideration (e.g. wavelet transforms)
 Numerosity reduction: regression and log-linear models, histograms, clustering, sampling, data cube aggregation
 Data compression

Wavelet Transform
 Data are transformed so that the relative distance between objects is preserved at different levels of resolution
 Used for image compression

Numerosity Reduction
 Reduce data volume by choosing alternative forms of data representation
 Parametric methods (regression)
 Assume the data fits a model; store only the model parameters instead of the actual data
 Linear regression: data modeled to fit a straight line
 Multiple regression: response modeled as a linear function of a multidimensional feature vector
 Log-linear model: approximates discrete multidimensional probability distributions
 Non-parametric methods
 Do not assume a model (histograms, clustering, sampling, ...)

Histograms
 A popular data reduction technique
 Divide the data into buckets (e.g. equal-width or equal-frequency) and store the average (or sum) for each bucket
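
As a sketch of the idea, the function below builds an equal-width histogram and keeps only a count and an average per bucket. The function name, the price list and the choice of three buckets are illustrative assumptions.

```python
def equal_width_histogram(values, num_buckets):
    """Summarize values as (start, end, count, average) per equal-width bucket."""
    lo, hi = min(values), max(values)
    # Fall back to width 1.0 if all values are identical, to avoid dividing by zero.
    width = (hi - lo) / num_buckets or 1.0
    buckets = [[] for _ in range(num_buckets)]
    for v in values:
        # Clamp the index so that the maximum value lands in the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        buckets[idx].append(v)
    summary = []
    for i, b in enumerate(buckets):
        start = lo + i * width
        avg = sum(b) / len(b) if b else None  # empty buckets have no average
        summary.append((start, start + width, len(b), avg))
    return summary

prices = [1, 1, 5, 5, 5, 8, 8, 10, 10, 10, 12, 14, 14, 15, 15, 18, 20, 21, 25, 30]
hist = equal_width_histogram(prices, 3)
```

Twenty raw values are reduced to three bucket summaries, which is the sense in which a histogram acts as a data reduction technique.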

Data Cube Aggregation
 The lowest level of a data cube is the base cuboid
 The cuboid at the highest level of abstraction is the apex cuboid
 Data cubes hold multiple levels of aggregation
 They provide fast access to precomputed, summarized data
 Aggregation reduces the size of the data

Data Transformation
 A pre-processing step
 Data are transformed or consolidated so that the resulting mining process may be more efficient and the patterns found may be easier to understand.
 Smoothing: remove noise from the data (binning, regression and clustering)
 Attribute construction: new attributes constructed from the given ones
 Aggregation: data summarized, e.g. when building a data cube
 Normalization: attribute values scaled to a smaller range (min-max, z-score)
 Discretization: raw numeric values replaced by interval or concept labels (concept hierarchy climbing)
 Concept hierarchy generation for nominal data

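Normalization
 Min-max normalization to [new_min_A, new_max_A]:

\[
v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A
\]

 Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,000 is mapped to

\[
\frac{73{,}000 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0.0) + 0.0 \approx 0.709
\]

 Z-score normalization (μ_A: mean, σ_A: standard deviation of A):

\[
v' = \frac{v - \mu_A}{\sigma_A}
\]

 Normalization by decimal scaling:

\[
v' = \frac{v}{10^{j}}
\]

where j is the smallest integer such that max(|v'|) < 1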
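
The three normalization methods can be sketched as below. The income range and the value $73,000 come from the example on this slide; the mean and standard deviation passed to z_score and the list given to decimal_scaling are hypothetical values chosen only for illustration, and the search for j is assumed to start at 0.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] linearly onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    """Express v as a number of standard deviations from the mean."""
    return (v - mu) / sigma

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest non-negative integer giving max |v'| < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

income = min_max(73_000, 12_000, 98_000)   # example from the slide
z = z_score(73_000, 54_000, 16_000)        # hypothetical mean and std. deviation
scaled, j = decimal_scaling([-986, 917])   # hypothetical attribute values
```

With these inputs, min-max gives roughly 0.709, the z-score is 1.1875, and decimal scaling picks j = 3.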

Data Discretization
 Three types of attributes:
 Nominal: values from an unordered set, e.g., color, profession
 Ordinal: values from an ordered set, e.g., military or academic rank
 Continuous: numeric values, e.g., integer or real numbers
 Discretization:
 Divide the range of a continuous attribute into intervals
 Some classification algorithms only accept categorical attributes.
 Reduce data size by discretization

Data Discretization
 Discretization
 Reduce the number of values for a given continuous attribute by dividing the range of the attribute
into intervals
 Interval labels can then be used to replace actual data values
 Supervised vs. unsupervised
 Split (top-down) vs. merge (bottom-up)
 Discretization can be performed recursively on an attribute
 Concept hierarchy formation
 Recursively reduce the data by collecting and replacing low level concepts (such as numeric values
for age) by higher level concepts (such as young, middle-aged, or senior)
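
Replacing numeric ages with higher-level concepts might look like the sketch below. The slide names the concepts (young, middle-aged, senior) but not their boundaries, so the cut points 30 and 60 are assumptions, as are the identifiers.

```python
from bisect import bisect_right

# Hypothetical cut points: age < 30 is young, 30 to 59 middle-aged, 60 and over senior.
CUT_POINTS = [30, 60]
LABELS = ["young", "middle-aged", "senior"]

def discretize(value, cut_points=CUT_POINTS, labels=LABELS):
    """Return the label of the interval that contains value."""
    # bisect_right counts how many cut points are <= value, which is the interval index.
    return labels[bisect_right(cut_points, value)]

ages = [13, 22, 30, 45, 59, 60, 71]
age_groups = [discretize(a) for a in ages]
```

Applying the same function again with coarser cut points would climb one more level of the concept hierarchy.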

Data Discretization Methods
 Typical methods: All the methods can be applied recursively
 Binning
 Top-down split, unsupervised
 Histogram analysis
 Top-down split, unsupervised
 Clustering analysis (unsupervised, top-down split or bottom-up merge)
 Decision-tree analysis (supervised, top-down split)
 Correlation (e.g., χ²) analysis (supervised, bottom-up merge)