• Data preprocessing is the process that comes before data mining is applied.
• Low-quality data will lead to low-quality mining results.
• So we need to apply data preprocessing techniques such as:
- Data quality
- Data cleaning
- Data integration
- Data reduction
- Data transformation
- Data discretization
• Data have quality if they satisfy the requirements of the intended use.
• There are many factors comprising data quality, including accuracy,
completeness, consistency, timeliness, believability, and interpretability.
• Data cleaning routines attempt to fill in missing values, smooth out
noise while identifying outliers, and correct inconsistencies in the data.
Basic methods of data cleaning:
– Missing values
– Noisy data
– Data cleaning as a process
• Ignore the tuple
• Fill in missing values manually
[time-consuming and often infeasible for large data sets]
• Fill them in automatically with a global constant
[e.g., “Unknown”, ∞]
• Use the most probable value to fill in the missing value [regression,
inference-based tools using a Bayesian formalism, or decision tree induction]
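The strategies above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: missing entries are assumed to be represented as `None`, the function names are hypothetical, and the attribute mean is used only as a simple stand-in for the "most probable value" (a regression, Bayesian, or decision tree model would replace it in practice).

```python
from statistics import mean

def drop_incomplete(rows):
    """'Ignore the tuple': keep only rows that have no missing field."""
    return [row for row in rows if all(v is not None for v in row)]

def fill_with_constant(values, constant="Unknown"):
    """Replace every missing entry (None) with one global constant."""
    return [constant if v is None else v for v in values]

def fill_with_mean(values):
    """Replace missing numeric entries with the mean of the observed ones.

    The mean is an assumed, simple substitute for a predicted value.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to estimate from
    estimate = mean(observed)
    return [estimate if v is None else v for v in values]

incomes = [30, None, 50, 40, None]
filled = fill_with_mean(incomes)
```

Note that a global constant such as “Unknown” can mislead a mining algorithm into treating all filled-in tuples as sharing a meaningful common value, which is one reason a predicted value is usually preferred.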
• Noise is the random error or variance in a measured variable.
• Binning methods smooth a sorted data value by consulting its
“neighborhood”, that is, the values around it.
The sorted values are distributed into a number of “buckets”, or bins.
• Smoothing by bin means:
Each value in a bin is replaced by the mean value of the bin [e.g., the
mean of 4, 8, 15 is 9, so each value in that bin becomes 9].
• Smoothing by bin medians:
Each value in a bin is replaced by the bin median.
• Smoothing by bin boundaries:
The minimum and maximum values in a given bin are identified as
the bin boundaries; each bin value is then replaced by the closest
boundary value.
Binning is also used as a discretization technique.
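The three smoothing variants can be compared side by side in a short sketch. The price list is an illustrative assumption (its first bin matches the 4, 8, 15 example), the bins are equal-frequency with three values each, and ties in boundary smoothing are resolved toward the minimum, which is a choice made here rather than part of the method.

```python
from statistics import mean, median

def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of bin_size."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    """Replace every value in a bin by the bin median."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace each value by the closer of the bin minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are already sorted
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative data
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_medians = smooth_by_medians(bins)
by_boundaries = smooth_by_boundaries(bins)
```

Larger bins give a stronger smoothing effect, since more values are pulled toward a single representative.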
Data smoothing can also be done by regression, a technique that
conforms data values to a function.
– Linear regression involves finding the “best” line to fit two
attributes, so that one attribute can be used to predict the other.
– Multiple linear regression is an extension of linear regression in
which more than two attributes are involved.
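As a rough illustration of smoothing by linear regression, the sketch below fits a least-squares line to two attributes and replaces each observed value with the value on the line. The function names are hypothetical, and the code assumes the predictor values are not all identical (otherwise the slope is undefined).

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y ≈ slope * x + intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)  # must be non-zero
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

def smooth_by_regression(xs, ys):
    """Replace each y by the value the fitted line predicts for its x."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]

xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]  # noisy observations, illustrative only
smoothed = smooth_by_regression(xs, ys)
```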
• Outlier analysis:
Outliers may be detected by clustering, where similar values are
organized into groups, or clusters. Values that fall outside every
cluster may be considered outliers.
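A deliberately simplified sketch of the idea, for a single numeric attribute: sorted values are grouped wherever neighbouring values are close together, and any group that is too small is reported as outliers. The gap-based grouping, the `max_gap` threshold, and the `min_size` parameter are all assumptions chosen for illustration; a real analysis would typically use a proper clustering algorithm.

```python
def gap_clusters(values, max_gap):
    """Group sorted 1-D values; a new cluster starts when the gap
    to the previous value exceeds max_gap."""
    ordered = sorted(values)
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters

def find_outliers(values, max_gap, min_size=2):
    """Values that belong to clusters smaller than min_size."""
    return [v for cluster in gap_clusters(values, max_gap)
            if len(cluster) < min_size
            for v in cluster]

readings = [10, 11, 12, 50, 51, 52, 200]  # illustrative data
suspects = find_outliers(readings, max_gap=5)
```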
• The first step in data cleaning is discrepancy detection
[finding inconsistent data].
• The data should be examined with regard to:
– Unique rule [each value of the given attribute must be different from
all other values of that attribute]
– Consecutive rule [no missing values between the lowest and highest
values of the attribute]
– Null rule [specifies the use of blanks, question marks, special
characters, or other strings that may indicate the null condition]
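The three rules can be checked mechanically. The following is a minimal sketch under stated assumptions: the consecutive rule is checked for integer-valued attributes only, and the set of null markers is a hypothetical example that would be defined per data source.

```python
NULL_MARKERS = {"", "?", "N/A"}  # hypothetical markers; adjust per source

def violates_unique_rule(values):
    """True if any value of the attribute occurs more than once."""
    return len(values) != len(set(values))

def missing_from_consecutive_range(values):
    """Integers between the lowest and highest value that never occur."""
    present = set(values)
    return [v for v in range(min(values), max(values) + 1) if v not in present]

def count_nulls(values, null_markers=NULL_MARKERS):
    """Count entries that are None or match a null marker after trimming."""
    return sum(1 for v in values
               if v is None or str(v).strip() in null_markers)

order_ids = [1, 2, 4, 6]  # illustrative data
gaps = missing_from_consecutive_range(order_ids)
```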
• Use commercial tools
Data scrubbing: use simple domain knowledge (e.g., postal codes,
spell-checking) to detect errors and make corrections
Data auditing: analyze the data to discover rules and relationships, and
detect violators (e.g., use correlation and clustering to find outliers)
• Data migration and integration
Data migration tools: allow transformations to be specified
ETL (Extraction/Transformation/Loading) tools: allow users to
specify transformations through a graphical user interface
• Data integration is the merging of data from multiple sources.
• Careful integration helps avoid or reduce redundancies and
inconsistencies in the resulting data set.
• Schema integration: [integrate metadata from different sources]
• Entity identification problem: [identify real-world entities across
multiple data sources]
• Redundancy analysis: [an attribute may be redundant if it can be
derived from another; this can be detected by correlation analysis]
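For numeric attributes, correlation analysis is commonly done with the Pearson correlation coefficient. The sketch below computes it directly; the attribute names and the 0.9 cut-off are illustrative assumptions, not values given in the material, and the code assumes neither attribute is constant.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    x_dev = sqrt(sum((x - x_mean) ** 2 for x in xs))
    y_dev = sqrt(sum((y - y_mean) ** 2 for y in ys))
    return cov / (x_dev * y_dev)  # undefined if an attribute is constant

def looks_redundant(xs, ys, threshold=0.9):
    """Flag a pair whose absolute correlation exceeds an assumed threshold."""
    return abs(pearson(xs, ys)) >= threshold

price_usd = [1.0, 2.5, 4.0, 7.5]       # illustrative attribute
price_cents = [100, 250, 400, 750]     # same information in another unit
redundant = looks_redundant(price_usd, price_cents)
```

A high correlation only suggests redundancy; it does not show that one attribute causes the other, so the flagged pairs still need to be reviewed.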
• Data reduction techniques are applied to obtain a reduced representation
of the data set.
• Data reduction strategies include:
– Dimensionality reduction:
Remove unimportant attributes.
Its methods include wavelet transforms and principal components
analysis (PCA), which transform the original data onto a smaller
space.
– Numerosity reduction:
Replace the original data volume by alternative, smaller forms of
data representation.
– Data compression:
transformations are applied to obtain a reduced or
“compressed” representation of the original data.
• If the original data can be reconstructed from the compressed data
without any information loss, the data reduction is called “lossless”.
• If we can reconstruct only an approximation of the original data,
the data reduction is called “lossy”.
• Dimensionality reduction and numerosity reduction
techniques can also be considered forms of “data
compression”.
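The difference between lossless and lossy reduction can be seen in a small sketch. The lossless half uses Python's standard `zlib` module; the lossy half uses plain rounding as an assumed, very crude stand-in for a lossy technique. The sample data is invented for illustration.

```python
import zlib

# Lossless: the original bytes are recovered exactly after decompression.
original = b"2019,100;2020,100;2021,100;" * 50
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)
is_lossless = restored == original
saved_bytes = len(original) - len(compressed)

# Lossy: rounding keeps an approximation, and the discarded
# fractional parts cannot be reconstructed afterwards.
readings = [12.3, 18.9, 27.4]
approximation = [round(v) for v in readings]
is_exact = approximation == readings
```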
• Data transformation routines convert the data into appropriate
forms for mining.
• Strategies for data transformation include:
Smoothing: remove noise from the data.
Attribute/feature construction: new attributes are constructed
from the given ones to help the mining process.
Aggregation: summarization, data cube construction (e.g., daily
sales are aggregated to compute monthly or annual total amounts).
Normalization: attribute values are scaled to fall within a smaller,
specified range, e.g., min-max normalization (0.1 to 1.0 or 0.0 to 1.0).
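Min-max normalization maps a value v of an attribute with observed minimum min and maximum max to v' = (v − min) / (max − min) × (new_max − new_min) + new_min. A minimal sketch follows; the function name and the handling of a constant attribute (every value mapped to `new_min`) are assumptions made for this example.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they fall within [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # Constant attribute: no spread to rescale (assumed convention).
        return [new_min for _ in values]
    span = new_max - new_min
    return [(v - old_min) / (old_max - old_min) * span + new_min
            for v in values]

incomes = [10, 20, 30]  # illustrative data
scaled = min_max_normalize(incomes)
```

Because the result depends on the observed minimum and maximum, a later value outside that range would fall outside the target interval.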
• Data discretization transforms numeric data by mapping values to
interval or concept labels.
• Discretization and concept hierarchy generation can also be useful,
where raw values for attributes are replaced by ranges or higher
conceptual levels.
• Raw values of a numeric attribute (e.g., age) are replaced by interval
labels (e.g., 0-10, 11-20, etc.) or higher-level concepts (e.g., youth,
adult, senior).
• Three types of attributes:
– Nominal: values from an unordered set, e.g., color, profession
– Ordinal: values from an ordered set, e.g., military or academic rank
– Numeric: quantitative values, e.g., integer or real numbers
• Discretization divides the range of a continuous attribute into intervals
– Interval labels can then be used to replace actual data values
– Data size is reduced by discretization
– Methods may be supervised or unsupervised
– Methods may split (top-down) or merge (bottom-up)
– Discretization can be performed recursively on an attribute
– It prepares the data for further analysis, e.g., classification
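One simple unsupervised approach is equal-width discretization: the attribute's range is cut into intervals of the same width, and each value is replaced by the label of its interval. The sketch below is illustrative only; the function names, the age data, and the concept labels are assumptions, and each interval is treated as closed on the left and open on the right (the last interval includes the maximum).

```python
from bisect import bisect_right

def equal_width_cut_points(values, n_intervals):
    """Interior cut points that split [min, max] into equal-width intervals."""
    low, high = min(values), max(values)
    width = (high - low) / n_intervals
    return [low + i * width for i in range(1, n_intervals)]

def discretize(values, cut_points, labels):
    """Map each value to the label of the interval it falls into.

    cut_points must be sorted and len(labels) == len(cut_points) + 1.
    """
    return [labels[bisect_right(cut_points, v)] for v in values]

ages = [0, 15, 20, 39, 40, 60]  # illustrative data
cuts = equal_width_cut_points(ages, 3)
concepts = discretize(ages, cuts, ["youth", "adult", "senior"])
```

Supervised methods differ in that they use class information to choose the cut points instead of relying on the attribute's range alone.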
Although numerous methods of data preprocessing have been
developed, data preprocessing remains an active area of research,
due to the huge amount of inconsistent or dirty data and the
complexity of the problem.