Pre-Processing and Data Preparation
Big Data & IoT
Umair Shafique (03246441789)
Scholar MS Information Technology - University of Gujrat
Data Pre-Processing
Learning Objectives
• Understand what data pre-processing is.
• Understand why pre-process the data.
• Understand how to clean the data.
• Understand how to integrate and transform the data.
• Data:
• Text
• Images
• Audio
• Video
• Data pre-processing is the process of converting raw
data into meaningful data using various
techniques.
Terms Used in Pre-processing
• Incomplete:
missing data, lacking attribute values, lacking certain attributes
of interest, or containing only aggregate data
• Noisy:
containing errors, such as measurement errors, or outliers
• Inconsistent:
containing discrepancies in codes or names
• Distorted:
sampling distortion (a change for the worse)
Steps of Pre-processing
• Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies and errors
• Data integration
Integration of multiple databases, data cubes, or files
• Data transformation
Normalization and aggregation
• Data reduction
Obtains a reduced representation that is much smaller in volume but produces the same
or similar analytical results
• Data discretization
Part of data reduction but with particular importance, especially for numerical data
Data Cleaning
• Data cleaning tasks
i. Fill in missing values
ii. Identify outliers and smooth out noisy data
iii. Correct inconsistent data
Data Cleaning
1. Missing Data
• Data is not always available
a. E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
a. equipment malfunction
b. inconsistent with other recorded data and thus
deleted
c. data not entered due to misunderstanding
d. certain data may not be considered important at the
time of entry
e. history or changes of the data were not registered
f. Missing data may need to be inferred.
How to Handle Missing Data?
• Ignore the tuple; usually done when the class label is missing
(assuming the task is classification; not effective when the
percentage of missing values per attribute varies considerably)
• Fill in the missing value manually
• Use a global constant to fill in the missing value: e.g.,
“unknown”, a new class
• Use the attribute mean to fill in the missing value
• Use the attribute mean for all samples belonging to the same
class to fill in the missing value: smarter
• Use the most probable value to fill in the missing value:
inference-based such as Bayesian formula or decision tree
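A minimal sketch of these strategies with pandas (the income and class columns are hypothetical examples, not from the slides):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [50_000, np.nan, 62_000, np.nan, 48_000, 71_000],
    "class":  ["A", "A", "B", "B", "A", "B"],
})

# Ignore the tuple: drop rows whose income is missing.
dropped = df.dropna(subset=["income"])

# Fill with a global constant (a new "unknown" marker).
constant = df["income"].fillna(-1)

# Fill with the overall attribute mean.
overall = df["income"].fillna(df["income"].mean())

# Fill with the mean of samples in the same class (smarter).
per_class = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))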
2. Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems that require data cleaning
– duplicate records
– inconsistent data
How to Handle Noisy Data?
• Binning method:
- first sort data and partition into (equi-depth) bins
- then one can smooth by bin means, smooth by bin median, smooth by
bin boundaries, etc.
• Clustering
- detect and remove outliers
• Regression
- smooth by fitting the data into regression functions
• Combined computer and human inspection
- detect suspicious values and check by human
Binning Methods for Data Smoothing
* Sorted data for price (in dollars):
4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
1. Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
2. Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
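The worked example above can be reproduced with a short numpy sketch (rounding bin means to integers mirrors the slide):

import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])  # already sorted
bins = prices.reshape(3, 4)  # three equi-depth bins of four values each

# Smoothing by bin means: every value becomes its bin's (rounded) mean.
by_means = np.repeat(np.round(bins.mean(axis=1)).astype(int), 4).reshape(3, 4)

# Smoothing by bin boundaries: every value moves to the closer bin boundary.
lo, hi = bins[:, :1], bins[:, -1:]
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)

print(by_means)   # [[ 9  9  9  9] [23 23 23 23] [29 29 29 29]]
print(by_bounds)  # [[ 4  4  4 15] [21 21 25 25] [26 26 26 34]]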
Data Integration
• Data integration:
• combines data from multiple sources into a coherent store
• Schema integration
• integrate metadata from different sources
• entity identification problem: identify real world entities from multiple data
sources, e.g., A.cust-id ≡ B.cust-#
• Detecting and resolving data value conflicts
• for the same real world entity, attribute values from different sources are different
• possible reasons: different representations, different scales, e.g., metric vs. British
units
Handling Redundant Data in Data Integration
• Redundant data often occur when integrating multiple
databases
– The same attribute may have different names in different databases
– One attribute may be a “derived” attribute in another table, e.g., annual revenue
• Redundant data may be detected by correlation
analysis
• Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve
mining speed and quality
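A sketch of correlation analysis for redundancy detection with pandas (the attribute names and the 0.9 threshold are illustrative assumptions):

import pandas as pd

df = pd.DataFrame({
    "monthly_revenue": [10, 20, 30, 40, 50],
    "annual_revenue":  [120, 240, 360, 480, 600],   # derived: 12 * monthly
    "num_employees":   [3, 1, 4, 2, 5],
})

corr = df.corr()  # Pearson correlation matrix
# Flag attribute pairs whose absolute correlation exceeds the threshold.
for a in corr.columns:
    for b in corr.columns:
        if a < b and abs(corr.loc[a, b]) > 0.9:
            print(f"{a} and {b} look redundant (r = {corr.loc[a, b]:.2f})")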
Data Transformation
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
• Attribute/feature construction
• New attributes constructed from the given ones
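A minimal numpy sketch of the three normalization methods (the sample values and the [0, 1] target range are illustrative):

import numpy as np

x = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization: rescale into a target range, here [0, 1].
new_min, new_max = 0.0, 1.0
minmax = (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

# Z-score normalization: zero mean and unit standard deviation.
zscore = (x - x.mean()) / x.std()

# Decimal scaling: divide by 10**j, the smallest power of ten
# that brings every absolute value below 1.
j = int(np.floor(np.log10(np.abs(x).max()))) + 1
decimal = x / 10**j

print(minmax)   # [0.    0.125 0.25  0.5   1.   ]
print(decimal)  # [0.02 0.03 0.04 0.06 0.1 ]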
Data Reduction
• Data reduction techniques can be applied to obtain a reduced
representation of the data set that is much smaller in volume,
yet closely maintains the integrity of the original data. That is,
mining on the reduced data set should be more efficient yet
produce the same (or almost the same) analytical results.
Data Reduction Strategies
1. Data cube aggregation
2. Dimension reduction
3. Data compression
4. Numerosity reduction
5. Discretization and concept hierarchies
1- Data Cube Aggregation
Data cube aggregation, where aggregation operations are
applied to the data in the construction of a data cube.
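For example, quarterly sales can be aggregated into annual totals, climbing one level of the cube; a pandas sketch with invented figures:

import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2023, 2024, 2024, 2024, 2024],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [224, 408, 350, 586, 310, 430, 390, 610],
})

# Aggregate away the quarter dimension: one total per year.
annual = sales.groupby("year", as_index=False)["amount"].sum()
print(annual)
#    year  amount
# 0  2023    1568
# 1  2024    1740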
2- Dimension Reduction
• Dimension Reduction, where irrelevant, weakly relevant, or
redundant attributes or dimensions may be detected and removed.
• The goal is to find a minimum set of attributes such that the
resulting probability distribution of the data classes is as close as
possible to the original distribution obtained using all attributes.
• Mining on a reduced set of attributes has an additional benefit. It
reduces the number of attributes appearing in the discovered
patterns, helping to make the patterns easier to understand.
2.1- Methods of Dimension Reduction
Basic heuristic methods of attribute (dimension) reduction include the
following techniques:
o Stepwise forward selection
The procedure starts with an empty set of attributes as the reduced set. The
best of the original attributes is determined and added to the reduced set. At
each subsequent iteration or step, the best of the remaining original attributes
is added to the set.
o Stepwise backward elimination
The procedure starts with the full set of attributes. At each step, it removes
the worst attribute remaining in the set.
o Combination of forward selection and backward elimination
The stepwise forward selection and backward elimination methods can be
combined so that, at each step, the procedure selects the best attribute and
removes the worst from among the remaining attributes.
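A sketch of stepwise forward selection, assuming scikit-learn and using cross-validated accuracy of a logistic regression as the "best attribute" criterion (the dataset, model, and stopping rule are illustrative assumptions):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def cv_score(columns, X, y):
    # Mean 5-fold cross-validated accuracy using only the given columns.
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, columns], y, cv=5).mean()

X, y = load_iris(return_X_y=True)
selected = []                          # the reduced set: starts empty
remaining = set(range(X.shape[1]))

while remaining:
    # Score every remaining attribute when added to the reduced set.
    scores = {f: cv_score(selected + [f], X, y) for f in remaining}
    best = max(scores, key=scores.get)
    # Greedy stopping rule: quit when the best addition no longer helps.
    if selected and scores[best] <= cv_score(selected, X, y):
        break
    selected.append(best)
    remaining.remove(best)

print("selected attribute indices:", selected)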
3- Data Compression
• In data compression, encoding mechanisms are used to reduce the
data set size.
• Types of compressions:
• Lossless Compression
Encoding techniques such as run-length encoding allow a simple
and minimal reduction in data size. Lossless compression
algorithms restore the precise original data from the compressed
data.
• Lossy Compression
Methods such as the discrete wavelet transform and PCA
(principal component analysis) are examples of this compression.
For example, the JPEG image format uses lossy compression, but
the result remains visually equivalent to the original image. In
lossy compression, the decompressed data may differ from the
original data but are still useful enough to retrieve information.
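A minimal sketch of run-length encoding, the lossless technique named above; decoding restores the exact original, which is the defining property of lossless compression:

from itertools import groupby

def rle_encode(data: str) -> list[tuple[str, int]]:
    # Collapse each run of repeated symbols into a (symbol, count) pair.
    return [(ch, len(list(run))) for ch, run in groupby(data)]

def rle_decode(pairs: list[tuple[str, int]]) -> str:
    # Expand the pairs back into the exact original string.
    return "".join(ch * count for ch, count in pairs)

raw = "AAAABBBCCDAA"
encoded = rle_encode(raw)          # [('A', 4), ('B', 3), ('C', 2), ('D', 1), ('A', 2)]
assert rle_decode(encoded) == raw  # lossless: the original is fully restored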
4- Numerosity Reduction
• Here, the data are replaced or estimated by alternative, smaller data
representations
• Techniques:
• Parametric methods
A model is used to estimate the data, so that typically only
the data parameters need to be stored, instead of the actual
data. (Outliers may also be stored.) Log-linear models, which
estimate discrete multidimensional probability distributions,
are an example
• Non-parametric methods
These store reduced representations of the data, including
histograms, clustering, and sampling.
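A sketch of the parametric idea using simple linear regression on synthetic data (the slides name log-linear models, but any fitted model illustrates the point): ten thousand points are replaced by two stored parameters.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=10_000)
y = 3.2 * x + 7.0 + rng.normal(0, 2.0, size=x.size)   # noisy linear data

# Parametric reduction: fit a model and keep only its parameters.
slope, intercept = np.polyfit(x, y, deg=1)

# 10,000 (x, y) pairs are now represented by just two numbers.
print(f"slope={slope:.2f}, intercept={intercept:.2f}")  # approx. 3.2 and 7.0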
5- Discretization and Concept Hierarchy
Discretization:
• Data discretization techniques are used to divide continuous
attributes into intervals
• Types of discretization:
• Top-down discretization
If you first pick one or a few points (so-called breakpoints or
split points) to divide the whole range of attribute values, and
then repeat this process on the resulting intervals, the process is
known as top-down discretization, also called splitting.
• Bottom-up discretization
If you first consider all the continuous values as potential split
points and then discard some by merging neighbouring values into
intervals, the process is called bottom-up discretization, also
called merging.
5- Discretization and Concept Hierarchy
Concept Hierarchy:
• A concept hierarchy reduces the data size by collecting low-level
concepts and replacing them with higher-level concepts
• Techniques:
1. Binning
Binning is the process of changing numerical variables into
categorical counterparts. The number of categorical counterparts
depends on the number of bins specified by the user.
2. Histogram Analysis
Used to partition the values of an attribute X into disjoint ranges
called brackets
5- Discretization and Concept Hierarchy
• Histogram partitioning rules:
• Equal frequency partitioning
Partitioning the values so that each bucket contains roughly the
same number of occurrences from the data set.
• Equal width partitioning
Partitioning the values into ranges of fixed width determined by
the number of bins, e.g., buckets each spanning a range such as 0-20.
• Clustering
Grouping similar data together
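Both partitioning rules are one-liners in pandas: pd.cut produces equal-width buckets and pd.qcut produces (approximately) equal-frequency buckets. A minimal sketch with invented ages:

import pandas as pd

ages = pd.Series([3, 7, 12, 18, 22, 25, 31, 40, 47, 55, 60, 78])

# Equal width: each of the 4 bins spans the same range of values.
equal_width = pd.cut(ages, bins=4)

# Equal frequency: each of the 4 bins holds roughly the same number of values.
equal_freq = pd.qcut(ages, q=4)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())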
Benefits of data preparation
1. Ensure the data used in analytics applications produces reliable
results;
2. Identify and fix data issues that otherwise might not be detected;
3. Enable more informed decision-making by business executives
and operational workers;
4. Reduce data management and analytics costs;
5. Avoid duplication of effort in preparing data for use in multiple
applications; and
6. Get a higher ROI from BI and analytics initiatives.
Summary
• Data preparation or preprocessing is a big issue for both data
warehousing and data mining
• Descriptive data summarization is needed for quality data
preprocessing
• Data preparation includes
Data cleaning and data integration
Data transformation and reduction
Discretization
• Many methods have been developed, but data preprocessing is
still an active area of research