Chapter 2: Data Preprocessing
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
Data Cleaning
Data cleaning tasks attempt to
• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
• Resolve redundancy caused by data integration
Missing Values
Different ways to fill missing values are:
1. Ignore the tuple:
• Usually done when class label is missing
• Not effective when the percentage of missing values per attribute varies considerably.
2. Fill in the missing value manually: When there is a large data set with many missing
values, this approach is time-consuming and may not be feasible.
3. Use a global constant to fill in the missing value: If all missing values are replaced by a
constant such as "unknown", then the mining program may mistakenly think that they form an
interesting concept. So this method is simple but not foolproof.
4. Use the attribute mean to fill in the missing value: For example, use the average income value
to replace the missing value for income.
5. Use the attribute mean for all samples belonging to the same class:
For example, if classifying customers according to credit risk, replace the missing value with the
average income value for customers in the same credit risk category as that of the given tuple.
6. Use the most probable value to fill in the missing value: For example, using the other
customer attributes in the data set, construct a decision tree to predict the missing values for income.
The missing value may also be determined with regression or with inference-based tools using a Bayesian formalism.
Methods 3 to 6 bias the data and the filled-in value may not be correct. Method 6 is a popular
strategy as it preserves relationships between income and the other attributes.
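Methods 4 and 5 can be sketched with pandas; the column names and values below are hypothetical and chosen only for illustration:

```python
import pandas as pd

# Hypothetical customer data with a missing income value
df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high"],
    "income": [52000, 48000, 31000, None],
})

# Method 4: fill with the overall attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Method 5: fill with the mean of tuples in the same class (here, credit_risk)
df["income_class_mean"] = df.groupby("credit_risk")["income"].transform(
    lambda s: s.fillna(s.mean())
)
print(df)
```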
Although data is cleaned after it is collected, data entry procedures should also help minimize the number
of missing values, for example by allowing respondents to specify values such as "not applicable" in forms and
by ensuring that each attribute has one or more rules regarding the null condition.
Noisy Data
Noise is a random error or variance in a measured variable.
Different data smoothing techniques are as follows:
• Binning
• Regression
• Clustering
Binning:
1. First sort the data and partition it into
• Equal-frequency bins – each bin contains the same number of values.
(or)
• Equal-width bins – the interval range of values in each bin is constant.
Some binning techniques are
• Smoothing by bin means - each value in a bin is replaced by the mean value of the bin.
• Smoothing by bin medians - each bin value is replaced by the bin median.
• Smoothing by bin boundaries - the minimum and maximum values in a given bin are
the bin boundaries. Each bin value is then replaced by the closest boundary value.
For Example, Consider sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
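A minimal plain-Python sketch that reproduces the example above (assuming equal-frequency bins of four values each):

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]  # equal-frequency bins of 4

# Smoothing by bin means: replace each value by the mean of its bin
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: replace each value by the closer of the bin min/max
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```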
Regression: Data can be smoothed by fitting the data to a function.
Linear regression involves finding the "best" line to fit two attributes (or variables), so that one
attribute can be used to predict the other.
Multiple linear regression is an extension of linear regression, where more than two attributes are
involved and the data are fit to a multidimensional surface.
Clustering: Outliers may be detected by clustering, where similar values are organized into groups,
or "clusters." Values that fall outside of the set of clusters may be considered outliers.
Data Cleaning as a Process: Data cleaning is an iterative two-step process of discrepancy detection and data
transformation.
Discrepancies can be caused by several factors, including poorly designed data entry forms,
human error in data entry, deliberate errors, data decay (e.g., outdated addresses), and errors
introduced by data integration. Inconsistent use of codes and representations (such as
"2004/12/25" versus "25/12/2004") and field overloading are other sources of error.
Knowledge about the domain and data type of each attribute, the acceptable values, the expected
range, and dependencies between attributes can be used to detect discrepancies.
The data should also be examined with respect to
Unique rule: each value of the given attribute must be different from all other values for that
attribute.
Consecutive rule: there can be no missing values between the lowest and highest values for the attribute, and all values must be unique.
Null rule: specifies the use of blanks, question marks, or special characters that indicate the null condition.
Tools that aid in the step of discrepancy detection are
Data scrubbing tools: use domain knowledge and rely on parsing and fuzzy matching
techniques.
Data auditing tools: analyze the data to discover rules and relationships, and detect data that
violate such conditions.
Tools that assist in data transformation are
Data migration tools: allow simple transformations to be specified, such as replacing the string
"gender" by "sex".
ETL (extraction/transformation/loading) tools: allow users to specify transformations through a GUI.
Some nested discrepancies may only be detected after others have been fixed.
Data integration and transformation
Data mining requires data integration – the merging of data from multiple data stores. The data also
need to be transformed into forms appropriate for mining.
Data integration
Issues to consider during data integration are schema integration and object matching. For
example, how can the data analyst or the computer be sure that customer_id in one database and
cust_number in another refer to the same attribute? This is known as the entity identification
problem.
Metadata can be used to help avoid errors in schema integration.
Redundancy is another issue. An attribute (such as annual revenue, for instance) may be redundant
if it can be “derived" from another attribute or set of attributes. The use of denormalized tables is
another source of data redundancy. Some redundancies can be detected by correlation analysis.
Correlation analysis can measure how strongly one attribute implies the other.
For numerical attributes
The correlation between two attributes, A and B, can be evaluated by computing the correlation coefficient (also
known as Pearson's product moment coefficient, named after its inventor, Karl Pearson):
r(A,B) = Σ (a_i - mean_A)(b_i - mean_B) / (n σ_A σ_B)
where n is the number of tuples, mean_A and mean_B are the mean values of A and B, and σ_A and σ_B are their standard deviations.
If the resulting value is greater than 0, then A and B are positively correlated; the higher the value, the stronger the correlation.
If the resulting value is equal to 0, then A and B are independent and there is no correlation between
them.
If the resulting value is less than 0, then A and B are negatively correlated.
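For illustration, the coefficient can be computed either directly from the formula or with NumPy's built-in routine; the attribute values below are made up:

```python
import numpy as np

# Hypothetical paired observations of numerical attributes A and B
A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
B = np.array([1.0, 3.0, 5.0, 9.0, 12.0])

# Pearson coefficient from the formula: sum((a - mean_A)(b - mean_B)) / (n * std_A * std_B)
n = len(A)
r_formula = ((A - A.mean()) * (B - B.mean())).sum() / (n * A.std() * B.std())

# The same value using NumPy's correlation matrix
r_numpy = np.corrcoef(A, B)[0, 1]
print(r_formula, r_numpy)  # both close to +1: a strong positive correlation
```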
For categorical (discrete) data, a correlation relationship between two attributes, A and B, can be discovered by a χ2 (chi-square) test.
Suppose that a group of 1,500 people was surveyed. The gender of each person was noted, along with
whether their preferred type of reading material was fiction or nonfiction. The observed frequency (or count) of
each possible joint event was summarized in a contingency table.
The χ2 test is based on a significance level, with (r - 1) x (c - 1) degrees of freedom.
For this 2 x 2 table, the degrees of freedom are (2 - 1)(2 - 1) = 1. For 1 degree of freedom, the chi-square
value needed to reject the hypothesis of independence at the 0.001 significance level is 10.828 (taken from the table of
upper percentage points of the chi-square distribution).
Since the computed χ2 value is above this threshold, we conclude that the two attributes are (strongly) correlated for
the given group of people.
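A sketch of the χ2 computation with SciPy; the cell counts below are illustrative values that total 1,500 and are not necessarily the exact survey counts:

```python
from scipy.stats import chi2_contingency

# Illustrative 2 x 2 contingency table (rows: male, female; columns: fiction, nonfiction)
observed = [[250, 50],
            [200, 1000]]

# correction=False disables the Yates continuity correction, giving the plain chi2 statistic
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, dof, p)  # chi2 far exceeds 10.828 at 1 degree of freedom,
                     # so the two attributes would be judged strongly correlated
```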
A third important issue in data integration is the detection and resolution of data value
conflicts.
Example 1: a weight attribute may be stored in metric units in one system and British imperial units in
another.
Example 2:the total sales in one database may refer to one branch of All Electronics, while an
attribute of the same name in another database may refer to the total sales for All Electronics stores
in a given region.
The semantic heterogeneity and structure of data also pose great challenges in data integration.
Data Transformation
In data transformation, the data are transformed or consolidated into forms appropriate for mining.
Data transformation can involve the following:
1. Smoothing, which works to remove noise from the data. Such techniques include binning,
regression, and clustering.
2. Aggregation, where summary or aggregation operations are applied to the data. For example,
the daily sales data may be aggregated so as to compute monthly and annual total amounts. This
step is typically used in constructing a data cube for analysis of the data at multiple granularities.
3. Generalization of the data, where low-level or "primitive" (raw) data are replaced by higher-
level concepts through the use of concept hierarchies. For example, categorical attributes, like street,
can be generalized to higher-level concepts, like city or country. Similarly, values for numerical
attributes, like age, may be mapped to higher-level concepts, like youth, middle-aged, and senior.
4. Normalization, where the attribute data are scaled so as to fall within a small specified range,
such as -1.0 to 1.0 or 0.0 to 1.0.
5. Attribute construction (or feature construction), where new attributes are constructed and
added from the given set of attributes to help the mining process.
There are many methods for data normalization and three of them are :
• Min-max normalization,
• Z-score normalization and
• Normalization by decimal scaling.
Min-max normalization performs a linear transformation on the original data. Suppose min_A and max_A
are the minimum and maximum values of attribute A. A value v of A is mapped to v' in the new range
[new_min_A, new_max_A] by computing
v' = ((v - min_A) / (max_A - min_A)) (new_max_A - new_min_A) + new_min_A
Min-max normalization preserves the relationships among the original data values. It will encounter
an "out of bounds" error if a future input case for normalization falls outside of the original data range.
In z-score normalization (or zero-mean normalization), the values for an attribute, A, are
normalized based on the mean and standard deviation of A: a value v of A is mapped to
v' = (v - mean_A) / σ_A
This method of normalization is useful when the actual minimum and maximum of attribute A are
unknown, or when there are outliers that dominate the min-max normalization.
Normalization by decimal scaling normalizes by moving the decimal point of values of attribute
A. The number of decimal points moved depends on the maximum absolute value of A. A value, v, of
A is normalized to
v' = v / 10^j
where j is the smallest integer such that max(|v'|) < 1.
It is also necessary to save the normalization parameters (such as the mean and standard deviation
if using z-score normalization) so that future data can be normalized in a uniform manner.
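A brief NumPy sketch of the three methods; the attribute values and the target range [0.0, 1.0] for min-max are assumptions for illustration:

```python
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 986.0])   # hypothetical values of attribute A

# Min-max normalization to the new range [new_min, new_max]
new_min, new_max = 0.0, 1.0
minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization: save the mean and std so future data can be treated uniformly
mean, std = v.mean(), v.std()
zscore = (v - mean) / std

# Decimal scaling: divide by 10^j, where j is the smallest integer with max(|v'|) < 1
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
decimal = v / 10 ** j

# Applying the saved z-score parameters to a new, future value
new_value_normalized = (500.0 - mean) / std
print(minmax, zscore, decimal, new_value_normalized)
```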
In attribute construction, new attributes are constructed from the given attributes and added in
order to help improve the accuracy and understanding of structure in high-dimensional data. For
example, we may wish to add the attribute area based on the attributes height and width. By
combining attributes, attribute construction can discover missing information about the relationships
between data attributes that can be useful for knowledge discovery.
Data Reduction
Data reduction obtains a reduced representation of the data set that is much smaller in
volume, yet produces the same (or almost the same) analytical results.
Strategies for data reduction include the following:
• Data cube aggregation
• Attribute subset selection
• Dimensionality reduction
• Numerosity reduction
• Discretization and concept hierarchy generation
1. Data Cube Aggregation
Consider AllElectronics sales per quarter, for the years 2002 to 2004 for analysis.
If you are interested in the annual sales (total per year), rather than the total per quarter, the data
can be aggregated so that the resulting data summarize the total sales per year instead of per quarter.
• Data cubes store multidimensional aggregated information.
• Data cubes are created for varying levels of abstraction.
• Each higher level of abstraction further reduces the resulting data size.
• A cube at the highest level of abstraction is the apex cuboid. For the sales data, the apex
cuboid would give the total sales for all three years, for all item types, and for all branches.
When replying to data mining requests, the smallest available cuboids relevant to the given task
should be used.
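The same roll-up from quarterly to annual totals can be sketched with a pandas group-by; the figures below are made up:

```python
import pandas as pd

# Hypothetical quarterly sales for one branch
sales = pd.DataFrame({
    "year":    [2002, 2002, 2002, 2002, 2003, 2003, 2003, 2003],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [224000, 408000, 350000, 586000, 330000, 396000, 312000, 512000],
})

# Aggregate quarterly sales up to annual totals (a coarser level of the cube)
annual = sales.groupby("year", as_index=False)["amount"].sum()
print(annual)
```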
2. Attribute subset selection
Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes.
Heuristic methods are commonly used for attribute subset selection.
Basic heuristic methods of attribute subset selection include the following techniques:
1. Stepwise forward selection:
• The procedure starts with an empty set of attributes.
• At each subsequent iteration or step, the best of the remaining original attributes is
added to the set.
2. Stepwise backward elimination:
• The procedure starts with the full set of attributes.
• At each step, it removes the worst attribute remaining in the set.
3. Combination of forward selection and backward elimination:
• At each step, the procedure selects the best attribute and removes the worst from among the
remaining attributes.
4. Decision tree induction:
It constructs a flow-chart-like structure where
Each internal (nonleaf) node denotes a test on an attribute,
Each branch corresponds to an outcome of the test,
Each external (leaf) node denotes a class prediction.
The set of attributes appearing in the tree form the reduced subset of attributes.
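A minimal sketch of stepwise forward selection; the scoring function here is a hypothetical stand-in for whatever attribute-evaluation measure (e.g., information gain from decision tree induction) is actually used:

```python
def forward_selection(attributes, score, max_attrs):
    """Greedy stepwise forward selection.

    attributes: list of candidate attribute names
    score: function mapping a list of attributes to a quality value (higher is better)
    """
    selected = []
    while len(selected) < max_attrs and len(selected) < len(attributes):
        remaining = [a for a in attributes if a not in selected]
        best = max(remaining, key=lambda a: score(selected + [a]))
        # Stop if adding the best remaining attribute does not improve the score
        if selected and score(selected + [best]) <= score(selected):
            break
        selected.append(best)
    return selected

# Toy scoring function that prefers a known "good" subset and penalizes extra attributes
good = {"income", "age"}
subset = forward_selection(
    ["income", "age", "street", "phone"],
    score=lambda attrs: len(good & set(attrs)) - 0.01 * len(attrs),
    max_attrs=3,
)
print(subset)  # ['income', 'age']
```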
3. Dimensionality Reduction
Data encoding or transformations are applied for data reduction and compression.
Data reduction is
• Lossless data reduction: If the original data can be reconstructed from the compressed
data without any loss of information.
• Lossy data reduction: If we can reconstruct only an approximation of the original data.
There are two popular and effective methods of lossy reduction:
• Wavelet transforms and
• Principal components analysis.
Wavelet Transforms
When the discrete wavelet transform (DWT) is applied to a data vector X, it transforms it into a
numerically different vector, X′, of wavelet coefficients. The two vectors are of the
same length, but the wavelet-transformed data can be truncated. Given a set of coefficients, an
approximation of the original data can be constructed by applying the inverse of the DWT used.
There are several families of DWTs.
Popular wavelet transforms include
• Haar 2,
• Daubechies 4 and
• Daubechies 6 .
Wavelet transforms can be applied to multidimensional data, such as a data cube.
This is done by first applying the transform to the first dimension, then to the second, and so on.
Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes.
Wavelet transforms are more suitable for data of high dimensionality
Principal Components Analysis
Principal components analysis, or PCA (also called the Karhunen-Loeve, or K-L method),
searches for k n-dimensional orthogonal vectors that can best be used to represent the data,
where k ≤ n.
The basic procedure is as follows:
1. The input data are normalized, so that each attribute falls within the same range.
2. PCA computes k orthonormal unit vectors that provide a basis for the normalized input data.
These vectors are referred to as the principal components.
3. The principal components are sorted in order of decreasing "significance" or strength. This
transformation is defined in such a way that the first principal component has the largest
possible variance (that is, accounts for as much of the variability in the data as possible),
and each succeeding component in turn has the highest variance possible under the
constraint that it be orthogonal to (i.e., uncorrelated with) the preceding components.
4. Since the components are sorted in decreasing order of "significance," the size of the
data can be reduced by eliminating the weaker components.
Advantages of PCA are
• It is computationally inexpensive
• It can be applied to ordered and unordered attributes
• It can handle sparse data and skewed data.
• Multidimensional data of more than two dimensions can be handled.
• Principal components may be used as inputs to multiple regression and cluster
analysis.
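A compact NumPy sketch of the basic procedure on randomly generated illustration data (normalize, compute components from the covariance matrix, keep the k strongest):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 tuples, n = 5 attributes (illustrative)

# 1. Normalize each attribute (here via z-score) so all fall within comparable ranges
Xn = (X - X.mean(axis=0)) / X.std(axis=0)

# 2-3. Principal components: eigenvectors of the covariance matrix,
#      sorted by decreasing eigenvalue (variance explained)
cov = np.cov(Xn, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order]

# 4. Reduce the data by projecting onto the k strongest components
k = 2
X_reduced = Xn @ components[:, :k]
print(X_reduced.shape)  # (100, 2)
```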
4. Numerosity Reduction
Numerosity reduction reduces the data volume by choosing alternative, "smaller" forms of data representation.
These techniques can be
• Parametric
• Non-parametric
Parametric methods
In parametric methods, a model is used to estimate the data, so that only the data parameters need
be stored, instead of the actual data.
Ex: Regression and Log-linear models
Regression and Log-linear models
Regression and log-linear models can be used to approximate the given data.
Linear regression
For example, a random variable, y (called a response variable), can be modeled as a linear function
of another random variable, x (called a predictor variable), with the equation
y = wx + b
• x and y are numerical database attributes.
• w and b (called regression coefficients) specify the slope of the line and the y-intercept.
These coefficients can be solved for by the method of least squares.
Multiple linear regression allows a response variable, y, to be modeled as a linear function of two
or more predictor variables.
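A short sketch of fitting y = wx + b by least squares with NumPy; x and y below are made-up numeric attributes:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # predictor attribute (illustrative)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])        # response attribute (illustrative)

# Solve for the regression coefficients w (slope) and b (intercept) by least squares
w, b = np.polyfit(x, y, deg=1)
print(w, b)

# Only w and b need to be stored; y can then be approximated as w * x + b
y_hat = w * x + b
```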
Log-linear models approximate discrete multidimensional probability distributions.
Given a set of tuples in n dimensions (i.e., n attributes), each tuple can be considered as a point in
n-dimensional space.
Log-linear models are used to estimate the probability of each point in multidimensional space for a
set of discretized attributes, based on a smaller subset of dimensional combinations.
Advantages and limitations of Regression and Log-Linear models
• Regression can be computationally intensive when applied to high-dimensional data.
• Regression can handle skewed data exceptionally well.
• Regression and log-linear models can both be used on sparse data although their application
may be limited.
• Log-linear models show good scalability for up to 10 or so dimensions.
• Log-linear models are also useful for dimensionality reduction and data smoothing.
Non Parametric methods
Nonparametric methods for storing reduced representations of the data include histograms,
clustering, and sampling.
Histograms
A histogram for an attribute, A, partitions the data distribution of A into disjoint subsets, or
buckets.
Example: The following data are a list of prices of commonly sold items at AllElectronics.
The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15,
15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28,
30, 30, 30.
There are several partitioning rules,including the following:
• Equal-width: In an equal-width histogram, the width of each bucket range is uniform
• Equal-frequency (or equidepth): the buckets are created so that, roughly, the frequency
of each bucket is constant
• V-Optimal: In all possible histograms, the V-Optimal histogram is the one with the least
variance.
• MaxDiff: The difference between each pair of adjacent values is considered. A bucket
boundary is established between each pair of adjacent values for the β - 1 pairs having the largest
differences, where β is the user-specified number of buckets.
V-Optimal and MaxDiff histograms tend to be the most accurate and practical.
Histograms are highly effective at approximating both sparse and dense data, as well as highly skewed and
uniform data.
Multidimensional histograms can capture dependencies between attributes and are effective in
approximating data with up to 5 attributes.
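A sketch of equal-width versus equal-frequency bucket boundaries for the price list above, using NumPy and assuming three buckets:

```python
import numpy as np

# The sorted AllElectronics price list from the example, as (value, count) pairs
values = [1, 5, 8, 10, 12, 14, 15, 18, 20, 21, 25, 28, 30]
counts = [2, 5, 2, 4, 1, 3, 6, 8, 7, 4, 5, 2, 3]
prices = np.repeat(values, counts)

# Equal-width histogram: three buckets of uniform width over [1, 30]
frequencies, edges = np.histogram(prices, bins=3)
print(edges, frequencies)

# Equal-frequency (equidepth) buckets: boundaries at the 1/3 and 2/3 quantiles
boundaries = np.quantile(prices, [0.0, 1/3, 2/3, 1.0])
print(boundaries)
```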
Clustering
• Partition the data set into clusters based on similarity, and store only the cluster representation (e.g.,
centroid and diameter).
• Similarity is commonly defined in terms of how "close" the objects are in space, based on a distance function.
• The "quality" of a cluster may be represented by its
 Diameter - the maximum distance between any two objects in the cluster.
 Centroid distance - the average distance of each cluster object from the cluster centroid.
• Clustering can be very effective if the data is naturally clustered, but not if the data is "smeared".
• Clustering can be hierarchical, and the clusters can be stored in multidimensional index tree structures.
For example, consider the root of a B+-tree as
shown with pointers to the data keys 986, 3396,
5411, 8392, and 9544.
Suppose that the tree contains 10,000 tuples with
keys ranging from 1 to 9999.
The data in the tree can be approximated by an
equal-frequency histogram of six buckets
Each bucket contains roughly 10,000/6 items.
Sampling
• Sampling: obtaining a small sample s to
represent the whole data set N
• Allow a mining algorithm to run in complexity that is potentially sub-linear to the size
of the data
Common ways to sample a data set D containing N tuples are
• Simple random sample without replacement (SRSWOR) of size s:
Choose s < N tuples, where the probability of drawing any tuple in D is 1/N, that is, all tuples are
equally likely to be sampled.
• Simple random sample with replacement (SRSWR) of size s: Similar to SRSWOR,
except that after a tuple is drawn, it is placed back in D so that it may be drawn again.
• Cluster sample: If the tuples in D are grouped into M mutually disjoint clusters, then an
SRS of s clusters can be obtained, where s < M.
• Stratified sample: If D is divided into mutually disjoint parts called strata, a stratified
sample of D is generated by obtaining an SRS at each stratum.
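Three of these schemes can be sketched with pandas; the data set, sample size, and strata below are illustrative:

```python
import pandas as pd

# Hypothetical data set D of N = 100 tuples with a stratum attribute
D = pd.DataFrame({"id": range(1, 101),
                  "age_group": ["youth"] * 20 + ["middle"] * 50 + ["senior"] * 30})

# SRSWOR: simple random sample of s = 10 tuples without replacement
srswor = D.sample(n=10, replace=False, random_state=1)

# SRSWR: as above, but a drawn tuple is "placed back" and may be drawn again
srswr = D.sample(n=10, replace=True, random_state=1)

# Stratified sample: an SRS taken within each stratum (10% of each age group)
stratified = D.groupby("age_group").sample(frac=0.1, random_state=1)

print(len(srswor), len(srswr), len(stratified))  # 10 10 10
```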
Advantages and disadvantages of sampling
• Sampling may not reduce database I/Os (data is read a page at a time)
• Simple random sampling may have very poor performance in the presence of skew
• Develop adaptive sampling methods
 Stratified sampling:
 Approximate the percentage of each class (or subpopulation of interest) in
the overall database
 Used in conjunction with skewed data
Data Discretization and Concept Hierarchy Generation
Data discretization techniques
• Divide the range of the attribute into intervals.
• Interval labels can then be used to replace actual data values.
Based on whether class information is used, data discretization techniques are divided into
• Supervised discretization - uses class information
• Unsupervised discretization - does not use class information
Based on which direction the process proceeds, discretization is either
Top-down or Splitting - splits the entire attribute range by one or a few points.
Bottom-up or Merging - merges neighborhood values to form intervals.
• Concept hierarchies can be used to reduce the data by collecting and replacing low-level
concepts (such as numerical values for the attribute age) with higher-level concepts (such as
youth, middle-aged, or senior).
• Mining on a reduced data set requires fewer input/output operations and is more efficient
than mining on a larger, ungeneralized data set.
Discretization and Concept Hierarchy Generation for Numerical Data
Concept hierarchies for numerical attributes can be constructed automatically based on data
discretization using the following methods
• Binning --Top-down split, unsupervised,
• Histogram analysis --Top-down split, unsupervised
• Cluster analysis -- Either top-down split or bottom-up merge, unsupervised
• Entropy-based discretization: supervised, top-down split
• χ2 merging: unsupervised, bottom-up merge
• Discretization by intuitive partitioning: top-down split, unsupervised
Binning
Attribute values can be discretized by applying equal-width or equal-frequency binning, and then
replacing each bin value by the bin mean or median.
It is sensitive to
• User-specified number of bins
• Presence of outliers
Histogram analysis
The histogram analysis algorithm can be applied recursively to each partition in order to
automatically generate a multilevel concept hierarchy, terminating once a prespecified number of
concept levels has been reached. A minimum number of values for each partition at each level is used
to control the recursive procedure.
Entropy-Based Discretization
The value of A that gives the minimum entropy is selected as the split-point.
Let D consist of data tuples defined by a set of attributes and a class-label attribute. The class-label
attribute provides the class information per tuple. The basic method is as follows:
• A split-point for A can partition the tuples in D into two subsets satisfying the conditions
A <= split_point and A > split_point.
• It is unlikely that such a split will divide the tuples cleanly into the classes C1 and C2. The first
partition may contain many tuples of C1, but also some of C2. The amount of information still
needed to arrive at a perfect classification after this partitioning is called the expected
information requirement, given by
Info_A(D) = (|D1| / |D|) Entropy(D1) + (|D2| / |D|) Entropy(D2)
where D1 and D2 are the tuples in D satisfying the conditions A <= split_point and A > split_point,
respectively, and |D| is the number of tuples in D.
• Given m classes, C1, C2, ..., Cm, the entropy of D1 is
Entropy(D1) = - Σ pi log2(pi)   (summed over i = 1, ..., m)
where pi is the probability of class Ci in D1, determined by dividing the number of tuples of
class Ci in D1 by |D1|.
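A small sketch of choosing a split-point by minimizing the expected information requirement; the attribute values and class labels below are made up:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(D) = -sum_i p_i * log2(p_i) over the classes present in D."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return the split-point of A that minimizes Info_A(D)."""
    pairs = sorted(zip(values, labels))
    best_point, best_info = None, float("inf")
    for i in range(1, len(pairs)):
        split = (pairs[i - 1][0] + pairs[i][0]) / 2     # candidate midpoint
        left = [lab for v, lab in pairs if v <= split]
        right = [lab for v, lab in pairs if v > split]
        info = (len(left) / len(pairs)) * entropy(left) \
             + (len(right) / len(pairs)) * entropy(right)
        if info < best_info:
            best_point, best_info = split, info
    return best_point, best_info

# Toy numerical attribute (e.g., age) with class labels C1/C2
ages = [23, 25, 30, 35, 40, 45, 52, 60]
cls  = ["C1", "C1", "C1", "C1", "C2", "C2", "C2", "C2"]
print(best_split(ages, cls))   # the split near 37.5 gives Info_A(D) = 0
```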
Interval Merging by χ2 Analysis
• χ2 tests are performed for every pair of adjacent intervals.
• Adjacent intervals with the least χ2 values are merged, since low χ2 values indicate similar class distributions.
• Merging proceeds recursively and stops when the χ2 values of all pairs of adjacent intervals
exceed a threshold, which is determined by a specified significance level.
• The significance level is typically set between 0.10 and 0.01.
• A high significance level for the χ2 test may cause overdiscretization, while a low value may
lead to underdiscretization.
Cluster Analysis
Clustering takes the distribution of the attribute into consideration, as well as the closeness of data
points, and is therefore able to produce high-quality discretization results.
Discretization by Intuitive Partitioning
• Numerical ranges are partitioned into relatively uniform, easy-to-read intervals that appear
intuitive or "natural."
• 3-4-5 rule can be used for creating a concept hierarchy.The rule is as follows:
If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, then
partition the range into 3 intervals
If it covers 2, 4, or 8 distinct values at the most significant digit, then partition the range
into 4 equal-width intervals.
If it covers 1, 5, or 10 distinct values at the most significant digit, then partition the range
into 5 equal-width intervals.
Concept Hierarchy Generation for Categorical Data
Categorical attributes have a finite (but possibly large) number of distinct values, with no ordering among the
values.
Different methods for the generation of concept hierarchies for categorical data are
1. Specification of a partial ordering of attributes explicitly at the schema level by users
or experts:
For example, “location” may contain the following group of attributes:
street, city, province or state, and country.
A hierarchy can be defined by specifying the total ordering among these attributes at the schema
level, such as street < city < state < country.
2. Specification of a portion of a hierarchy by explicit data grouping:
For example, suppose that state and country form a hierarchy at the schema level. A user could then
define some intermediate levels manually, such as
{Andhra Pradesh, Tamil Nadu, Kerala, Karnataka} ⊂ South India
3. Specification of a set of attributes, but not of their partial ordering:
• A user may specify a set of attributes forming a concept hierarchy, but omit to explicitly
state their partial ordering.
• The system can then try to automatically generate the attribute ordering so as to construct a
meaningful concept hierarchy, using the heuristic rule that the attribute with the most distinct
values is placed at the lowest level of the hierarchy, and attributes with fewer distinct values
are placed at higher levels of the hierarchy.
4.Specification of only a partial set of attributes:
Sometimes users have only a vague idea about what should be included in a hierarchy.
For example:
For “location” attribute the user may have specified only street and city.
To handle such partially specified hierarchies, it is important to embed data semantics in the
database schema so that attributes with tight semantic connections can be pinned together.
Specification of one attribute may trigger a whole group of semantically tightly linked attributes to
be "dragged in" to form a complete hierarchy.
Contenu connexe

Tendances

Dimensionality Reduction
Dimensionality ReductionDimensionality Reduction
Dimensionality ReductionSaad Elbeleidy
 
Data preparation
Data preparationData preparation
Data preparationTony Nguyen
 
Data mining :Concepts and Techniques Chapter 2, data
Data mining :Concepts and Techniques Chapter 2, dataData mining :Concepts and Techniques Chapter 2, data
Data mining :Concepts and Techniques Chapter 2, dataSalah Amean
 
What Is DATA MINING(INTRODUCTION)
What Is DATA MINING(INTRODUCTION)What Is DATA MINING(INTRODUCTION)
What Is DATA MINING(INTRODUCTION)Pratik Tambekar
 
Frequent itemset mining methods
Frequent itemset mining methodsFrequent itemset mining methods
Frequent itemset mining methodsProf.Nilesh Magar
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingkayathri02
 
Data science.chapter-1,2,3
Data science.chapter-1,2,3Data science.chapter-1,2,3
Data science.chapter-1,2,3varshakumar21
 
Decision tree induction \ Decision Tree Algorithm with Example| Data science
Decision tree induction \ Decision Tree Algorithm with Example| Data scienceDecision tree induction \ Decision Tree Algorithm with Example| Data science
Decision tree induction \ Decision Tree Algorithm with Example| Data scienceMaryamRehman6
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data PreprocessingT Kavitha
 
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Data Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingData Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessingSalah Amean
 
Data preprocessing in Data Mining
Data preprocessing in Data MiningData preprocessing in Data Mining
Data preprocessing in Data MiningDHIVYADEVAKI
 
data mining
data miningdata mining
data mininguoitc
 
Data Integration and Transformation in Data mining
Data Integration and Transformation in Data miningData Integration and Transformation in Data mining
Data Integration and Transformation in Data miningkavitha muneeshwaran
 
Mining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsMining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsJustin Cletus
 

Tendances (20)

Dimensionality Reduction
Dimensionality ReductionDimensionality Reduction
Dimensionality Reduction
 
Data preparation
Data preparationData preparation
Data preparation
 
Data mining :Concepts and Techniques Chapter 2, data
Data mining :Concepts and Techniques Chapter 2, dataData mining :Concepts and Techniques Chapter 2, data
Data mining :Concepts and Techniques Chapter 2, data
 
Data reduction
Data reductionData reduction
Data reduction
 
Clusters techniques
Clusters techniquesClusters techniques
Clusters techniques
 
What Is DATA MINING(INTRODUCTION)
What Is DATA MINING(INTRODUCTION)What Is DATA MINING(INTRODUCTION)
What Is DATA MINING(INTRODUCTION)
 
Frequent itemset mining methods
Frequent itemset mining methodsFrequent itemset mining methods
Frequent itemset mining methods
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing ng
Data preprocessing   ngData preprocessing   ng
Data preprocessing ng
 
Artificial Neural Networks for Data Mining
Artificial Neural Networks for Data MiningArtificial Neural Networks for Data Mining
Artificial Neural Networks for Data Mining
 
Data science.chapter-1,2,3
Data science.chapter-1,2,3Data science.chapter-1,2,3
Data science.chapter-1,2,3
 
Decision tree induction \ Decision Tree Algorithm with Example| Data science
Decision tree induction \ Decision Tree Algorithm with Example| Data scienceDecision tree induction \ Decision Tree Algorithm with Example| Data science
Decision tree induction \ Decision Tree Algorithm with Example| Data science
 
Decision tree
Decision treeDecision tree
Decision tree
 
Decision tree
Decision treeDecision tree
Decision tree
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
 
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Data Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingData Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
 
Data preprocessing in Data Mining
Data preprocessing in Data MiningData preprocessing in Data Mining
Data preprocessing in Data Mining
 
data mining
data miningdata mining
data mining
 
Data Integration and Transformation in Data mining
Data Integration and Transformation in Data miningData Integration and Transformation in Data mining
Data Integration and Transformation in Data mining
 
Mining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsMining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and Correlations
 

Similaire à Data Mining: Data Preprocessing

Chapter 2 Cond (1).ppt
Chapter 2 Cond (1).pptChapter 2 Cond (1).ppt
Chapter 2 Cond (1).pptkannaradhas
 
UNIT 2: Part 2: Data Warehousing and Data Mining
UNIT 2: Part 2: Data Warehousing and Data MiningUNIT 2: Part 2: Data Warehousing and Data Mining
UNIT 2: Part 2: Data Warehousing and Data MiningNandakumar P
 
Chapter 3. Data Preprocessing.ppt
Chapter 3. Data Preprocessing.pptChapter 3. Data Preprocessing.ppt
Chapter 3. Data Preprocessing.pptSubrata Kumer Paul
 
data processing.pdf
data processing.pdfdata processing.pdf
data processing.pdfDimpyJindal4
 
03Preprocessing01.pdf
03Preprocessing01.pdf03Preprocessing01.pdf
03Preprocessing01.pdfAlireza418370
 
03 preprocessing
03 preprocessing03 preprocessing
03 preprocessingpurnimatm
 
Data mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updatedData mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updatedYugal Kumar
 
Introduction to data mining
Introduction to data miningIntroduction to data mining
Introduction to data miningUjjawal
 
Machine learning module 2
Machine learning module 2Machine learning module 2
Machine learning module 2Gokulks007
 
Upstate CSCI 525 Data Mining Chapter 3
Upstate CSCI 525 Data Mining Chapter 3Upstate CSCI 525 Data Mining Chapter 3
Upstate CSCI 525 Data Mining Chapter 3DanWooster1
 
03Preprocessing_plp.pptx
03Preprocessing_plp.pptx03Preprocessing_plp.pptx
03Preprocessing_plp.pptxProfPPavanKumar
 
03Preprocessing_plp.pptx
03Preprocessing_plp.pptx03Preprocessing_plp.pptx
03Preprocessing_plp.pptxProfPPavanKumar
 

Similaire à Data Mining: Data Preprocessing (20)

1234
12341234
1234
 
Chapter 2 Cond (1).ppt
Chapter 2 Cond (1).pptChapter 2 Cond (1).ppt
Chapter 2 Cond (1).ppt
 
Unit 3-2.ppt
Unit 3-2.pptUnit 3-2.ppt
Unit 3-2.ppt
 
Chapter 3.pdf
Chapter 3.pdfChapter 3.pdf
Chapter 3.pdf
 
Data integration
Data integrationData integration
Data integration
 
UNIT 2: Part 2: Data Warehousing and Data Mining
UNIT 2: Part 2: Data Warehousing and Data MiningUNIT 2: Part 2: Data Warehousing and Data Mining
UNIT 2: Part 2: Data Warehousing and Data Mining
 
Chapter 3. Data Preprocessing.ppt
Chapter 3. Data Preprocessing.pptChapter 3. Data Preprocessing.ppt
Chapter 3. Data Preprocessing.ppt
 
data processing.pdf
data processing.pdfdata processing.pdf
data processing.pdf
 
Datapreprocessing
DatapreprocessingDatapreprocessing
Datapreprocessing
 
03Preprocessing01.pdf
03Preprocessing01.pdf03Preprocessing01.pdf
03Preprocessing01.pdf
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
 
03 preprocessing
03 preprocessing03 preprocessing
03 preprocessing
 
Data mining
Data miningData mining
Data mining
 
Data mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updatedData mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updated
 
Introduction to data mining
Introduction to data miningIntroduction to data mining
Introduction to data mining
 
Machine learning module 2
Machine learning module 2Machine learning module 2
Machine learning module 2
 
Upstate CSCI 525 Data Mining Chapter 3
Upstate CSCI 525 Data Mining Chapter 3Upstate CSCI 525 Data Mining Chapter 3
Upstate CSCI 525 Data Mining Chapter 3
 
03Preprocessing_plp.pptx
03Preprocessing_plp.pptx03Preprocessing_plp.pptx
03Preprocessing_plp.pptx
 
03Preprocessing.ppt
03Preprocessing.ppt03Preprocessing.ppt
03Preprocessing.ppt
 
03Preprocessing_plp.pptx
03Preprocessing_plp.pptx03Preprocessing_plp.pptx
03Preprocessing_plp.pptx
 

Plus de Lakshmi Sarvani Videla (20)

Data Science Using Python
Data Science Using PythonData Science Using Python
Data Science Using Python
 
Programs on multithreading
Programs on multithreadingPrograms on multithreading
Programs on multithreading
 
Menu Driven programs in Java
Menu Driven programs in JavaMenu Driven programs in Java
Menu Driven programs in Java
 
Recursion in C
Recursion in CRecursion in C
Recursion in C
 
Simple questions on structures concept
Simple questions on structures conceptSimple questions on structures concept
Simple questions on structures concept
 
Errors incompetitiveprogramming
Errors incompetitiveprogrammingErrors incompetitiveprogramming
Errors incompetitiveprogramming
 
Relational Operators in C
Relational Operators in CRelational Operators in C
Relational Operators in C
 
Recursive functions in C
Recursive functions in CRecursive functions in C
Recursive functions in C
 
Function Pointer in C
Function Pointer in CFunction Pointer in C
Function Pointer in C
 
Functions
FunctionsFunctions
Functions
 
Java sessionnotes
Java sessionnotesJava sessionnotes
Java sessionnotes
 
Singlelinked list
Singlelinked listSinglelinked list
Singlelinked list
 
Graphs
GraphsGraphs
Graphs
 
B trees
B treesB trees
B trees
 
Functions in python3
Functions in python3Functions in python3
Functions in python3
 
Dictionary
DictionaryDictionary
Dictionary
 
Sets
SetsSets
Sets
 
Lists
ListsLists
Lists
 
DataStructures notes
DataStructures notesDataStructures notes
DataStructures notes
 
Solutionsfor co2 C Programs for data structures
Solutionsfor co2 C Programs for data structuresSolutionsfor co2 C Programs for data structures
Solutionsfor co2 C Programs for data structures
 

Dernier

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAnitaRaj43
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard37
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMKumar Satyam
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 

Dernier (20)

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by Anitaraj
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptx
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDM
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 

Data Mining: Data Preprocessing

  • 1. Chapter 2: Data Preprocessing • Data cleaning • Data integration and transformation • Data reduction • Discretization and concept hierarchy generation Data Cleaning Data cleaning tasks attempts to • Fill in missing values • Identify outliers and smooth out noisy data • Correct inconsistent data • Resolve redundancy caused by data integration Missing Values Different ways to fill missing values are: 1. Ignore the tuple: • Usually done when class label is missing • Not effective when the percentage of missing values per attribute varies considerably. 2. Fill in the missing value manually: When there is large set of data with many missing values this approach is time-consuming and not feasible. 3. Use a global constant to fill in the missing value: If all missing values are replaced by unknown , then mining program“ ” may mistakenly think that they form an interesting concept. So this method is simple and not foolproof. 4. Use the attribute mean to fill in the missing value: For example, Use average income value to replace the missing value for income. 5. Use the attribute mean for all samples belonging to the same class: For example, if classifying customers according to credit risk, replace the missing value with the average income value for customers in the same credit risk category as that of the given tuple. 6. Use the most probable value to fill in the missing value: For example, using the other customer attributes in data set, construct a decision tree to predict the missing values for income. This may be determined with regression, inference-based tools using a Bayesian formalism also. Methods 3 to 6 bias the data and the filled-in value may not be correct. Method 6 is a popular strategy as it preserves relationships between income and the other attributes. Though data is cleaned after it is seized, data entry procedures should also help minimize the number of missing values by allowing respondents to specify values such as not applicable" in forms and“ ensuring each attribute has one or more rules regarding the null condition. Noisy Data Noise is a random error or variance in a measured variable. Different data smoothing techniques are as follows: • Binning • Regression • Clustering
  • 2. Binning: 1. First sort the data and partition into • Equal-frequency bins each bin contains same number of values.– (or) • Equal width bins interval range values in each bin is constant.–
  • 3. Some binning techniques are • Smoothing by bin means - each value in a bin is replaced by the mean value of the bin. • Smoothing by bin medians - each bin value is replaced by the bin median. • Smoothing by bin boundaries - the minimum and maximum values in a given bin are the bin boundaries. Each bin value is then replaced by the closest boundary value. For Example, Consider sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 * Partition into equal-frequency (equi-depth) bins: - Bin 1: 4, 8, 9, 15 - Bin 2: 21, 21, 24, 25 - Bin 3: 26, 28, 29, 34 * Smoothing by bin means: - Bin 1: 9, 9, 9, 9 - Bin 2: 23, 23, 23, 23 - Bin 3: 29, 29, 29, 29 * Smoothing by bin boundaries: - Bin 1: 4, 4, 4, 15 - Bin 2: 21, 21, 25, 25 - Bin 3: 26, 26, 26, 34 Regression: Data can be smoothed by fitting the data to a function. Linear regression involves finding the best" line to two attributes (or variables), so that one“ attribute can be used to predict the other. Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are to fit a multidimensional surface. Clustering: Outliers may be detected by clustering, where similar values are organized into groups or clusters."“ Data Cleaning as a Process: is a two-step process of discrepancy detection and data transformation that iterates. Discrepancy can be caused by several factors, including poorly designed data entry forms, human error in data entry, deliberate errors and data decay(ex: outdated addresses) and data integration. Using knowledge about domain and data type of each attribute, acceptable values, expected range, dependencies between attributes, inconsistent use of codes and representations like “2004/12/25” and “25/12/2004” and field overloading is another source of error. The data should also be examined regarding Unique rule: each value of given attribute must be different from all other values for that attribute. Consecutive rule: No null values and all values must be unique. Null rule: use of blanks, question marks, special characters that indicate null condition Tools that aid in the step of discrepancy detection are Data scrubbing tools: uses domain knowledge and rely on parsing and fuzzy matching techniques. Data auditing tools analyzes data and discover rules and relationships and detecting data that violate such conditions. Tools that assist in the data transformation are Data migration tools allow simple transformations to be specified such as replace the string “gender” by “sex”. ETL(extraction/transformation/loading)tools allows users to specify transforms through GUI. Some nested discrepancies may only be detected after others have been fixed.
  • 4.
  • 5. Data integration and transformation Data mining requires data integration – the merging of data from multiple data stores. The data also need to be transformed into forms appropriate for mining. Data integration Issues to consider during data integration are schema integration and object matching. For example, how can the data analyst or the computer be sure that customer id in one database and cust_number in another refer to the same attribute? This problem is known as entity identification problem. Metadata can be used to help avoid errors in schema integration. Redundancy is another issue. An attribute (such as annual revenue, for instance) may be redundant if it can be “derived" from another attribute or set of attributes. The use of denormalized tables is another source of data redundancy. Some redundancies can be detected by correlation analysis. Correlation analysis can measure how strongly one attribute implies the other. For numerical attributes Correlation between two attributes, A and B evaluated by computing the correlation coefficient (also known as Pearson's product moment coefficient, named after its inventor, Karl Pearson). This is The higher the value, the stronger the correlation If the resulting value is equal to 0, then A and B are independent and there is no correlation between them. If the resulting value is less than 0, then A and B are negatively correlated. For categorical (discrete) data, a correlation relationship between two attributes, Aand B, can Suppose that a group of 1,500 people was surveyed. The gender of each person and their preferred type of reading material was fiction or nonfiction was noted. The observed frequency (or count) of each possible joint event is summarized in the contingency table shown below
  • 6. The test is based on a significance level, with (r-1) x (c-1) degrees of freedom. For this 2 X2 table, the degrees of freedom is (2-1) (2-1) =1. For 1 degree of freedom, the chi square value needed to reject the hypothesis at the 0.001 significance level is 10.828 (taken from the table of upper percentage points of the chi square distribution) Since computed value is above this, we conclude that the two attributes are (strongly) correlated for the given group of people. A third important issue in data integration is the detection and resolution of data value conflicts. Example 1: a weight attribute may be stored in metric units in one system and British imperial units in another. Example 2:the total sales in one database may refer to one branch of All Electronics, while an attribute of the same name in another database may refer to the total sales for All Electronics stores in a given region. Also the semantic heterogeneity and structure of data pose great challenges in data integration. Data Transformation In data transformation, the data are transformed or consolidated into forms appropriate for mining. Data trans- formation can involve the following: 1. Smoothing, which works to remove noise from the data. Such techniques include binning, regression, and clustering. 2. Aggregation, where summary or aggregation operations are applied to the data. For example, the daily sales data may be aggregated so as to compute monthly and annual total amounts. This step is typically used in constructing a data cube for analysis of the data at multiple granularities. 3. Generalization of the data, where low-level or primitive" (raw) data are replaced by higher- level concepts through the use of concept hierarchies. For example, categorical attributes, like street, can be generalized to higher-level concepts, like city or country. Similarly, values for numerical attributes, like age, may be mapped to higher-level concepts, like youth, middle-aged, and senior.
4. Normalization, where the attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0 or 0.0 to 1.0.
5. Attribute construction (or feature construction), where new attributes are constructed and added from the given set of attributes to help the mining process.

There are many methods for data normalization; three of them are:
• Min-max normalization
• Z-score normalization
• Normalization by decimal scaling

Min-max normalization performs a linear transformation on the original data. It preserves the relationships among the original data values, but it will encounter an "out of bounds" error if a future input case for normalization falls outside of the original data range.

In z-score normalization (or zero-mean normalization), the values for an attribute, A, are normalized based on the mean and standard deviation of A. This method is useful when the actual minimum and maximum of attribute A are unknown, or when there are outliers that dominate the min-max normalization.

Normalization by decimal scaling normalizes by moving the decimal point of values of attribute A. The number of decimal places moved depends on the maximum absolute value of A. A value, v, of A is normalized to
  v' = v / 10^j
where j is the smallest integer such that max(|v'|) < 1.

It is also necessary to save the normalization parameters (such as the mean and standard deviation if using z-score normalization) so that future data can be normalized in a uniform manner.

In attribute construction, new attributes are constructed from the given attributes and added in order to help improve the accuracy and understanding of structure in high-dimensional data.
For example, we may wish to add the attribute area based on the attributes height and width. By combining attributes, attribute construction can discover missing information about the relationships between data attributes that can be useful for knowledge discovery.
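The following is a minimal Python/NumPy sketch of the three normalization methods and of attribute construction described above. The income, height, and width values are illustrative, and the function names are ours.

```python
import numpy as np

def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: linear rescaling into [new_min, new_max]."""
    v = np.asarray(values, dtype=float)
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def z_score(values):
    """Z-score normalization: subtract the mean, divide by the standard deviation."""
    v = np.asarray(values, dtype=float)
    return (v - v.mean()) / v.std()

def decimal_scaling(values):
    """Decimal scaling: divide by 10**j so that max |v'| < 1 (assumes a nonzero maximum)."""
    v = np.asarray(values, dtype=float)
    j = int(np.floor(np.log10(np.abs(v).max()))) + 1
    return v / (10 ** j)

income = [12000, 73600, 98000, 54000]      # illustrative attribute values
print(min_max(income))                     # e.g. 73600 maps to about 0.716
print(z_score(income))
print(decimal_scaling(income))

# Attribute construction: derive a new attribute from existing ones,
# e.g. area from height and width.
height = np.array([2.0, 3.5])
width = np.array([4.0, 1.5])
area = height * width
```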
Data Reduction
Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results. Strategies for data reduction include the following:
• Data cube aggregation
• Attribute subset selection
• Dimensionality reduction
• Numerosity reduction
• Discretization and concept hierarchy generation

1. Data Cube Aggregation
Consider the AllElectronics sales per quarter for the years 2002 to 2004. If you are interested in the annual sales (total per year) rather than the total per quarter, the data can be aggregated so that the resulting data summarize the total sales per year instead of per quarter.
• Data cubes store multidimensional aggregated information.
• Data cubes are created for varying levels of abstraction.
• Each higher level of abstraction further reduces the resulting data size.
• The cube at the highest level of abstraction is the apex cuboid. For the sales data, the apex cuboid would give the total sales for all three years, for all item types, and for all branches.
When replying to data mining requests, the smallest available cuboid relevant to the given task should be used.

2. Attribute Subset Selection
Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes. Heuristic methods are commonly used; basic heuristic methods of attribute subset selection include the following techniques:
1. Stepwise forward selection:
• The procedure starts with an empty set of attributes.
• At each subsequent iteration or step, the best of the remaining original attributes is added to the set.
2. Stepwise backward elimination:
• The procedure starts with the full set of attributes.
• At each step, it removes the worst attribute remaining in the set.
3. Combination of forward selection and backward elimination:
• At each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.
4. Decision tree induction: It constructs a flowchart-like structure where
• each internal (nonleaf) node denotes a test on an attribute,
• each branch corresponds to an outcome of the test, and
• each external (leaf) node denotes a class prediction.
The set of attributes appearing in the tree form the reduced subset of attributes.

3. Dimensionality Reduction
In dimensionality reduction, data encoding or transformations are applied to obtain a reduced or "compressed" representation of the original data. Data reduction is
• Lossless: if the original data can be reconstructed from the compressed data without any loss of information.
• Lossy: if only an approximation of the original data can be reconstructed.
Two popular and effective methods of lossy dimensionality reduction are
• Wavelet transforms and
• Principal components analysis.

Wavelet Transforms
When the discrete wavelet transform (DWT) is applied to a data vector X, it transforms X into a numerically different vector, X', of wavelet coefficients. The two vectors are of the same length, but the wavelet-transformed data can be truncated: only a small fraction of the strongest coefficients needs to be stored. Given such a set of coefficients, an approximation of the original data can be constructed by applying the inverse of the DWT used.
There are several families of DWTs. Popular wavelet transforms include
• Haar-2,
• Daubechies-4, and
• Daubechies-6.
Wavelet transforms can be applied to multidimensional data, such as a data cube. This is done by first applying the transform to the first dimension, then to the second, and so on. Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes, and they are well suited to data of high dimensionality. A sketch of the idea is given below.
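As a rough illustration of wavelet-based reduction, here is a sketch of one level of the orthonormal Haar transform with coefficient truncation, written directly in NumPy (a real application would normally use a wavelet library); the data vector is illustrative and an even length is assumed.

```python
import numpy as np

def haar_dwt(x):
    """One level of the orthonormal Haar transform: approximation and detail coefficients."""
    x = np.asarray(x, dtype=float)
    avg = (x[0::2] + x[1::2]) / np.sqrt(2.0)   # smooth (approximation) coefficients
    det = (x[0::2] - x[1::2]) / np.sqrt(2.0)   # detail coefficients
    return avg, det

def haar_idwt(avg, det):
    """Inverse of one Haar level: reconstruct a signal from its coefficients."""
    x = np.empty(2 * len(avg))
    x[0::2] = (avg + det) / np.sqrt(2.0)
    x[1::2] = (avg - det) / np.sqrt(2.0)
    return x

data = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])
avg, det = haar_dwt(data)

# Lossy reduction: keep only the approximation coefficients (truncate the details),
# then reconstruct an approximation of the original data with the inverse transform.
approx = haar_idwt(avg, np.zeros_like(det))
print(approx)
```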
Principal Components Analysis
Principal components analysis, or PCA (also called the Karhunen-Loeve, or K-L, method), searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k <= n. The basic procedure is as follows:
1. The input data are normalized, so that each attribute falls within the same range.
2. PCA computes k orthonormal unit vectors that provide a basis for the normalized input data. These vectors are referred to as the principal components.
3. The principal components are sorted in order of decreasing "significance" or strength. The transformation is defined in such a way that the first principal component has the largest possible variance (that is, it accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it be orthogonal to (i.e., uncorrelated with) the preceding components.
4. Because the components are sorted in decreasing order of "significance," the size of the data can be reduced by eliminating the weaker components.
Advantages of PCA:
• It is computationally inexpensive.
• It can be applied to both ordered and unordered attributes.
• It can handle sparse data and skewed data.
• Multidimensional data of more than two dimensions can be handled.
• Principal components may be used as inputs to multiple regression and cluster analysis.

4. Numerosity Reduction
Numerosity reduction reduces the data volume by choosing alternative, "smaller" forms of data representation. These techniques can be
• Parametric
• Non-parametric

Parametric methods
In parametric methods, a model is used to estimate the data, so that typically only the model parameters need be stored instead of the actual data. Examples: regression and log-linear models.

Regression and log-linear models
Regression and log-linear models can be used to approximate the given data.
Linear regression: a random variable, y (called a response variable), can be modeled as a linear function of another random variable, x (called a predictor variable), with the equation
  y = w x + b
where
• x and y are numerical database attributes, and
• w and b (called regression coefficients) specify the slope of the line and the y-intercept, respectively. These coefficients can be solved for by the method of least squares.
Multiple linear regression allows a response variable, y, to be modeled as a linear function of two or more predictor variables. Sketches of PCA and of a least-squares line fit are given below.
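Below is a hedged NumPy sketch of the two techniques just described: PCA via eigendecomposition of the covariance matrix, and fitting y = wx + b by least squares. The data are randomly generated or illustrative, and the function name pca is ours.

```python
import numpy as np

def pca(X, k):
    """Project the n-dimensional rows of X onto the k strongest principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                 # step 1: center the data
    cov = np.cov(Xc, rowvar=False)          # covariance of the attributes
    eigvals, eigvecs = np.linalg.eigh(cov)  # orthonormal eigenvectors (step 2)
    order = np.argsort(eigvals)[::-1]       # step 3: sort by decreasing variance
    components = eigvecs[:, order[:k]]      # step 4: keep the k strongest components
    return Xc @ components                  # reduced representation of the data

X = np.random.default_rng(0).normal(size=(100, 5))
print(pca(X, 2).shape)                      # (100, 2)

# Parametric numerosity reduction: fit y = w*x + b by least squares and store
# only the two regression coefficients instead of the raw (x, y) pairs.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])
w, b = np.polyfit(x, y, deg=1)
print(w, b)
```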
Log-linear models approximate discrete multidimensional probability distributions. Given a set of tuples in n dimensions (i.e., n attributes), each tuple can be considered as a point in n-dimensional space. Log-linear models estimate the probability of each point in the multidimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations.
Properties of regression and log-linear models:
• Regression can be computationally intensive when applied to high-dimensional data.
• Regression handles skewed data exceptionally well.
• Regression and log-linear models can both be used on sparse data, although their application may be limited.
• Log-linear models show good scalability for up to 10 or so dimensions.
• Log-linear models are also useful for dimensionality reduction and data smoothing.

Non-parametric methods
Non-parametric methods for storing reduced representations of the data include histograms, clustering, and sampling.

Histograms
A histogram for an attribute, A, partitions the data distribution of A into disjoint subsets, or buckets.
Example: The following data are a list of prices of commonly sold items at AllElectronics. The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
There are several partitioning rules, including the following:
• Equal-width: In an equal-width histogram, the width of each bucket range is uniform.
• Equal-frequency (or equi-depth): The buckets are created so that, roughly, the frequency of each bucket is constant.
• V-Optimal: Of all possible histograms, the V-Optimal histogram is the one with the least variance.
• MaxDiff: The difference between each pair of adjacent values is considered. A bucket boundary is established between each pair of adjacent values for the β − 1 pairs having the largest differences, where β is the user-specified number of buckets.
V-Optimal and MaxDiff histograms tend to be the most accurate and practical. Histograms are highly effective at approximating both sparse and dense data, as well as highly skewed and uniform data. Multidimensional histograms can capture dependencies between attributes and are effective in approximating data with up to about five attributes.

Clustering
• Partition the data set into clusters based on similarity, and store only the cluster representation (e.g., centroid and diameter).
• Similarity is commonly defined by how "close" the objects are in space, based on a distance function.
• The "quality" of a cluster may be represented by its
  - Diameter – the maximum distance between any two objects in the cluster.
  - Centroid distance – the average distance of each cluster object from the cluster centroid.
• Clustering can be very effective if the data are naturally clustered, but not if the data are "smeared."
• Hierarchical clustering is also possible, and the clusters can be stored in multidimensional index tree structures.
For example, consider the root of a B+-tree with pointers to the data keys 986, 3396, 5411, 8392, and 9544. Suppose that the tree contains 10,000 tuples with keys ranging from 1 to 9999. The data in the tree can be approximated by an equal-frequency histogram of six buckets, where each bucket contains roughly 10,000/6 items.

Sampling
• Sampling represents a large data set, N, by a much smaller random sample (or subset), s, of the data.
• It allows a mining algorithm to run in complexity that is potentially sub-linear in the size of the data.
Common ways to sample a data set D containing N tuples are:
• Simple random sample without replacement (SRSWOR) of size s: Draw s of the N tuples from D (s < N), where the probability of drawing any tuple in D is 1/N, that is, all tuples are equally likely to be sampled.
• Simple random sample with replacement (SRSWR) of size s: Similar to SRSWOR, except that after a tuple is drawn, it is placed back in D so that it may be drawn again.
• Cluster sample: If the tuples in D are grouped into M mutually disjoint "clusters," then an SRS of s clusters can be obtained, where s < M.
• Stratified sample: If D is divided into mutually disjoint parts called strata, a stratified sample of D is generated by obtaining an SRS at each stratum.
Advantages and disadvantages of sampling:
• Sampling may not reduce database I/Os (data are read a page at a time).
• Simple random sampling may have very poor performance in the presence of skew, so adaptive sampling methods such as stratified sampling are used:
  - Approximate the percentage of each class (or subpopulation of interest) in the overall database.
  - Used in conjunction with skewed data.
A minimal sampling sketch follows.
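The sketch below illustrates SRSWOR, SRSWR, and stratified sampling in NumPy, assuming an illustrative data set of 10,000 tuples and made-up strata boundaries.

```python
import numpy as np

rng = np.random.default_rng(42)
D = np.arange(10_000)          # stand-in for the N tuples of data set D
s = 100                        # desired sample size

srswor = rng.choice(D, size=s, replace=False)   # simple random sample without replacement
srswr = rng.choice(D, size=s, replace=True)     # simple random sample with replacement

# Stratified sample: take an SRS from each stratum, proportional to its size,
# so that skewed subpopulations are still represented (strata are illustrative).
strata = {"youth": D[:2_000], "middle_aged": D[2_000:9_000], "senior": D[9_000:]}
stratified = np.concatenate([
    rng.choice(part, size=max(1, int(s * len(part) / len(D))), replace=False)
    for part in strata.values()
])
print(len(srswor), len(srswr), len(stratified))
```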
Data Discretization and Concept Hierarchy Generation
Data discretization techniques divide the range of an attribute into intervals; interval labels can then be used to replace the actual data values. Based on how the discretization is performed, techniques can be
• Supervised discretization – uses class information.
• Unsupervised discretization – does not use class information.
Based on the direction in which it proceeds, discretization can be
• Top-down (splitting) – splits the entire attribute range by one or a few points, and repeats this recursively on the resulting intervals.
• Bottom-up (merging) – merges neighboring values to form intervals, and applies this recursively to the resulting intervals.
Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts (such as numerical values for the attribute age) by higher-level concepts (such as youth, middle-aged, or senior). Mining on a reduced data set requires fewer input/output operations and is more efficient than mining on a larger, ungeneralized data set.

Discretization and Concept Hierarchy Generation for Numerical Data
Concept hierarchies for numerical attributes can be constructed automatically based on data discretization, using the following methods:
• Binning – top-down split, unsupervised
• Histogram analysis – top-down split, unsupervised
• Cluster analysis – either top-down split or bottom-up merge, unsupervised
• Entropy-based discretization – top-down split, supervised
• Interval merging by χ2 analysis – bottom-up merge, supervised (the χ2 test uses class information)
• Discretization by intuitive partitioning – top-down split, unsupervised

Binning
Attribute values can be discretized by applying equal-width or equal-frequency binning, and then replacing each bin value by the bin mean or median. Binning is sensitive to
• the user-specified number of bins, and
• the presence of outliers.

Histogram analysis
The histogram analysis algorithm can be applied recursively to each partition in order to automatically generate a multilevel concept hierarchy, terminating once a prespecified number of concept levels has been reached. A minimum number of values per partition at each level can be used to control the recursive procedure.

Entropy-Based Discretization
Let D consist of data tuples defined by a set of attributes and a class-label attribute; the class-label attribute provides the class information per tuple. To discretize a numerical attribute A, the value of A that minimizes the expected information requirement is selected as the split point. The basic method is as follows:
• A split point for A partitions the tuples in D into two subsets, D1 and D2, satisfying the conditions A <= split_point and A > split_point, respectively.
• It is unlikely that such a split separates the classes C1 and C2 perfectly: the first partition may contain many tuples of C1, but also some of C2. The amount of information still needed for a perfect classification after this partitioning is called the expected information requirement, given by
  Info_A(D) = (|D1| / |D|) · Entropy(D1) + (|D2| / |D|) · Entropy(D2)
where |D| is the number of tuples in D, and |D1| and |D2| are the numbers of tuples in D1 and D2. Given m classes, C1, C2, ..., Cm, the entropy of D1 is
  Entropy(D1) = − Σ_{i=1}^{m} p_i log2(p_i)
where p_i is the probability of class Ci in D1, determined by dividing the number of tuples of class Ci in D1 by |D1|; Entropy(D2) is computed analogously. The split point that minimizes Info_A(D) is selected, and the procedure can be applied recursively to each resulting interval. A minimal sketch is shown below.
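The following is a small Python sketch of entropy-based selection of a single split point, implementing the formulas above; the age/risk toy data and the function names are illustrative only.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Entropy(D) = -sum_i p_i * log2(p_i) over the classes present in D."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split_point(values, labels):
    """Choose the split point of attribute A that minimizes
    Info_A(D) = |D1|/|D| * Entropy(D1) + |D2|/|D| * Entropy(D2)."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    candidates = np.unique(values)[:-1]          # every value except the maximum
    best, best_info = None, np.inf
    for split in candidates:
        left, right = labels[values <= split], labels[values > split]
        info = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if info < best_info:
            best, best_info = split, info
    return best, best_info

age = [23, 25, 30, 35, 40, 45, 52, 60]                                   # attribute A
risk = ["low", "low", "low", "high", "high", "high", "high", "high"]     # class labels
print(best_split_point(age, risk))   # splits cleanly at 30 for this toy data
```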
Interval Merging by χ2 Analysis
• χ2 tests are performed for every pair of adjacent intervals.
• Adjacent intervals with the least χ2 values are merged, since a low χ2 value indicates that the class is independent of which of the two intervals a tuple falls in (i.e., their class distributions are similar).
• Merging proceeds recursively and stops when the χ2 values of all pairs of adjacent intervals exceed a threshold, which is determined by a specified significance level.
• The significance level is typically set between 0.10 and 0.01.
• Too high a significance level for the χ2 test may cause over-discretization, while too low a value may lead to under-discretization.

Cluster Analysis
Clustering takes the distribution of the attribute into consideration, as well as the closeness of data points, and is therefore able to produce high-quality discretization results.

Discretization by Intuitive Partitioning
• Numerical ranges are partitioned into relatively uniform, easy-to-read intervals that appear intuitive or "natural."
• The 3-4-5 rule can be used to create a concept hierarchy. The rule is as follows:
  - If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 intervals.
  - If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 equal-width intervals.
  - If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 equal-width intervals.

Concept Hierarchy Generation for Categorical Data
Categorical attributes have a finite (but possibly large) number of distinct values, with no ordering among the values. Methods for generating concept hierarchies for categorical data include the following:
1. Specification of a partial ordering of attributes explicitly at the schema level by users or experts: For example, "location" may contain the group of attributes street, city, province or state, and country. A hierarchy can be defined by specifying the total ordering among these attributes at the schema level, such as street < city < state < country.
2. Specification of a portion of a hierarchy by explicit data grouping: For example, while state and country form a hierarchy at the schema level, a user could define some intermediate levels manually, such as {Andhra Pradesh, Tamil Nadu, Kerala, Karnataka} ⊂ South India.
3. Specification of a set of attributes, but not of their partial ordering:
• A user may specify a set of attributes forming a concept hierarchy but omit to state their partial ordering explicitly.
• The system can then try to generate the attribute ordering automatically so as to construct a meaningful concept hierarchy, using the heuristic rule that the attribute with the most distinct values is placed at the lowest level of the hierarchy, while attributes with fewer distinct values are placed at higher levels (see the sketch after this list).
4. Specification of only a partial set of attributes: Sometimes users have only a vague idea about what should be included in a hierarchy. For example, for the "location" attribute the user may have specified only street and city. To handle such partially specified hierarchies, it is important to embed data semantics in the database schema so that attributes with tight semantic connections can be pinned together: the specification of one attribute may then trigger a whole group of semantically tightly linked attributes to be "dragged in" to form a complete hierarchy.
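As a sketch of the distinct-value heuristic in method 3, the following Python snippet orders attributes by their number of distinct values in a few illustrative tuples (the tuples and attribute names are made up for the example).

```python
from collections import defaultdict

# Illustrative location tuples; in practice these would come from the database.
tuples = [
    {"street": "MG Road", "city": "Bengaluru", "state": "Karnataka", "country": "India"},
    {"street": "Anna Salai", "city": "Chennai", "state": "Tamil Nadu", "country": "India"},
    {"street": "Brigade Road", "city": "Bengaluru", "state": "Karnataka", "country": "India"},
]

# Count the distinct values observed for each attribute.
distinct = defaultdict(set)
for t in tuples:
    for attr, value in t.items():
        distinct[attr].add(value)

# Most distinct values -> lowest hierarchy level; fewest -> highest level.
lowest_to_highest = sorted(distinct, key=lambda a: len(distinct[a]), reverse=True)
print(" < ".join(lowest_to_highest))   # e.g. street < city < state < country
```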