Data preprocessing
1. Data Mining
Dr. J. Kalavathi, M.Sc., Ph.D.,
Assistant Professor,
Department of Information Technology,
V.V.Vanniaperumal College for Women,
Virudhunagar.
2. Data mining aims at discovering relationships and other forms
of knowledge from data in the real world.
A measurement function maps entities in the application domain
to a symbolic representation.
Data in the real world is dirty
incomplete: missing data, lacking attribute values, lacking
certain attributes of interest, or containing only aggregate data
noisy: containing errors, such as measurement errors, or outliers
inconsistent: containing discrepancies in codes or names
distorted: sampling distortion (a change for the worse)
3. • No quality data, no quality mining results
• Quality decisions must be based on quality data
• Data warehouse needs consistent integration of
quality data
4. Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation (convert the data into forms suitable
for mining; remove noise from the data)
Data reduction
Obtains reduced representation in volume but produces the same
or similar analytical results
Data discretization
Part of data reduction but with particular importance, especially
for numerical data
7. Data cleaning tasks
• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
• Resolve redundancy caused by data integration
8. (a) Missing values in data entry
• Data is not always available
E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data may not be considered important at the time of
entry
the history or changes of the data were not registered
• Missing data may need to be inferred.
9. (b) Missing Values in existing data
Missing-value methods can be used to handle missing data in existing
databases, as well as data left unknown or marked not applicable during entry.
Methods of handling missing data
Ignore the tuple (instance) - when the class label is missing (assuming the
data mining goal is classification), or when many attributes are missing from the
row (not just one).
Fill in the missing value manually – search for all missing values and
replace them with appropriate values.
Use a global constant to fill in for missing values - Decide on a new
global constant value, like “unknown”, “N/A” or minus infinity, that will be
used to fill all the missing values.
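The first three methods above can be sketched with pandas (a minimal sketch: the toy table, column names, and the sentinel value -1 are all illustrative assumptions, not from the slides):

```python
import numpy as np
import pandas as pd

# Toy sales table with missing entries (illustrative data).
df = pd.DataFrame({
    "customer": ["A", "B", "C", "D"],
    "income":   [50000, np.nan, 62000, np.nan],
    "class":    ["buys", np.nan, "buys", "no"],
})

# Method 1: ignore tuples whose class label is missing.
labeled = df.dropna(subset=["class"])

# Method 3: fill the remaining gaps with a global constant
# (here the sentinel -1; a string like "unknown" would also work).
filled = labeled.fillna({"income": -1})
```

Manual fill-in (method 2) would amount to inspecting `df[df["income"].isna()]` and assigning values row by row.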
10. Use attribute mean to fill in the missing values - Replace missing values of an
attribute with the mean (or the median, if the attribute is discrete) value for that
attribute in the database.
Use attribute mean for all samples belonging to the same class as the given
tuple - Instead of using the mean (or median) of a certain attribute calculated by
looking at all the rows in a database, we can limit the calculations to the relevant
class to make the value more relevant to the row we’re looking at.
Use a data mining algorithm to predict the most probable value - The value
can be determined using regression, inference-based tools using Bayesian
formalism, decision trees, or clustering algorithms (k-means, k-median, etc.).
EM (Expectation Maximization) Method –
Compute the expected value of the complete data record.
Substitute the missing values by the expected values.
Multiple imputation - this process creates multiple data
matrices, containing plausible raw data values to fill the gaps in an
existing database.
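The global-mean and class-conditional-mean methods above can be sketched with pandas (a minimal sketch; the toy column names `cls` and `income` are assumptions):

```python
import numpy as np
import pandas as pd

# Toy data: two classes, two missing income values (illustrative).
df = pd.DataFrame({
    "cls":    ["yes", "yes", "no", "no", "yes"],
    "income": [40.0, np.nan, 30.0, np.nan, 50.0],
})

# (1) Global mean imputation: one mean over the whole column.
global_mean = df["income"].mean()               # (40 + 30 + 50) / 3 = 40.0
filled_global = df["income"].fillna(global_mean)

# (2) Class-conditional mean: fill each gap with the mean of
#     the rows that share the same class label.
filled_by_class = df["income"].fillna(
    df.groupby("cls")["income"].transform("mean")
)
```

Note how row 1 ("yes") gets the "yes"-class mean (45.0) rather than the global mean (40.0), making the imputed value more relevant to that row.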
11. • Noise is a random error or variance in a measured variable.
• Noisy data is meaningless data that cannot be interpreted by machines.
• It may arise from faulty data collection instruments, data entry
problems, and technology limitations. It can be handled in the
following ways:
14. • Binning methods smooth a sorted data value by consulting its “neighborhood,”
that is, the values around it.
• The sorted values are distributed into a number of “buckets,” or bins.
• The whole data set is divided into segments of equal size, and then various
smoothing methods are applied.
• Each segment is handled separately.
• All values in a segment can be replaced by the segment's mean, or the
segment's boundary values can be used instead.
15. • First sort the data and partition it into (equi-depth) bins.
• Then smooth by bin means, by bin medians, by bin
boundaries, etc.
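The two smoothing variants can be sketched with NumPy (a minimal sketch; the nine values and the three equi-depth bins are an illustrative example, not data from the slides):

```python
import numpy as np

# Sorted toy data, partitioned into 3 equi-depth bins of 3 values each.
data = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = data.reshape(3, 3)

# Smoothing by bin means: every value becomes its bin's mean.
by_means = np.repeat(bins.mean(axis=1), 3)

# Smoothing by bin boundaries: each value snaps to whichever
# bin edge (min or max of its bin) is closer.
low, high = bins[:, :1], bins[:, -1:]
by_bounds = np.where(np.abs(bins - low) <= np.abs(bins - high),
                     low, high).ravel()
```

Here the bin means are 9, 22, and 29, so smoothing by means yields three runs of identical values; smoothing by boundaries keeps each value at one of its bin's two edges.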
17. • Regression is a data mining technique used to fit an equation to a dataset.
• Data can be smoothed by fitting it to a regression function.
• The regression may be linear (one independent variable) or
multiple (several independent variables).
• Linear Regression
• 𝑌 = 𝑏 + 𝑚𝑥
• 𝑥 → given (independent) value, 𝑚 → slope, 𝑏 → intercept
• 𝑌 → predicted value
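Regression-based smoothing can be sketched with NumPy's least-squares fit (a minimal sketch; the noisy sample points around the line y = 2x + 1 are synthetic illustrative data):

```python
import numpy as np

# Synthetic noisy samples scattered around a true line y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

# Least-squares fit of Y = b + m*x: recover slope m and intercept b.
m, b = np.polyfit(x, y, 1)

# Smoothed data: replace each y by its fitted value on the line.
y_smooth = m * x + b
```

The fitted slope and intercept land close to the true values (m ≈ 2, b ≈ 1), and the smoothed series lies exactly on the fitted line, with the measurement noise removed.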
19. • Combined computer and human inspection: detect
suspicious values automatically and have a human check them
(e.g., to deal with possible outliers)
20. This approach groups similar data into clusters. Values that fall
outside the clusters can be treated as outliers, though some outliers may
go undetected.
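Clustering-based outlier detection can be sketched with a minimal k-means-style loop in NumPy (a sketch under assumptions: the 2-D points are synthetic, k = 3 and the seed points are hand-picked, and "a cluster with only one member is an outlier" is an illustrative heuristic):

```python
import numpy as np

# Synthetic 2-D points: two tight clusters plus one far-away outlier.
pts = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
                [5.0, 5.0], [5.1, 4.9], [4.9, 5.2],
                [20.0, 20.0]])

# Minimal k-means loop (k = 3, seeded with hand-picked points).
centers = pts[[0, 3, 6]].copy()
for _ in range(10):
    d = np.linalg.norm(pts[:, None] - centers[None], axis=2)
    labels = d.argmin(axis=1)                       # nearest center
    centers = np.array([pts[labels == k].mean(axis=0) for k in range(3)])

# Heuristic: points in a near-empty cluster fall "outside" the
# real clusters and are flagged as outliers.
sizes = np.bincount(labels, minlength=3)
outliers = sizes[labels] <= 1
```

In practice one would use a library clusterer (e.g. scikit-learn's KMeans) and a less ad-hoc threshold, but the idea is the same: the isolated point ends up in its own tiny cluster and is flagged.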