Data preprocessing
1. Data Mining
Dr. J. Kalavathi, M.Sc., Ph.D.,
Assistant Professor,
Department of Information Technology,
V.V.Vanniaperumal College for Women,
Virudhunagar.
2. Data mining aims at discovering relationships and other forms
of knowledge from data in the real world.
A measurement function maps entities in the application domain
to a symbolic representation.
Data in the real world is dirty
incomplete: missing data, lacking attribute values, lacking
certain attributes of interest, or containing only aggregate data
noisy: containing errors, such as measurement errors, or outliers
inconsistent: containing discrepancies in codes or names
distorted: sampling distortion (a change for the worse)
3. • No quality data, no quality mining results
• Quality decisions must be based on quality data
• Data warehouse needs consistent integration of
quality data
4. Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation (convert the data into forms suitable
for mining; remove noise from the data)
Data reduction
Obtains reduced representation in volume but produces the same
or similar analytical results
Data discretization
Part of data reduction but with particular importance, especially
for numerical data
7. Data cleaning tasks
• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
• Resolve redundancy caused by data integration
8. (a) Missing values in data entry
• Data is not always available
E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data may not be considered important at the time of
entry
the history or changes of the data were not registered
• Missing data may need to be inferred.
9. (b) Missing Values in existing data
Missing-value methods can be used to handle missing data in existing
databases, as well as data left unknown or marked not applicable during entry.
Methods of handling missing data
Ignore the tuple (instance) - when the class label is missing (assuming the
data mining goal is classification), or when many attributes are missing from the
row (not just one).
Fill in the missing value manually – search for all missing values and
replace them with appropriate values.
Use a global constant to fill in for missing values - Decide on a new
global constant value, like “unknown”, “N/A” or minus infinity, that will be
used to fill all the missing values.
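The first three methods above can be sketched with pandas (a minimal sketch: the toy table, column names, and the sentinel value -1 are all illustrative assumptions, not from the slides):

```python
import numpy as np
import pandas as pd

# Toy sales table with missing entries (illustrative data).
df = pd.DataFrame({
    "customer": ["A", "B", "C", "D"],
    "income":   [50000, np.nan, 62000, np.nan],
    "class":    ["buys", np.nan, "buys", "no"],
})

# Method 1: ignore tuples whose class label is missing.
labeled = df.dropna(subset=["class"])

# Method 3: fill the remaining gaps with a global constant
# (here the sentinel -1; a string like "unknown" would also work).
filled = labeled.fillna({"income": -1})
```

Manual fill-in (method 2) would amount to inspecting `df[df["income"].isna()]` and assigning values row by row.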
10. Use attribute mean to fill in the missing values - Replace missing values of an
attribute with the mean (or the median, if the attribute is discrete) value for that
attribute in the database.
Use attribute mean for all samples belonging to the same class as the given
tuple - Instead of using the mean (or median) of a certain attribute calculated by
looking at all the rows in a database, we can limit the calculations to the relevant
class to make the value more relevant to the row we’re looking at.
Use a data mining algorithm to predict the most probable value - The value
can be determined using regression, inference-based tools using Bayesian
formalism, decision trees, or clustering algorithms (k-means, k-median, etc.).
EM (Expectation Maximization) Method –
Compute the expected value of the complete data record.
Substitute the missing values by the expected values.
Multiple imputation - this process creates multiple data
matrices, containing plausible raw data values to fill the gaps in an
existing database.
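The global-mean and class-conditional-mean methods above can be sketched with pandas (a minimal sketch; the toy column names `cls` and `income` are assumptions):

```python
import numpy as np
import pandas as pd

# Toy data: two classes, two missing income values (illustrative).
df = pd.DataFrame({
    "cls":    ["yes", "yes", "no", "no", "yes"],
    "income": [40.0, np.nan, 30.0, np.nan, 50.0],
})

# (1) Global mean imputation: one mean over the whole column.
global_mean = df["income"].mean()               # (40 + 30 + 50) / 3 = 40.0
filled_global = df["income"].fillna(global_mean)

# (2) Class-conditional mean: fill each gap with the mean of
#     the rows that share the same class label.
filled_by_class = df["income"].fillna(
    df.groupby("cls")["income"].transform("mean")
)
```

Note how row 1 ("yes") gets the "yes"-class mean (45.0) rather than the global mean (40.0), making the imputed value more relevant to that row.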
11. • Noise is a random error or variance in a measured variable.
• Noisy data is meaningless data that cannot be interpreted by machines.
• It may arise from faulty data collection instruments, data entry
problems, and technology limitations. It can be handled in the
following ways:
14. • Binning methods smooth a sorted data value by consulting its “neighborhood,”
that is, the values around it.
• The sorted values are distributed into a number of “buckets,” or bins.
• The whole data set is divided into segments of equal size, and then various
smoothing methods are applied.
• Each segment is handled separately.
• All values in a segment can be replaced by the segment's mean, or the
segment's boundary values can be used instead.
15. • First sort the data and partition it into (equi-depth) bins.
• Then smooth by bin means, by bin medians, by bin
boundaries, etc.
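The two smoothing variants can be sketched with NumPy (a minimal sketch; the nine values and the three equi-depth bins are an illustrative example, not data from the slides):

```python
import numpy as np

# Sorted toy data, partitioned into 3 equi-depth bins of 3 values each.
data = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = data.reshape(3, 3)

# Smoothing by bin means: every value becomes its bin's mean.
by_means = np.repeat(bins.mean(axis=1), 3)

# Smoothing by bin boundaries: each value snaps to whichever
# bin edge (min or max of its bin) is closer.
low, high = bins[:, :1], bins[:, -1:]
by_bounds = np.where(np.abs(bins - low) <= np.abs(bins - high),
                     low, high).ravel()
```

Here the bin means are 9, 22, and 29, so smoothing by means yields three runs of identical values; smoothing by boundaries keeps each value at one of its bin's two edges.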
17. • Regression is a data mining technique used to fit an equation to a dataset.
• Data can be smoothed by fitting it to a regression function.
• The regression may be linear (one independent variable) or
multiple (several independent variables).
• Linear Regression
• 𝑌 = 𝑏 + 𝑚𝑥
• 𝑥 → given (independent) value, 𝑚 → slope, 𝑏 → intercept
• 𝑌 → predicted value
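Regression-based smoothing can be sketched with NumPy's least-squares fit (a minimal sketch; the noisy sample points around the line y = 2x + 1 are synthetic illustrative data):

```python
import numpy as np

# Synthetic noisy samples scattered around a true line y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

# Least-squares fit of Y = b + m*x: recover slope m and intercept b.
m, b = np.polyfit(x, y, 1)

# Smoothed data: replace each y by its fitted value on the line.
y_smooth = m * x + b
```

The fitted slope and intercept land close to the true values (m ≈ 2, b ≈ 1), and the smoothed series lies exactly on the fitted line, with the measurement noise removed.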
19. • Combined computer and human inspection: detect
suspicious values automatically and have a human check them
(e.g., to deal with possible outliers)
20. This approach groups similar data into clusters. Values that fall
outside the clusters can be treated as outliers, though some outliers may
go undetected.
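Clustering-based outlier detection can be sketched with a minimal k-means-style loop in NumPy (a sketch under assumptions: the 2-D points are synthetic, k = 3 and the seed points are hand-picked, and "a cluster with only one member is an outlier" is an illustrative heuristic):

```python
import numpy as np

# Synthetic 2-D points: two tight clusters plus one far-away outlier.
pts = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
                [5.0, 5.0], [5.1, 4.9], [4.9, 5.2],
                [20.0, 20.0]])

# Minimal k-means loop (k = 3, seeded with hand-picked points).
centers = pts[[0, 3, 6]].copy()
for _ in range(10):
    d = np.linalg.norm(pts[:, None] - centers[None], axis=2)
    labels = d.argmin(axis=1)                       # nearest center
    centers = np.array([pts[labels == k].mean(axis=0) for k in range(3)])

# Heuristic: points in a near-empty cluster fall "outside" the
# real clusters and are flagged as outliers.
sizes = np.bincount(labels, minlength=3)
outliers = sizes[labels] <= 1
```

In practice one would use a library clusterer (e.g. scikit-learn's KMeans) and a less ad-hoc threshold, but the idea is the same: the isolated point ends up in its own tiny cluster and is flagged.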