1. Introduction
2. Data Quality: Why Preprocess the Data?
3. Data Preprocessing Tasks
4. Data Cleaning
5. Data Integration
6. Data Reduction
7. Data Transformation and Data Discretization
8. Conclusion
• Data preprocessing is the step that comes before applying data mining
techniques.
• Low-quality data will lead to low-quality mining results.
• So we need to apply data preprocessing techniques such as:
- Data quality
- Data cleaning
- Data integration
- Data reduction
- Data transformation
- Data discretization
• Data have quality if they satisfy the requirements of the intended use.
• There are many factors comprising data quality, including:
– Accuracy
– Completeness
– Consistency
– Timeliness
– Believability
– Interpretability
• Data cleaning routines attempt to fill in missing values, smooth out
noise while identifying outliers, and correct inconsistencies in the data.
• Basic methods of data cleaning:
– Missing values
– Noisy data
– Data cleaning as a process
• Ignore the tuple
• Fill in the missing value manually
[time consuming and often infeasible]
• Fill it in automatically with
[a global constant, e.g., “Unknown” or ∞]
• Use the most probable value to fill in the missing value [regression,
inference-based tools using Bayesian formalism, or decision tree
induction]
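The missing-value strategies above can be sketched in a few lines of Python. This is only an illustration: the records, the attribute name `income`, and the helper names are hypothetical, and the attribute mean is used as a simple stand-in for the "most probable value" (the regression, Bayesian, or decision tree methods named above would replace it in practice).

```python
# Hypothetical records; None marks a missing value.
records = [
    {"name": "A", "income": 52000},
    {"name": "B", "income": None},
    {"name": "C", "income": 61000},
]

def ignore_tuples(rows, attr):
    """Drop every record whose `attr` value is missing."""
    return [r for r in rows if r[attr] is not None]

def fill_constant(rows, attr, constant="Unknown"):
    """Replace each missing `attr` value with one global constant."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]

def fill_mean(rows, attr):
    """Replace each missing `attr` value with the mean of the observed values.

    Assumption: the mean is a simplified stand-in for the most probable value.
    """
    observed = [r[attr] for r in rows if r[attr] is not None]
    mean = sum(observed) / len(observed)
    return [{**r, attr: mean if r[attr] is None else r[attr]} for r in rows]

print(ignore_tuples(records, "income"))
print(fill_constant(records, "income"))
print(fill_mean(records, "income"))
```

Each helper returns a new list and leaves the input untouched, so the three strategies can be compared on the same data.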
• Noise is the random error or variance in a measured variable.
• Binning:
Binning methods smooth a sorted data value by consulting its
“neighborhood”, that is, the values around it.
The sorted values are distributed into a number of “buckets”, or
“bins”.
• Smoothing by bin means:
Each value in a bin is replaced by the mean value of the bin [e.g., the
mean of 4, 8, 15 is 9, so each value in that bin becomes 9].
• Smoothing by bin medians:
Each value in a bin is replaced by the bin median.
• Smoothing by bin boundaries:
The minimum and maximum values in a given bin are identified as
the bin boundaries. Each bin value is then replaced by the closest
boundary value.
Binning is also used as a discretization technique.
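The three smoothing variants can be sketched as follows. The first bin (4, 8, 15) comes from the slide; the remaining values, the equal-frequency partitioning with three values per bin, and the tie-breaking rule for boundaries are assumptions made for illustration.

```python
from statistics import mean, median

def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into bins of `bin_size` values each."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max.

    Assumption: a value exactly halfway goes to the lower boundary.
    """
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # each bin is already sorted
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

data = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative values
bins = equal_frequency_bins(data, 3)
print(smooth_by_means(bins))
print(smooth_by_medians(bins))
print(smooth_by_boundaries(bins))
```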
• Regression:
Data smoothing can also be done by regression, a technique that
conforms data values to a function.
– Linear regression involves finding the “best” line to fit two
attributes, so that one attribute can be used to predict the other.
– Multiple linear regression is an extension of linear regression in
which more than two attributes are involved.
• Outlier analysis:
Outliers may be detected by clustering, where similar values are
organized into groups, or clusters; values that fall outside the
clusters may be considered outliers.
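As a rough sketch of smoothing by linear regression, the code below fits a least-squares line to two attributes and replaces each observed value with the value on the line. The function names and the sample data are hypothetical, and ordinary least squares is assumed as the meaning of the "best" line.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are equal; the line is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def smooth_by_regression(xs, ys):
    """Replace each y with the value predicted from x by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]

# Hypothetical noisy measurements that roughly follow y = 2x.
xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]
print(smooth_by_regression(xs, ys))
```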
• The first step in data cleaning is discrepancy detection
[finding inconsistent data].
• The data should be examined regarding:
– Unique rule [each value of the given attribute must be different
from all other values of that attribute]
– Consecutive rule [no missing values between the lowest and highest
values of the attribute]
– Null rule [specifies the use of blanks, question marks, special
characters, or other strings that may indicate a null condition]
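The three rules can be checked with short helper functions. This is a minimal sketch: the function names are hypothetical, the consecutive rule is assumed to apply to integer-valued attributes, and the set of null markers is an assumption that would be defined per data set.

```python
# Assumed markers for the null condition; real data sets define their own.
NULL_MARKERS = {"", "?", "N/A"}

def violates_unique(values):
    """Unique rule: True if any value of the attribute occurs more than once."""
    return len(set(values)) != len(values)

def missing_consecutive(values):
    """Consecutive rule: list the integers missing between the min and max."""
    present = set(values)
    return [v for v in range(min(present), max(present) + 1) if v not in present]

def null_positions(values, markers=NULL_MARKERS):
    """Null rule: positions whose value is None or one of the null markers."""
    return [i for i, v in enumerate(values)
            if v is None or str(v).strip() in markers]

print(violates_unique([101, 102, 102]))
print(missing_consecutive([1, 2, 4, 7]))
print(null_positions(["a", "?", "", " ", None]))
```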
• Use commercial tools
– Data scrubbing: use simple domain knowledge (e.g., postal codes,
spell-check) to detect errors and make corrections
– Data auditing: analyze the data to discover rules and relationships
and to detect violators (e.g., correlation and clustering to find outliers)
• Data migration and integration
– Data migration tools: allow transformations to be specified
– ETL (Extraction/Transformation/Loading) tools: allow users to
specify transformations through a graphical user interface
• Data integration is the merging of data from multiple
data stores.
• Careful integration helps avoid and reduce redundancies and
inconsistencies in the resulting data set.
• Schema integration: [integrate metadata from different sources]
• Entity identification problem: [identify real-world entities from
multiple data sources]
• Redundancy analysis: [an attribute may be redundant, which
can be detected by correlation analysis]
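For numeric attributes, redundancy analysis by correlation can be sketched with the Pearson correlation coefficient. The attribute names, the sample values, and the 0.95 threshold below are assumptions for illustration; the slide does not specify which correlation measure or cutoff to use.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric attributes."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return cov / (sd_x * sd_y)

def is_redundant(xs, ys, threshold=0.95):
    """Flag the pair as likely redundant when |r| reaches the assumed threshold."""
    return abs(pearson(xs, ys)) >= threshold

# Hypothetical attributes from two sources: the same price in two units.
price_usd = [1.0, 2.5, 4.0, 5.5]
price_cents = [100, 250, 400, 550]
print(is_redundant(price_usd, price_cents))
```

A strong correlation only suggests redundancy; whether one attribute can really be dropped still depends on what the attributes mean.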
• Data reduction is applied to obtain a reduced representation of the
data set.
• Data reduction strategies include:
– Dimensionality reduction:
Remove unimportant attributes.
Its methods include wavelet transforms and principal components
analysis (PCA), which transform the original data onto a smaller
space.
– Numerosity reduction:
Replace the original data volume by alternative, smaller forms of
data representation.
– Data compression:
Transformations are applied to obtain a reduced or
“compressed” representation of the original data.
• If the original data can be reconstructed from the compressed data
without any information loss, the data reduction is called “lossless”.
• If we can reconstruct only an approximation of the original data,
the data reduction is called “lossy”.
• Dimensionality reduction and numerosity reduction
techniques can also be considered forms of “data
compression”.
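The difference between lossless and lossy reduction can be illustrated with a small sketch. It assumes hypothetical sensor readings, uses Python's standard `zlib` module for the lossless case, and uses rounding to one decimal place as a deliberately simple example of a lossy representation.

```python
import json
import zlib

def lossless_roundtrip(values):
    """Compress with zlib, decompress, and return (restored, raw_size, packed_size)."""
    raw = json.dumps(values).encode("utf-8")
    packed = zlib.compress(raw)
    restored = json.loads(zlib.decompress(packed).decode("utf-8"))
    return restored, len(raw), len(packed)

def lossy_reduce(values, digits=1):
    """Keep only `digits` decimal places: smaller, but only an approximation."""
    return [round(v, digits) for v in values]

# Hypothetical readings, repeated to give the compressor something to work with.
readings = [20.14, 20.17, 20.21, 20.19, 20.16] * 200

restored, raw_size, packed_size = lossless_roundtrip(readings)
approx = lossy_reduce(readings)

print(restored == readings)   # lossless: the original data is recovered exactly
print(raw_size, packed_size)  # sizes in bytes before and after compression
print(approx == readings)     # lossy: only an approximation is recovered
```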
[Figure: Data compression. Original data → compressed data; lossless compression restores the original data, lossy compression restores only an approximation.]
• Data transformation routines convert the data into appropriate
forms for mining.
• Strategies for data transformation include:
– Smoothing: remove noise from the data.
– Attribute/feature construction: new attributes are constructed
from the given ones to help the mining process.
– Aggregation: summarization or data cube construction, e.g., daily
sales are aggregated to compute monthly or annual total amounts.
– Normalization: values are scaled to fall within a smaller, specified
range, e.g., min-max normalization (0.1 to 1.0 or 0.0 to 1.0).
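Min-max normalization maps a value v from the observed range [min, max] to a new range [new_min, new_max] as v' = (v - min) / (max - min) * (new_max - new_min) + new_min. A minimal sketch, with a hypothetical function name and sample values:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale `values` so they fall within [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        raise ValueError("all values are equal; min-max scaling is undefined")
    span = old_max - old_min
    return [(v - old_min) / span * (new_max - new_min) + new_min for v in values]

incomes = [12000, 55000, 98000]  # hypothetical attribute values
print(min_max_normalize(incomes))
print(min_max_normalize(incomes, new_min=0.1, new_max=1.0))
```

Because the mapping depends on the observed minimum and maximum, a later value outside that range would fall outside the target range as well.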
• Data discretization transforms numeric data by mapping values to
interval or concept labels.
• Discretization and concept hierarchy generation can also be useful,
where raw values for attributes are replaced by ranges or higher
conceptual levels.
• For example, raw values of a numeric attribute (e.g., age) are replaced
by interval labels (e.g., 0-10, 11-20, etc.) or higher-level concepts
(e.g., youth, adult, senior).
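The age example can be sketched as two small mapping functions. The interval labels follow the 0-10, 11-20 pattern from the slide; the cut-off ages for youth, adult, and senior are assumptions, since the slide does not give them, and ages are assumed to be non-negative integers.

```python
def interval_label(age, width=10):
    """Map an integer age to an interval label: 0-10, 11-20, 21-30, ..."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age <= width:
        return f"0-{width}"
    lower = ((age - 1) // width) * width + 1
    return f"{lower}-{lower + width - 1}"

def concept_label(age, youth_max=20, adult_max=59):
    """Map an age to a higher-level concept (cut-offs are assumed, not from the slide)."""
    if age <= youth_max:
        return "youth"
    if age <= adult_max:
        return "adult"
    return "senior"

for age in [7, 11, 20, 35, 70]:
    print(age, interval_label(age), concept_label(age))
```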
• Three types of attributes
– Nominal: values from an unordered set, e.g., color, profession
– Ordinal: values from an ordered set, e.g., military or academic rank
– Numeric: quantitative values, e.g., integers or real numbers

• Discretization:
Divide the range of a continuous attribute into intervals
– Interval labels can then be used to replace actual data values
– Reduce data size by discretization
– Supervised vs. unsupervised
– Split (top-down) vs. merge (bottom-up)
– Discretization can be performed recursively on an attribute
– Prepare for further analysis, e.g., classification
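One simple unsupervised way to divide a continuous range into intervals is equal-width partitioning, sketched below. The slide does not name a specific method, so the choice of equal-width intervals, the function name, and the sample values are assumptions.

```python
def equal_width_discretize(values, k):
    """Split the range of `values` into k equal-width intervals and
    return, for each value, the (low, high) interval it falls into."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; no range to divide")
    width = (hi - lo) / k
    intervals = [(lo + i * width, lo + (i + 1) * width) for i in range(k)]
    labels = []
    for v in values:
        # The maximum value belongs to the last interval, not a (k+1)-th one.
        index = min(int((v - lo) / width), k - 1)
        labels.append(intervals[index])
    return labels

print(equal_width_discretize([0, 5, 10, 15, 20], 2))
```

Applying the same function again to the values inside one interval gives a recursive, top-down split of the attribute.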
Although numerous methods of data preprocessing have been
developed, data preprocessing remains an active area of research
due to the huge amount of inconsistent or dirty data and the
complexity of the problem.