Data Preprocessing
By
S.Dinesh Babu
II MCA
Definition
• Data preprocessing is a data mining technique that transforms raw data into an understandable format.
• Real-world data is dirty: it is often incomplete, noisy, and inconsistent.
Measures for data quality: a multidimensional view
◦ Accuracy: correct or wrong, accurate or not
◦ Completeness: not recorded, unavailable, …
◦ Consistency: some modified but some not, dangling, …
◦ Timeliness: is the data updated in a timely way?
◦ Believability: how much can the data be trusted to be correct?
◦ Interpretability: how easily can the data be understood?
Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation and Data Discretization
Data Cleaning: Incomplete
• Data is not always available.
• Example: Age = " " (blank)
• Missing data may be due to:
◦ equipment malfunction
◦ data that was inconsistent with other recorded data and was therefore deleted
◦ data not entered due to misunderstanding
◦ data not considered important at the time of entry
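A minimal sketch of how missing values like the blank age above might be detected. The records, the attribute names, and the helper functions `is_missing` and `missing_counts` are hypothetical; the sketch assumes that `None` and blank strings both mean "not recorded".

```python
# Hypothetical records; None and blank strings both stand for "not recorded".
records = [
    {"name": "Asha", "age": "34", "city": "Chennai"},
    {"name": "Ravi", "age": " ", "city": "Madurai"},
    {"name": "Mena", "age": None, "city": ""},
]


def is_missing(value):
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def missing_counts(rows):
    """Count missing values per attribute across all rows."""
    counts = {}
    for row in rows:
        for attribute, value in row.items():
            counts[attribute] = counts.get(attribute, 0) + int(is_missing(value))
    return counts


counts = missing_counts(records)
print(counts)
```

A per-attribute count like this is only a first look; how the gaps are then filled or dropped depends on the application.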
Noisy Data
• Noisy data contains errors or meaningless values; it is often unstructured.
• It increases the amount of storage space required.
Causes:
• Hardware failure
• Programming errors
Data Cleaning as a Process
• Missing values, noise, and inconsistencies contribute to inaccurate data.
• The first step in data cleaning as a process is discrepancy detection.
• Discrepancies can be caused by several factors:
◦ poorly designed data entry forms
◦ human error in data entry
The data should also be examined regarding the following rules:
◦ Unique rule: each value of the attribute must be different from all other values of that attribute.
◦ Consecutive rule: there are no missing values between the lowest and highest values of the attribute.
◦ Null rule: specifies the use of blanks, question marks, special characters, or other strings that indicate the null condition, and how such values should be handled.
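The three rules above can be sketched as simple checks. This is an illustration under stated assumptions, not a standard API: the function names, the sample check numbers, and the set of null markers (`""`, `"?"`, `"N/A"`) are all hypothetical, and the consecutive-rule check assumes integer values.

```python
def violates_unique_rule(values):
    """Return True if any value appears more than once."""
    return len(set(values)) != len(values)


def missing_in_sequence(values):
    """Integers absent between the lowest and highest value (consecutive rule)."""
    present = set(values)
    return [n for n in range(min(present), max(present) + 1) if n not in present]


def null_markers(values, markers=("", "?", "N/A")):
    """Positions whose value is one of the agreed null markers (null rule)."""
    return [i for i, v in enumerate(values) if v in markers]


check_numbers = [1001, 1002, 1004, 1004, 1006]  # hypothetical check numbers
ages = ["34", "?", "", "27"]                    # hypothetical raw entries

print(violates_unique_rule(check_numbers))
print(missing_in_sequence(check_numbers))
print(null_markers(ages))
```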
Data Integration
• Data integration is the merging of data from multiple data stores.
• Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data set.
• This can improve the accuracy and speed of the subsequent data mining process.
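As a rough sketch of integration, the example below merges two hypothetical data stores keyed by customer id and reports attribute values that disagree. The stores `crm` and `billing`, the function `integrate`, and the policy of letting the first store win on conflict are assumptions made for illustration only.

```python
# Two hypothetical data stores keyed by customer id.
crm = {
    1: {"name": "Asha", "city": "Chennai"},
    2: {"name": "Ravi", "city": "Madurai"},
}
billing = {
    2: {"name": "Ravi", "city": "Salem"},
    3: {"name": "Mena", "city": "Erode"},
}


def integrate(primary, secondary):
    """Merge two stores; on conflict keep the primary value and report it."""
    merged, conflicts = {}, []
    for key in sorted(set(primary) | set(secondary)):
        a, b = primary.get(key, {}), secondary.get(key, {})
        merged[key] = {**b, **a}  # primary values override secondary ones
        for attr in sorted(set(a) & set(b)):
            if a[attr] != b[attr]:
                conflicts.append((key, attr, a[attr], b[attr]))
    return merged, conflicts


merged, conflicts = integrate(crm, billing)
print(merged)
print(conflicts)
```

Each customer appears once in `merged`, so the redundancy is removed, while `conflicts` lists the inconsistencies that still need a decision.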
Data Reduction
• Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results.
Strategies for data reduction include the following:
• Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube.
• Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed.
• Dimensionality reduction, where encoding mechanisms are used to reduce the data set size.
• Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations such as:
◦ parametric models
◦ nonparametric methods such as clustering, sampling, and the use of histograms
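Two of the nonparametric numerosity reduction methods named above, histograms and sampling, can be sketched as follows. The price list and the function `equal_width_histogram` are hypothetical, and the sketch assumes the attribute is not constant (otherwise the bucket width would be zero).

```python
import random


def equal_width_histogram(values, bins):
    """Replace raw values by bucket counts over equal-width intervals."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bucket.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return counts


prices = [1, 1, 5, 5, 5, 8, 10, 10, 14, 15, 18, 20, 21, 25, 30]  # hypothetical
histogram = equal_width_histogram(prices, bins=3)

# Simple random sample without replacement; the seed is fixed only for repeatability.
sample = random.Random(0).sample(prices, k=5)

print(histogram)
print(sample)
```

Fifteen values are reduced to three bucket counts or to five sampled values; which representation is acceptable depends on the analysis that follows.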
Data Transformation
• In data transformation, the data are transformed or consolidated into forms appropriate for mining.
Data transformation can involve the following:
• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: values are scaled to fall within a small, specified range
◦ e.g., min-max normalization
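Min-max normalization maps a value v of an attribute with observed minimum min and maximum max to v' = (v - min) / (max - min) * (new_max - new_min) + new_min. A minimal sketch, assuming a target range of [0, 1] by default; the function name `min_max_normalize` and the income figures are hypothetical.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they fall within [new_min, new_max]."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("min-max normalization is undefined for a constant attribute")
    return [
        (v - low) / (high - low) * (new_max - new_min) + new_min
        for v in values
    ]


incomes = [12000, 73600, 98000]  # hypothetical income values
normalized = min_max_normalize(incomes)
print(normalized)
```

The smallest value maps to the lower bound and the largest to the upper bound, so a single extreme outlier can compress all other values into a narrow band.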
Data Discretization
• Discretization: divide the range of a continuous attribute into intervals
◦ Interval labels can then be used to replace actual data values
◦ Discretization reduces data size
◦ Split (top-down) vs. merge (bottom-up)
◦ Discretization can be performed recursively on an attribute
◦ Prepares the data for further analysis, e.g., classification
• Three types of attributes
◦ Nominal: values from an unordered set, e.g., color, profession
◦ Ordinal: values from an ordered set, e.g., military or academic rank
◦ Numeric: quantitative values, e.g., integers or real numbers
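Discretization in effect turns a numeric attribute into an ordinal one. The sketch below uses equal-width binning, which is only one of several possible schemes and is chosen here as an assumption; the function `discretize_equal_width` and the sample ages are hypothetical.

```python
def discretize_equal_width(values, bins):
    """Replace each numeric value by the label of its equal-width interval."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    labels = []
    for v in values:
        # The maximum value belongs to the last (closed) interval.
        index = min(int((v - low) / width), bins - 1)
        start = low + index * width
        closing = "]" if index == bins - 1 else ")"
        labels.append(f"[{start:g}, {start + width:g}{closing}")
    return labels


ages = [20, 25, 35, 41, 50, 60]  # hypothetical numeric attribute
labels = discretize_equal_width(ages, bins=4)
print(labels)
```

Six distinct ages collapse to four interval labels, which keep their order and can be used in place of the raw values.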
