Data Preprocessing
MS. T.K. ANUSUYA
ASSISTANT PROFESSOR
DEPARTMENT OF COMPUTER SCIENCE
BON SECOURS COLLEGE FOR WOMEN, THANJAVUR.
Why Data Pre-processing?
• Real-world data is often:
  • Noisy: it contains errors or outliers
  • Incomplete: attribute values are missing, e.g., name=""
  • Duplicated: the same tuple appears more than once
  • Inconsistent: discrepancies arise, partly because of the typically huge size of the data
• Low-quality data leads to low-quality mining results.
• Data comes from different sources, so it needs extraction, cleaning, and transformation before mining.
Data Pre-processing
Multidimensional Measures of Data Quality
• Accuracy
• Completeness
• Consistency
• Timeliness
• Believability
• Interpretability
Data Pre-processing Techniques
• Data cleaning
• Data integration
• Data reduction
• Data transformation
Data Pre-processing Techniques
• Data cleaning
  • Handles missing values, noisy data, and outliers in dirty data
• Data integration
  • Integrates multiple databases, data cubes, or files
• Data transformation
  • Normalization and aggregation
• Data reduction
  • Reduces data size by compressing, aggregating, or eliminating redundant features
  • Dimensionality reduction: removes irrelevant attributes
  • Numerosity reduction: replaces the data with smaller alternative representations, either parametric models (regression, log-linear models) or non-parametric models (e.g., histograms, clusters, sampling, and data aggregation)
Data Cleaning
Data cleaning fills in missing values, smooths noisy data while identifying outliers, and corrects inconsistencies in the data.
• Missing values
  • Ignore the tuple: usually done when the class label is missing
  • Fill in the missing value manually: tedious and often infeasible
  • Use a global constant to fill in the missing value, e.g., "unknown", which the mining program may then treat as a new class
  • Use a measure of central tendency for the attribute (mean or median)
  • Use the attribute mean or median of all samples belonging to the same class as the given tuple
  • Use the most probable value to fill in the missing value, inferred with regression, a Bayesian formula, or decision trees
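The class-wise mean strategy from the list above can be sketched in a few lines of Python. This is only an illustration: the function name `fill_missing_with_class_mean`, the attribute names, and the sample tuples are assumptions made for the example and do not come from the slides.

```python
from statistics import mean

def fill_missing_with_class_mean(rows, attr, label):
    """Return copies of rows where a missing (None) attr is replaced by the
    mean of attr over all rows that share the same class label."""
    # Collect the known values of attr for every class.
    known = {}
    for row in rows:
        if row[attr] is not None:
            known.setdefault(row[label], []).append(row[attr])
    class_means = {cls: mean(vals) for cls, vals in known.items()}

    filled = []
    for row in rows:
        new_row = dict(row)
        if new_row[attr] is None:
            # Raises KeyError if a class has no known value at all;
            # a real pipeline would need a fallback for that case.
            new_row[attr] = class_means[new_row[label]]
        filled.append(new_row)
    return filled

# Hypothetical tuples: income is the attribute, risk is the class label.
rows = [
    {"income": 30000, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 50000, "risk": "low"},
    {"income": 90000, "risk": "high"},
    {"income": None, "risk": "high"},
]
filled = fill_missing_with_class_mean(rows, "income", "risk")
```

Using the median instead only requires swapping `mean` for `statistics.median`, which is less sensitive to outliers.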
Data Cleaning
• Noisy data
  • Noise is a random error or variance in a measured variable.
  • Binning: sort the data, partition it into bins, then smooth by bin means, bin medians, or bin boundaries
  • Clustering: detect and remove outliers
  • Semi-automated inspection: combined computer and manual intervention
  • Regression: smooth the data by fitting it to regression functions
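Binning is easier to follow with numbers. The sketch below assumes equal-frequency bins of three values each and a small made-up list of prices; the helper names are illustrative, not part of the slides.

```python
def equal_frequency_bins(values, bin_size):
    """Sort the values and cut them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum or maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

# Illustrative data (an assumption for this example).
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)            # [[9.0, 9.0, 9.0], [22.0, ...], [29.0, ...]]
by_boundaries = smooth_by_boundaries(bins)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```

Smoothing by medians follows the same pattern, with the bin median in place of the bin mean.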
Data Integration
• Data integration
  • Merges data from multiple data stores
  • Helps reduce and avoid redundancies and inconsistencies in the resulting data set
  • Can improve the accuracy and speed of the subsequent mining process
• Entity identification problem
  • Identify the same real-world entity across multiple data sources, e.g., Bill Clinton = William Clinton
• Redundant attributes can often be detected by correlation analysis and covariance analysis
Correlation Analysis (Nominal Data)
• χ² (chi-square) test:
  \[ \chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} \]
• The larger the χ² value, the more likely the variables are related.
• The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count.
• Correlation does not imply causality.
  • The number of hospitals and the number of car thefts in a city are correlated.
  • Both are causally linked to a third variable: population.
Chi-square Calculation: Example

                           Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)     200 (360)        450
Not like science fiction   50 (210)     1000 (840)       1050
Sum (col.)                 300          1200             1500

Numbers in parentheses are expected counts, calculated from the data distribution in the two categories (for example, 450 × 300 / 1500 = 90).

\[ \chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93 \]

This shows that like_science_fiction and play_chess are correlated in the group.
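The calculation can be checked with a short script. This is a minimal sketch in plain Python; the function name `chi_square` is an assumption for the example, and the observed counts are the ones from the table.

```python
def chi_square(observed):
    """Return (statistic, expected counts) for a contingency table
    given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    statistic = 0.0
    expected = []
    for i, row in enumerate(observed):
        expected_row = []
        for j, obs in enumerate(row):
            # Expected count under independence of the two attributes.
            exp = row_totals[i] * col_totals[j] / total
            expected_row.append(exp)
            statistic += (obs - exp) ** 2 / exp
        expected.append(expected_row)
    return statistic, expected

# Rows: like / not like science fiction; columns: play / not play chess.
observed = [[250, 200], [50, 1000]]
statistic, expected = chi_square(observed)
```

The script reproduces the expected counts in parentheses and a statistic of about 507.94, which the slide truncates to 507.93.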
Correlation Analysis (Numerical Data)
• Correlation coefficient (also called Pearson's product-moment coefficient):
  \[ r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n\,\bar{A}\,\bar{B}}{(n-1)\,\sigma_A \sigma_B} \]
  where n is the number of tuples, \bar{A} and \bar{B} are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ(AB) is the sum of the AB cross-product.
• If r_{A,B} > 0, A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation.
• r_{A,B} = 0: no linear correlation (this alone does not establish independence); r_{A,B} < 0: negatively correlated.
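As a rough illustration, the coefficient can be computed directly from the cross-product form of the formula. The sketch assumes that σ_A and σ_B are sample standard deviations (with n − 1 in the denominator), which is what makes the (n − 1) factor yield the usual Pearson value; the function name and the sample data are made up for the example.

```python
from math import sqrt

def pearson_r(a, b):
    """Pearson correlation via (sum(AB) - n*mean_A*mean_B) / ((n-1)*sd_A*sd_B)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Sample standard deviations (assumption: n - 1 in the denominator).
    sd_a = sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    cross_product = sum(x * y for x, y in zip(a, b))
    return (cross_product - n * mean_a * mean_b) / ((n - 1) * sd_a * sd_b)

# Illustrative attributes: B is exactly 2*A, C decreases as A increases.
A = [1, 2, 3, 4, 5]
B = [2, 4, 6, 8, 10]
C = [10, 8, 6, 4, 2]
r_ab = pearson_r(A, B)  # close to +1
r_ac = pearson_r(A, C)  # close to -1
```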
Data Reduction
• Obtains a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the base data
• Data cube aggregation
• Dimensionality reduction: reducing the number of random variables or attributes under consideration (e.g., wavelet transforms)
• Numerosity reduction: regression and log-linear models, histograms, clustering, sampling, and data cube aggregation
• Data compression
Wavelet Transform
• Data are transformed so as to preserve the relative distance between objects at different levels of resolution.
• Commonly used for image compression
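The slide does not name a particular wavelet. As one possible illustration, the sketch below uses the unnormalized Haar wavelet, which repeatedly replaces pairs of values by their average and half-difference. The function name and the input vector are assumptions for the example.

```python
def haar_transform(values):
    """Unnormalized Haar decomposition.
    Assumes len(values) is a power of two."""
    coefficients = []
    current = list(values)
    while len(current) > 1:
        pairs = list(zip(current[0::2], current[1::2]))
        averages = [(a + b) / 2 for a, b in pairs]  # coarser resolution
        details = [(a - b) / 2 for a, b in pairs]   # what the averages lose
        coefficients = details + coefficients       # coarser details go first
        current = averages
    # Overall average first, then detail coefficients from coarse to fine.
    return current + coefficients

signal = [2, 2, 0, 2, 3, 5, 4, 4]
transformed = haar_transform(signal)
# transformed == [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```

Data reduction then comes from storing only the largest coefficients and treating the small ones as zero, at the cost of an approximate reconstruction.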
Numerosity Reduction
• Reduces data volume by choosing alternative, smaller forms of data representation
• Parametric methods (e.g., regression)
  • Assume the data fits a model, estimate the model parameters, and store only the parameters
  • Linear regression: fits the data to a straight line
  • Multiple regression: models a response variable as a linear function of a multidimensional feature vector
  • Log-linear models: approximate discrete multidimensional probability distributions
• Non-parametric methods
  • Do not assume a model (histograms, clustering, sampling, ...)
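To make the parametric idea concrete, the sketch below fits a straight line by ordinary least squares and keeps only the two parameters in place of the raw points. The function name `fit_line` and the data are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Illustrative points that lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)

# The five (x, y) pairs can now be approximated from two stored numbers.
approx = [slope * x + intercept for x in xs]
```

With real data the fit is not exact, so the reduction trades some accuracy for a much smaller representation.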
Histograms
• A popular data reduction technique
• Divide the data into buckets (e.g., equal-width or equal-frequency) and store a summary such as the average or count for each bucket
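A minimal equal-width histogram might look like the following. The function name, the choice of three buckets, and the values are assumptions for the example; an equal-frequency variant would sort the values and cut them into groups of equal size instead.

```python
def equal_width_histogram(values, num_buckets):
    """Summarize values as num_buckets equal-width buckets,
    storing only the range, count, and average of each bucket.
    Assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    buckets = [[] for _ in range(num_buckets)]
    for v in values:
        # The maximum value would land one past the end, so clamp it.
        index = min(int((v - lo) / width), num_buckets - 1)
        buckets[index].append(v)
    return [
        {
            "low": lo + i * width,
            "high": lo + (i + 1) * width,
            "count": len(b),
            "average": sum(b) / len(b) if b else None,
        }
        for i, b in enumerate(buckets)
    ]

values = [0, 4, 5, 11, 12, 15, 18, 21, 25, 30]
histogram = equal_width_histogram(values, 3)
counts = [bucket["count"] for bucket in histogram]  # [3, 4, 3]
```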
Data Cube Aggregation
• The lowest level of a data cube is the base cuboid.
• The highest level of abstraction is the apex cuboid.
• Data cubes hold multiple levels of aggregation.
• They provide fast access to precomputed, summarized data.
• Aggregation reduces the size of the data to be handled.
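The levels of aggregation can be sketched as a roll-up over fewer and fewer dimensions. The function `roll_up`, the dimension names, and the sales figures below are all made up for illustration.

```python
from collections import defaultdict

def roll_up(records, dimensions, measure):
    """Sum the measure over all records that agree on the given dimensions."""
    totals = defaultdict(int)
    for record in records:
        key = tuple(record[d] for d in dimensions)
        totals[key] += record[measure]
    return dict(totals)

# Hypothetical quarterly sales (in thousands).
sales = [
    {"year": 2008, "quarter": "Q1", "sales": 224},
    {"year": 2008, "quarter": "Q2", "sales": 408},
    {"year": 2008, "quarter": "Q3", "sales": 350},
    {"year": 2008, "quarter": "Q4", "sales": 586},
    {"year": 2009, "quarter": "Q1", "sales": 300},
    {"year": 2009, "quarter": "Q2", "sales": 400},
    {"year": 2009, "quarter": "Q3", "sales": 500},
    {"year": 2009, "quarter": "Q4", "sales": 600},
]

base = roll_up(sales, ["year", "quarter"], "sales")  # lowest level: 8 cells
by_year = roll_up(sales, ["year"], "sales")          # aggregated: 2 cells
apex = roll_up(sales, [], "sales")                   # highest level: 1 cell
```

If the mining task only needs annual totals, the two-cell `by_year` view can stand in for the eight quarterly records.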
Data Transformation
• A pre-processing step in which data are transformed or consolidated so that the resulting mining process may be more efficient and the patterns found may be easier to understand
• Smoothing: remove noise from the data (binning, regression, and clustering)
• Attribute construction: new attributes are constructed from the given ones
• Aggregation: summarization, data cube construction
• Normalization: scale values to a smaller range (min-max, z-score)
• Discretization: replace raw values with interval or concept labels (concept hierarchy climbing)
• Concept hierarchy generation for nominal data
Normalization
• Min-max normalization to [new_min_A, new_max_A]:
  \[ v' = \frac{v - \min_A}{\max_A - \min_A}\,(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A \]
  • Example: let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,000 is mapped to (73,000 − 12,000) / (98,000 − 12,000) × (1.0 − 0.0) + 0.0 ≈ 0.709.
• Z-score normalization (μ_A: mean, σ_A: standard deviation):
  \[ v' = \frac{v - \mu_A}{\sigma_A} \]
• Normalization by decimal scaling:
  \[ v' = \frac{v}{10^{j}} \]
  where j is the smallest integer such that max(|v'|) < 1.
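The three formulas translate almost directly into code. In the sketch below, the function names and the z-score and decimal-scaling inputs are assumptions for illustration; the min-max call uses the income example from the slide.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] linearly onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    """Express v as a number of standard deviations from the mean."""
    return (v - mu) / sigma

def decimal_scaling(values):
    """Divide by the smallest power of ten that brings every |v'| below 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

income = min_max(73000, 12000, 98000)        # about 0.709
z = z_score(73600, mu=54000, sigma=16000)    # hypothetical mean and deviation
scaled, j = decimal_scaling([-986, 917])     # hypothetical values, j becomes 3
```

Min-max normalization depends on the observed minimum and maximum, so a later value outside that range falls outside the target interval; z-score normalization avoids that dependence.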
Data Discretization
• Three types of attributes:
  • Nominal: values from an unordered set, e.g., color, profession
  • Ordinal: values from an ordered set, e.g., military or academic rank
  • Continuous: numeric values, e.g., integers or real numbers
• Discretization:
  • Divides the range of a continuous attribute into intervals
  • Is needed because some classification algorithms only accept categorical attributes
  • Reduces data size
Data Discretization
• Discretization
  • Reduces the number of values of a given continuous attribute by dividing the range of the attribute into intervals
  • Interval labels can then be used to replace actual data values
  • Supervised vs. unsupervised
  • Split (top-down) vs. merge (bottom-up)
  • Can be performed recursively on an attribute
• Concept hierarchy formation
  • Recursively reduces the data by collecting low-level concepts (such as numeric values for age) and replacing them with higher-level concepts (such as young, middle-aged, or senior)
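Replacing numeric ages by concept labels can be sketched as a lookup against a sorted list of cut points. The cut points 35 and 60 and the helper name `discretize` are assumptions chosen for the example; the slides do not specify where the intervals begin or end.

```python
from bisect import bisect_right

def discretize(value, cut_points, labels):
    """Return the label of the interval that contains value.
    cut_points must be sorted, and len(labels) == len(cut_points) + 1."""
    return labels[bisect_right(cut_points, value)]

# Hypothetical intervals: below 35, from 35 up to (not including) 60, 60 and above.
cut_points = [35, 60]
labels = ["young", "middle-aged", "senior"]

ages = [13, 22, 35, 47, 60, 71]
age_groups = [discretize(age, cut_points, labels) for age in ages]
# ['young', 'young', 'middle-aged', 'middle-aged', 'senior', 'senior']
```

Six distinct numeric values are reduced to three labels, and the labels can in turn be grouped into a coarser level of the hierarchy.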
Data Discretization Methods
• Typical methods (all can be applied recursively):
  • Binning: top-down split, unsupervised
  • Histogram analysis: top-down split, unsupervised
  • Clustering analysis: unsupervised, top-down split or bottom-up merge
  • Decision-tree analysis: supervised, top-down split
  • Correlation (e.g., χ²) analysis: bottom-up merge; χ²-based methods such as ChiMerge use class labels, so they are supervised
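As an illustration of a supervised, top-down split, the sketch below picks the single cut point that minimizes the weighted class entropy of the two resulting intervals, which is the criterion a decision tree typically applies to a numeric attribute. The function names and the data are assumptions for the example; applying `best_split` again to each interval would give the recursive version.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum((c / total) * log2(total / c) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted entropy) of the best binary split."""
    pairs = sorted(zip(values, labels))
    best_threshold, best_entropy = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut point between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if weighted < best_entropy:
            best_threshold, best_entropy = threshold, weighted
    return best_threshold, best_entropy

# Hypothetical ages with a class label that flips between 30 and 45.
ages = [20, 25, 30, 45, 50, 60]
buys = ["no", "no", "no", "yes", "yes", "yes"]
threshold, weighted_entropy = best_split(ages, buys)  # threshold is 37.5
```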
