WHY DATA PREPROCESSING?
Data in the real world is dirty
Incomplete: missing attribute values, lacking certain attributes of interest, or containing only aggregate data
- e.g., occupation=“”
Noisy: containing errors or outliers
- e.g., Salary=“-10”
Inconsistent: containing discrepancies in codes or names
- e.g., Age=“42” but Birthday=“03/07/1997”
- e.g., rating was “1, 2, 3”, now rating is “A, B, C”
- e.g., discrepancies between duplicate records
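The three kinds of dirtiness above can be checked mechanically. The following is a minimal sketch, not a complete validator: the record layout, the field names, the helper `find_issues`, and the reading of “03/07/1997” as 3 July 1997 are all assumptions made for illustration.

```python
from datetime import date

def find_issues(record, today):
    """Return a list of data-quality problems found in one record (a dict)."""
    issues = []
    # Incomplete: the attribute is missing or empty.
    if not record.get("occupation"):
        issues.append("incomplete: occupation is empty")
    # Noisy: a salary cannot be negative.
    salary = record.get("salary")
    if salary is not None and salary < 0:
        issues.append("noisy: negative salary")
    # Inconsistent: the stated age disagrees with the age derived from the birthday.
    age, birthday = record.get("age"), record.get("birthday")
    if age is not None and birthday is not None:
        had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
        derived_age = today.year - birthday.year - (0 if had_birthday else 1)
        if derived_age != age:
            issues.append("inconsistent: age does not match birthday")
    return issues

# Hypothetical record combining the three examples from the slide.
record = {"occupation": "", "salary": -10, "age": 42, "birthday": date(1997, 7, 3)}
issues = find_issues(record, today=date(2024, 1, 1))
```

Passing `today` explicitly keeps the age check reproducible instead of depending on the current date.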
MAJOR TASKS IN DATA PREPROCESSING
Data cleaning
- Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration
- Integration of multiple databases or files
Data transformation
- Normalization and aggregation
Data reduction
- Obtains a representation that is reduced in volume but produces the same or similar analytical results
Data discretization (for numerical data)
DATA CLEANING
Importance
- “Data cleaning is the number one problem in data warehousing”
Data cleaning tasks: the routine attempts to
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data
- Resolve redundancy caused by data integration
MISSING DATA
Data is not always available
- e.g., many tuples have no recorded value for several attributes, such as customer income in sales data
Missing data may be due to
- equipment malfunction
- data that was inconsistent with other recorded data and was therefore deleted
- data not entered due to misunderstanding
- data not considered important at the time of entry
- history or changes of the data not being recorded
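One simple way to fill in a missing numeric value is to substitute the mean of the recorded values. The slides do not prescribe a method, so mean imputation is an assumption here, as are the helper name `fill_missing_with_mean` and the sample incomes.

```python
from statistics import mean

def fill_missing_with_mean(values):
    """Replace None entries with the mean of the recorded values."""
    known = [v for v in values if v is not None]
    if not known:
        return list(values)  # nothing to estimate from; leave the gaps as they are
    fill = mean(known)
    return [fill if v is None else v for v in values]

# Hypothetical customer incomes; None marks a tuple with no recorded value.
incomes = [30000, None, 50000, None, 40000]
filled = fill_missing_with_mean(incomes)
```

The mean is easy to compute but is sensitive to outliers; the median or a class-specific mean are common alternatives.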
DATA INTEGRATION
Data integration:
- Combines data from multiple sources (data cubes, multiple databases, or flat files)
Issues during data integration
- Schema integration
  - Integrate metadata (data about the data) from different sources
- Entity identification problem: identify real-world entities across multiple data sources, e.g., do A.cust-id and B.cust-# refer to the same entity?
- Detecting and resolving data value conflicts
  - For the same real-world entity, attribute values from different sources differ, e.g., different scales, metric vs. British units
- Removing duplicates and redundant data
  - An attribute may be derivable from another table (e.g., annual revenue)
  - Inconsistencies in attribute naming
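The integration issues above can be illustrated with two small sources that name the customer key differently and record weight in different units. This is only a sketch: the tables, the field names, and the rule that source A wins on duplicates are hypothetical, and it assumes that `cust_id` and `cust_no` really do identify the same entities.

```python
LB_TO_KG = 0.45359237  # definition of the international pound in kilograms

# Hypothetical sources: A uses cust_id and kilograms, B uses cust_no and pounds.
source_a = [{"cust_id": 1, "weight_kg": 70.0}]
source_b = [{"cust_no": 1, "weight_lb": 154.0}, {"cust_no": 2, "weight_lb": 200.0}]

def integrate(a_rows, b_rows):
    """Merge both sources into one table keyed by customer id, with weights in kg."""
    merged = {}
    for row in a_rows:
        merged[row["cust_id"]] = {"cust_id": row["cust_id"], "weight_kg": row["weight_kg"]}
    for row in b_rows:
        key = row["cust_no"]  # schema integration: treated as the same key as cust_id
        if key in merged:
            continue  # duplicate entity: keep the value already taken from source A
        # Resolve the unit conflict by converting pounds to kilograms.
        merged[key] = {"cust_id": key, "weight_kg": round(row["weight_lb"] * LB_TO_KG, 2)}
    return [merged[k] for k in sorted(merged)]

customers = integrate(source_a, source_b)
```

In practice, deciding whether two keys denote the same entity usually needs metadata or matching rules rather than a plain equality test.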
DATA TRANSFORMATION
Smoothing: remove noise from the data (binning, clustering, regression)
Normalization: scale values to fall within a small, specified range, such as -1.0 to 1.0 or 0.0 to 1.0
Attribute/feature construction
- New attributes are constructed from the given ones and added
Aggregation: apply summarization or aggregation operations to the data
Generalization: concept hierarchy climbing
- Low-level (primitive, raw) data are replaced by higher-level concepts
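Normalization into a specified range can be sketched with min-max scaling. The slide does not name a particular formula, so min-max scaling is an assumption, and the function name and sample salaries are illustrative.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [new_min for _ in values]  # constant attribute: no spread to rescale
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

# Hypothetical salaries.
salaries = [20000, 30000, 50000, 60000]
unit_range = min_max_normalize(salaries)             # scaled into 0.0 to 1.0
signed_range = min_max_normalize(salaries, -1.0, 1.0)  # scaled into -1.0 to 1.0
```

Because the minimum and maximum define the scale, a single extreme outlier compresses all other values into a narrow band.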
DATA REDUCTION STRATEGIES
Data may be too big to work with: analysis may take too long or be impractical or infeasible
Data reduction techniques
Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
Data reduction strategies
Data cube aggregation: apply aggregation operations to the data (data cube)
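As a small illustration of aggregation as a reduction step, quarterly figures can be rolled up into annual totals. The sales numbers and the function name below are hypothetical.

```python
from collections import defaultdict

# Hypothetical quarterly sales: (year, quarter, amount).
quarterly_sales = [
    (2022, "Q1", 224), (2022, "Q2", 408), (2022, "Q3", 350), (2022, "Q4", 586),
    (2023, "Q1", 300), (2023, "Q2", 410), (2023, "Q3", 390), (2023, "Q4", 600),
]

def aggregate_by_year(rows):
    """Roll quarterly amounts up to one total per year."""
    totals = defaultdict(int)
    for year, _quarter, amount in rows:
        totals[year] += amount
    return dict(totals)

annual_sales = aggregate_by_year(quarterly_sales)
```

Eight rows shrink to two, and any analysis that only needs annual totals gives the same result on the reduced data.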
CLUSTERING
Partition the data set into clusters, and store only the cluster representation
Can be very effective if the data is clustered, but not if the data is “smeared” (spread out)
There are many choices of clustering definitions and clustering algorithms. We will discuss them later.
SAMPLING
Data reduction technique
A large data set is represented by a much smaller random sample or subset.
4 types
Simple random sampling without replacement (SRSWOR)
Simple random sampling with replacement (SRSWR)
Adaptive sampling methods: cluster sampling and stratified sampling
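The sampling schemes can be sketched with Python's standard `random` module. The helper names, the population, the strata, and the 10% sampling fraction are assumptions chosen for illustration; cluster sampling is omitted for brevity.

```python
import random
from collections import defaultdict

def srswor(data, n, rng):
    """Simple random sample WITHOUT replacement: no item is drawn twice."""
    return rng.sample(data, n)

def srswr(data, n, rng):
    """Simple random sample WITH replacement: an item may be drawn again."""
    return [rng.choice(data) for _ in range(n)]

def stratified_sample(data, key, fraction, rng):
    """Draw the same fraction from every stratum so small groups stay represented."""
    strata = defaultdict(list)
    for item in data:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = list(range(100))
without_replacement = srswor(population, 10, rng)
with_replacement = srswr(population, 10, rng)

# Hypothetical skewed data: 80 "young" customers and 20 "senior" customers.
customers = [("young", i) for i in range(80)] + [("senior", i) for i in range(20)]
stratified = stratified_sample(customers, key=lambda c: c[0], fraction=0.1, rng=rng)
```

Stratified sampling is useful when the data is skewed, because a simple random sample may miss a small group entirely.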
DISCRETIZATION AND CONCEPT HIERARCHY
Discretization
- Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.
Concept hierarchies
- Reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior)
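Climbing the concept hierarchy for the attribute age might look like the sketch below. The cut-off ages of 30 and 60 are assumptions; the slides name the concepts but not their boundaries.

```python
def age_to_concept(age):
    """Map a numeric age to a higher-level concept (cut-offs are assumed, not from the slides)."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle-aged"
    return "senior"

# Hypothetical ages.
ages = [23, 35, 47, 61, 29, 70]
labels = [age_to_concept(a) for a in ages]
```

Six distinct numeric values collapse into three labels, which is the reduction the hierarchy is meant to achieve.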
SOME TECHNIQUES
- Binning methods: equal-width, equal-frequency
- Histogram
- Entropy-based methods
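The two binning methods can be sketched as follows. The function names and the sample prices are illustrative, and the way boundary values are assigned (the maximum goes into the last bin) is one possible convention.

```python
def equal_width_bins(values, k):
    """Split the value range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in sorted(values):
        # The maximum value would index bin k, so clamp it into the last bin.
        idx = min(int((v - lo) / width), k - 1) if width > 0 else 0
        bins[idx].append(v)
    return bins

def equal_frequency_bins(values, k):
    """Split the sorted values into k bins holding (nearly) the same number of items."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]

# Hypothetical prices.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
by_width = equal_width_bins(prices, 3)
by_frequency = equal_frequency_bins(prices, 3)
```

Equal-width bins are simple but can end up unevenly filled when the data is skewed, whereas equal-frequency bins adapt their boundaries to the data.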
SUMMARY
Data preparation is a big issue for data mining
Data preparation includes
Data cleaning and data integration
Data reduction and feature selection
Discretization
Many methods have been proposed, but this is still an active area of research

Data preprocessing
