Data Preprocessing

Today’s real-world databases are highly susceptible to noisy,
missing, and inconsistent data due to their typically huge size
(often several gigabytes or more) and their likely origin from
multiple, heterogeneous sources.
Low-quality data will lead to low-quality mining results.
Data preprocessing is the set of steps that turn raw data into
quality data (good input for mining tools).
Why Data Preprocessing?


Data in the real world is dirty
• incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
  e.g., occupation=“ ”
• noisy: containing errors or outliers
  e.g., Salary=“-10”
• inconsistent: containing discrepancies in codes or names
  e.g., Age=“42” Birthday=“03/07/1997”
  e.g., Was rating “1,2,3”, now rating “A, B, C”
  e.g., discrepancy between duplicate records
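The three kinds of dirty data above can be checked mechanically. The following is a minimal sketch, not a method from the slides: the function name `find_problems`, the record layout, the reading of “03/07/1997” as March 7, 1997, and the reference date are all assumptions made for illustration.

```python
from datetime import date

def find_problems(record, today):
    """Return a list of data-quality problems found in one record (illustrative)."""
    problems = []

    # Incomplete: the attribute is absent or blank.
    occupation = record.get("occupation") or ""
    if not occupation.strip():
        problems.append("incomplete: occupation is missing")

    # Noisy: a salary cannot be negative.
    salary = record.get("salary")
    if salary is not None and salary < 0:
        problems.append("noisy: salary is negative")

    # Inconsistent: the stored age disagrees with the age implied by the birthday.
    age, birthday = record.get("age"), record.get("birthday")
    if age is not None and birthday is not None:
        had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
        implied_age = today.year - birthday.year - (0 if had_birthday else 1)
        if implied_age != age:
            problems.append("inconsistent: age does not match birthday")

    return problems

# Hypothetical record built from the slide's examples.
record = {"occupation": " ", "salary": -10, "age": 42, "birthday": date(1997, 3, 7)}
problems = find_problems(record, today=date(2014, 3, 6))
```

Passing the reference date explicitly keeps the check reproducible; a real pipeline would likely use the date on which the record was captured.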

Why Is Data Preprocessing Important?


No quality data, no quality mining results!
• Quality decisions must be based on quality data


e.g., duplicate or missing data may cause incorrect or even
misleading statistics.

• A data warehouse needs consistent integration of quality data

Data extraction, cleaning, and transformation make up the majority
of the work of building a data warehouse (90%).

DATA PROBLEMS
Major Tasks in Data Preprocessing

Data cleaning
• Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
• Integration of multiple databases, data cubes, or files
Data transformation
• Normalization and aggregation
Data reduction
• Obtains reduced representation in volume but produces the
same or similar analytical results
Data discretization
• Part of data reduction but with particular importance, especially
for numerical data

Forms of Data Preprocessing

Data Cleaning




Importance
• “Data cleaning is the number one problem in data
warehousing”—DCI survey
Data cleaning tasks
• Fill in missing values
• Identify outliers and smooth out noisy data

• Correct inconsistent data
• Resolve redundancy caused by data integration
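As an illustration of the first cleaning task, the sketch below fills missing values with the attribute mean. Mean imputation is only one common choice and is not prescribed by the slides; the function name and the sample values are hypothetical.

```python
from statistics import mean

def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to compute a mean from")
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical salary column with two missing entries.
salaries = [30, None, 50, 40, None]
filled = fill_missing_with_mean(salaries)
```

Other fill strategies (a global constant, the median, or the most probable value predicted from other attributes) fit the same function shape.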

Noisy Data



Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
• faulty data collection instruments
• data entry problems
• data transmission problems

Noisy Data (continued)

Incorrect attribute values may also be due to
• technology limitations
• inconsistent naming conventions
Other data problems that require data cleaning
• duplicate records
• incomplete data
• inconsistent data

How to Handle Noisy Data?


Binning
• first sort the data and partition it into (equal-frequency)
bins
• then smooth by bin means, bin medians, bin boundaries, etc.

Regression
• smooth by fitting the data to regression functions

Clustering
• detect and remove outliers

Combined computer and human inspection
• detect suspicious values and have a human check them (e.g.,
deal with possible outliers)
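The binning approach above can be sketched as follows: sort the values, split them into equal-frequency bins, then replace each value by its bin mean or by the nearer bin boundary. The function names and the price values are illustrative assumptions, and ties between boundaries are resolved toward the lower one.

```python
from statistics import mean

def equal_frequency_bins(values, bin_size):
    """Sort the values and cut them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin by the bin's mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

# Illustrative price data.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_bounds = smooth_by_boundaries(bins)
```

Smoothing by bin medians follows the same pattern with `statistics.median` in place of `mean`.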

Cluster Analysis

Data Integration


Data integration:

• Combines data from multiple sources into a coherent
store





Schema integration: Integrate metadata from
different sources

Entity identification problem:

• Identify real world entities from multiple data
sources, e.g., Bill Clinton = William Clinton
• metadata can be used to help avoid errors in schema
integration



Detecting and resolving data value conflicts

• For the same real-world entity, attribute values
from different sources may differ
• Possible reasons: different representations or
different scales, e.g., kg vs. pound
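A scale conflict such as kg vs. pound can be resolved by converting every source to one unit before comparing values. This is a minimal sketch under assumed record layouts; `to_kilograms` and the two source records are hypothetical.

```python
LB_TO_KG = 0.45359237  # exact definition of the international avoirdupois pound

def to_kilograms(value, unit):
    """Convert a weight to kilograms; the accepted unit spellings are illustrative."""
    unit = unit.strip().lower()
    if unit in ("kg", "kilogram", "kilograms"):
        return value
    if unit in ("lb", "lbs", "pound", "pounds"):
        return value * LB_TO_KG
    raise ValueError(f"unknown weight unit: {unit!r}")

# The same real-world entity as recorded by two hypothetical sources.
source_a = {"id": 1, "weight": 70.0, "unit": "kg"}
source_b = {"id": 1, "weight": 154.0, "unit": "Pound"}

kg_a = to_kilograms(source_a["weight"], source_a["unit"])
kg_b = to_kilograms(source_b["weight"], source_b["unit"])

# After conversion the two values differ by well under 1 kg,
# so the remaining gap is rounding rather than a true conflict.
same_within_tolerance = abs(kg_a - kg_b) < 0.5
```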

Handling Redundancy in Data Integration


Redundant data often occur when integrating
multiple databases
• Object identification: The same attribute or
object may have different names in
different databases
• Derivable data: One attribute may be a
“derived” attribute in another table, e.g.,
annual revenue





Redundant attributes can often be
detected by correlation analysis
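For numeric attributes, one way to run such a correlation analysis is the Pearson correlation coefficient: a value close to +1 or -1 suggests that one attribute can be derived from the other. The sketch below is an assumption-labelled illustration; the function name and the revenue figures are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return cov / (spread_x * spread_y)

# Hypothetical attributes: annual revenue is derived from monthly revenue.
monthly_revenue = [10, 12, 9, 15, 11]
annual_revenue = [12 * m for m in monthly_revenue]
r = pearson(monthly_revenue, annual_revenue)
```

Here `r` comes out at essentially 1, flagging `annual_revenue` as a candidate for removal. High correlation alone does not prove redundancy, so flagged pairs still deserve a manual look.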

Careful integration of the data from multiple
sources may help reduce/avoid redundancies
and inconsistencies and improve mining speed
and quality

Descriptive Data Summarization










For data preprocessing to be successful, you need an
overall picture of your data.
Descriptive data summarization can be used to identify the
typical properties of your data and to highlight which data
values should be treated as noise or outliers.
Measures of central tendency include
mean, median, mode, and midrange.
Midrange: the average of the largest and smallest
values in the set.
Measures of data dispersion include
quartiles, interquartile range (IQR), and variance.
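The measures listed above can all be computed with Python's standard `statistics` module. This is a small sketch on made-up values; the function name `summarize` is hypothetical, the variance shown is the population variance, and `quantiles` uses its default "exclusive" method, so other tools may report slightly different quartiles.

```python
from statistics import mean, median, multimode, pvariance, quantiles

def summarize(values):
    """Central tendency and dispersion measures for a list of numbers."""
    q1, _q2, q3 = quantiles(values, n=4)  # the three quartile cut points
    return {
        "mean": mean(values),
        "median": median(values),
        "mode": multimode(values),                      # list: data may be multimodal
        "midrange": (min(values) + max(values)) / 2,    # average of smallest and largest
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,                                 # interquartile range
        "variance": pvariance(values),                  # population variance
    }

# Illustrative data set.
summary = summarize([2, 4, 4, 5, 7, 9, 10, 12])
```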

March 6, 2014

Data Transformation


Smoothing: remove noise from data (binning,
regression, and clustering)



Aggregation: summarization, data cube construction



Generalization: concept hierarchy climbing



Normalization: attribute values are scaled to fall within a
small, specified range
• min-max normalization

• z-score normalization
• normalization by decimal scaling


Attribute/feature construction
• New attributes constructed from the given ones
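Two of the normalization methods named above can be sketched briefly. The definitions used here are the usual ones and are stated as assumptions, since the slides only name the methods: z-score maps v to (v - mean) / standard deviation, and decimal scaling divides by the smallest power of ten that brings every absolute value below 1. Function names and sample values are hypothetical.

```python
from statistics import mean, pstdev

def z_score(values):
    """Z-score normalization using the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("z-score is undefined for a constant attribute")
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer making max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

# Illustrative values.
z = z_score([2, 4, 4, 4, 5, 5, 7, 9])   # mean 5, standard deviation 2
scaled = decimal_scaling([-986, 917])    # j = 3, so values are divided by 1000
```

Z-score normalization is useful when the minimum and maximum are unknown or when outliers would dominate a min-max mapping.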

Min-max normalization
Suppose that min_A and max_A are the minimum
and maximum values of an attribute A.
Min-max normalization maps a value v of A to v’ in
the range [new_min_A, new_max_A]
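The slide text does not spell out the mapping itself. The standard min-max formula is v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A, and the sketch below implements it directly. The function name, the default target range [0, 1], and the income figures are illustrative assumptions.

```python
def min_max_normalize(v, min_a, max_a, new_min_a=0.0, new_max_a=1.0):
    """Map v from [min_a, max_a] linearly onto [new_min_a, new_max_a]."""
    if max_a == min_a:
        raise ValueError("min-max normalization is undefined when max_A == min_A")
    return (v - min_a) / (max_a - min_a) * (new_max_a - new_min_a) + new_min_a

# Hypothetical income attribute with min_A = 12,000 and max_A = 98,000.
normalized = min_max_normalize(73_600, 12_000, 98_000)
lowest = min_max_normalize(12_000, 12_000, 98_000)
highest = min_max_normalize(98_000, 12_000, 98_000)
```

The mapping preserves the relationships among the original values, but a later value outside [min_A, max_A] falls outside the target range.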

Data Reduction Strategies




Why data reduction?
• A database/data warehouse may store
terabytes of data
• Complex data analysis/mining may take a
very long time to run on the complete
data set
Data reduction
• Obtain a reduced representation of the
data set that is much smaller in volume
yet produces the same (or almost the
same) analytical results

Data Reduction










1. Data cube aggregation, where aggregation operations are
applied to the data in the construction of a data cube.
2. Attribute subset selection, where irrelevant, weakly
relevant, or redundant attributes or dimensions may be detected
and removed.
3. Dimensionality reduction, where encoding mechanisms are
used to reduce the data set size.
4. Numerosity reduction, where the data are replaced or
estimated by alternative, smaller data representations.
5. Discretization and concept hierarchy generation, where
raw data values for attributes are replaced by ranges or higher
conceptual levels.
• Data discretization is a form of numerosity reduction that is
very useful for the automatic generation of concept
hierarchies.
• Discretization and concept hierarchy generation are powerful
tools for data mining, in that they allow the mining of data at
multiple levels of abstraction.
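To make discretization concrete, the sketch below replaces raw numeric values by the index of an equal-width range. Equal-width partitioning is just one simple discretization scheme and is chosen here as an assumption; the function names and the sample values are hypothetical.

```python
def equal_width_edges(values, k):
    """Return the k + 1 edges that split [min, max] into k equal-width ranges."""
    lo, hi = min(values), max(values)
    if k < 1 or lo == hi:
        raise ValueError("need k >= 1 and at least two distinct values")
    width = (hi - lo) / k
    return [lo + i * width for i in range(k + 1)]

def discretize(values, k):
    """Replace each value by the index (0..k-1) of the range it falls into."""
    lo, hi = min(values), max(values)
    if k < 1 or lo == hi:
        raise ValueError("need k >= 1 and at least two distinct values")
    width = (hi - lo) / k
    # The maximum value would land in range k, so clamp it into the last range.
    return [min(int((v - lo) / width), k - 1) for v in values]

# Illustrative numeric attribute split into 3 ranges of width 30.
ages = [0, 15, 30, 45, 60, 75, 90]
edges = equal_width_edges(ages, 3)
labels = discretize(ages, 3)
```

Seven distinct values are reduced to three range labels. Applying the same idea again to the ranges themselves yields coarser levels of a concept hierarchy.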

Data Cube Aggregation

Cluster Analysis


Clustering can be used to generate a
concept hierarchy for a numeric attribute A
by following either a top-down splitting
strategy or a bottom-up merging strategy.

Concept Hierarchy Generation for Categorical Data
• Specification of a partial ordering of
attributes explicitly at the schema
level by users or experts
• Specification of a portion of a
hierarchy by explicit data grouping
