Data Mining
Dr. J. Kalavathi, M.Sc., Ph.D.,
Assistant Professor,
Department of Information Technology,
V.V.Vanniaperumal College for Women,
Virudhunagar.
Data mining aims at discovering relationships and other forms
of knowledge from data in the real world.
Data map entities in the application domain to symbolic
representations through a measurement function.
Data in the real world is dirty:
• incomplete: missing data, lacking attribute values, lacking
certain attributes of interest, or containing only aggregate data
• noisy: containing errors, such as measurement errors, or outliers
• inconsistent: containing discrepancies in codes or names
• distorted: sampling distortion (a change for the worse)
• No quality data, no quality mining results
• Quality decisions must be based on quality data
• Data warehouse needs consistent integration of
quality data
• Data cleaning: fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
• Data integration: integration of multiple databases, data cubes, or files
• Data transformation: normalization and aggregation (converts the data into
forms suitable for mining and removes noise from the data)
• Data reduction: obtains a representation that is reduced in volume but
produces the same or similar analytical results
• Data discretization: part of data reduction, but of particular importance,
especially for numerical data
Data cleaning tasks
• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
• Resolve redundancy caused by data integration
(a) Missing values in data entry
• Data is not always available
  ◦ E.g., many tuples have no recorded value for several
  attributes, such as customer income in sales data
• Missing data may be due to
  ◦ equipment malfunction
  ◦ data that was inconsistent with other recorded data and was therefore deleted
  ◦ data not entered due to misunderstanding
  ◦ data not considered important at the time of entry
  ◦ history or changes of the data not being recorded
• Missing data may need to be inferred.
(b) Missing values in existing data
Missing-value methods can be used to handle missing data in existing
databases, as well as data left unknown or not applicable during entry.
Methods of handling missing data
Ignore the tuple (instance) - usually done when the class label is missing
(assuming the data mining goal is classification), or when many attributes are
missing from the row (not just one).
Fill in the missing value manually - search for all missing values and
replace them with appropriate values.
Use a global constant to fill in for missing values - decide on a new
global constant value, like "unknown", "N/A" or minus infinity, that will be
used to fill all the missing values.
Use the attribute mean to fill in the missing values - replace missing values of an
attribute with the mean value for that attribute in the database (or with the
median or mode when the attribute is discrete or its distribution is skewed).
Use the attribute mean for all samples belonging to the same class as the given
tuple - instead of using the mean (or median) of an attribute calculated over
all the rows in a database, limit the calculation to the relevant class, so that
the value is more relevant to the row being filled.
Use a data mining algorithm to predict the most probable value - the value
can be determined using regression, inference-based tools using a Bayesian
formalism, decision trees, or clustering algorithms (k-means, k-medians, etc.).
EM (Expectation Maximization) method - compute the expected value of the
complete data record, then substitute the expected values for the missing values.
Multiple imputation - this process creates data matrices containing actual raw
data values to fill the gaps in an existing database.
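The simpler filling methods above (global constant, attribute mean, class-wise attribute mean) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the attribute names `income` and `segment`, and the three helper functions are hypothetical and do not come from the slides, and `None` is assumed to mark a missing value.

```python
from statistics import mean

# Hypothetical sales records; None marks a missing income value.
rows = [
    {"customer": "A", "segment": "gold", "income": 80.0},
    {"customer": "B", "segment": "gold", "income": None},
    {"customer": "C", "segment": "gold", "income": 100.0},
    {"customer": "D", "segment": "basic", "income": 30.0},
    {"customer": "E", "segment": "basic", "income": None},
    {"customer": "F", "segment": "basic", "income": 50.0},
]


def fill_with_constant(rows, attr, constant="unknown"):
    """Replace every missing value of attr with one global constant."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]


def fill_with_mean(rows, attr):
    """Replace missing values of attr with the mean of the observed values."""
    overall = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: overall if r[attr] is None else r[attr]} for r in rows]


def fill_with_class_mean(rows, attr, class_attr):
    """Replace missing values of attr with the mean observed in the same class.

    Assumes every class has at least one observed value of attr.
    """
    observed = {}
    for r in rows:
        if r[attr] is not None:
            observed.setdefault(r[class_attr], []).append(r[attr])
    class_means = {label: mean(values) for label, values in observed.items()}
    return [
        {**r, attr: class_means[r[class_attr]] if r[attr] is None else r[attr]}
        for r in rows
    ]
```

In this toy data the overall mean fills both gaps with the same number, while the class-wise mean gives customer B the mean of the "gold" rows and customer E the mean of the "basic" rows, which is the point of restricting the calculation to the relevant class.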
• Noise is a random error or variance in a measured variable.
• Noisy data is meaningless data that machines cannot interpret.
• Noisy data may be due to faulty data collection instruments, data
entry problems, and technology limitations.
• It can be handled in the following ways:
• Binning
• Clustering
• Regression
• Combined computer and human inspection
• Binning methods smooth a sorted data value by consulting its "neighborhood,"
that is, the values around it.
• The sorted values are distributed into a number of "buckets," or bins.
• The whole data set is divided into segments of equal size, and each segment is
then smoothed.
• Each segment is handled separately.
• All values in a segment can be replaced by the segment mean, or each value
can be replaced by the closest segment boundary value.
• First sort the data and partition it into equal-depth (equi-depth) bins.
• Then smooth by bin means, by bin medians, by bin
boundaries, etc.
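The two steps above (sort and partition into equal-depth bins, then smooth) can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names and the sample `prices` list are assumptions made for the example, and the bin count is assumed not to exceed the number of values.

```python
from statistics import mean


def equal_depth_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    if not 0 < n_bins <= len(values):
        raise ValueError("n_bins must be between 1 and len(values)")
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value each.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # each bin is already sorted
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


# Hypothetical sample data for illustration.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_depth_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

With these nine prices and three bins, each bin holds three values. Smoothing by means makes every value in a bin identical, while smoothing by boundaries keeps the bin's minimum and maximum and pulls the interior values to whichever is closer.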
• Regression is a data mining technique used to fit an equation to a dataset.
• Here data can be smoothed by fitting it to a regression function.
• The regression used may be linear (having one independent variable) or
multiple (having multiple independent variables).
• Linear regression
• Y = b + mx
• x --> given value (independent variable)
• Y --> prediction
• b, m --> intercept and slope estimated from the data
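As a sketch of smoothing by linear regression, the intercept and slope of the line can be estimated by ordinary least squares and each noisy value replaced by the value on the fitted line. The least-squares formulas are the standard ones; the function name `fit_line` and the sample readings are assumptions made for this example.

```python
def fit_line(xs, ys):
    """Least-squares estimates of intercept b and slope m for y = b + m*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx
    b = y_bar - m * x_bar
    return b, m


# Hypothetical noisy readings that roughly follow a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

b, m = fit_line(xs, ys)

# Smoothing: replace each observed y by the prediction on the fitted line.
smoothed = [b + m * x for x in xs]
```

The same fitted line can also be used to predict a value for an x that was never observed, which is how regression doubles as a method for filling in missing values.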
• Combined computer and human inspection: the computer detects
suspicious values, which are then checked by a human (e.g., to deal with
possible outliers).
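One way to picture the computer's half of this process is a simple rule that flags values far from the mean and hands only those to a person. The rule below (more than a chosen number of standard deviations from the mean) and the threshold of 2.0 are assumptions for illustration; the slides do not prescribe a particular detection rule.

```python
from statistics import mean, pstdev


def flag_for_review(values, threshold=2.0):
    """Return (index, value) pairs lying more than `threshold` population
    standard deviations from the mean; a human then checks each of them."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing looks suspicious
    return [(i, v) for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical sensor readings with one suspicious entry.
readings = [10, 11, 9, 10, 12, 11, 10, 95]
suspicious = flag_for_review(readings)
```

The function only produces a short review list; deciding whether a flagged value is an error or a genuine extreme is left to the human.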
Clustering: this approach groups similar data into clusters. Outliers may go
undetected, or they will fall outside the clusters.
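A minimal one-dimensional sketch of the idea: values that lie close together form a cluster, and values that end up in very small clusters are treated as outliers. The gap-based grouping used here is a deliberately simple stand-in chosen for illustration, not the specific clustering algorithm of the slides; `max_gap`, `min_size`, and the sample `ages` are hypothetical.

```python
def cluster_1d(values, max_gap):
    """Group sorted values; a new cluster starts when the gap to the
    previous value exceeds max_gap."""
    if not values:
        return []
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


def outliers_from_clusters(clusters, min_size):
    """Treat the members of clusters smaller than min_size as outliers."""
    return [v for c in clusters if len(c) < min_size for v in c]


# Hypothetical ages with one value far from both groups.
ages = [22, 23, 25, 24, 41, 43, 42, 44, 97]
clusters = cluster_1d(ages, max_gap=3)
outliers = outliers_from_clusters(clusters, min_size=2)
```

Note that an erroneous value that happens to land inside a dense cluster would not be flagged, which matches the remark above that outliers may go undetected.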
