
Data preprocessing using Machine Learning


  1. Dr. Gopal Sakarkar, IEEE-CIS Member, Ph.D. (CSE), Department of AI and Machine Learning, G H Raisoni College of Engineering, Nagpur. Data Pre-processing Services using Machine Learning Algorithms
  2. Data Cleaning Services. Good data preparation is key to producing valid and reliable models.
  3. Applications of Machine Learning
  4. Applications of Machine Learning
  5. Applications of Machine Learning
  6. Applications of Machine Learning
  7. What is Machine Learning? • According to Arthur Samuel (1959), machine learning algorithms enable computers to learn from data, and even improve themselves, without being explicitly programmed. • Machine learning (ML) is a category of algorithm that allows software applications to become more accurate in predicting outcomes without being explicitly programmed. • The basic premise of machine learning is to build algorithms that can receive input data and use statistical analysis to predict an output, updating outputs as new data becomes available.
  8. Types of Machine Learning
  9. Types of Machine Learning: Machine Learning Algorithms divide into Supervised Learning and Unsupervised Learning.
  10. Where is Data Cleaning used? The Machine Learning Life Cycle
  11. Data Pre-processing • Data preprocessing is an important step in ML. • The phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects. • It involves transforming raw data into an understandable format. • Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. • Data preprocessing is a proven method of resolving such issues.
  12. Why Data Pre-processing?
  13. Why Data Pre-processing? • A manager at All Electronics has been charged with analyzing the company's data with respect to sales at a branch. • He carefully inspects the company's database and data warehouse, identifying dimensions to be included, such as item, price, units sold, and session. • He notices that several of the attributes for various tuples have no recorded value, yet for analysis he would like to include that information. • In other words, the data he wishes to analyze with machine learning techniques is incomplete, noisy, and inconsistent.
  14. Why Data Pre-processing? (blank cells are the missing values that motivate preprocessing)

     Item          | Price | Unit Sold | Session
     --------------|-------|-----------|---------------
     TV            | 7200  | 44        | All
     Fan           | 480   | 27        | Summer
     Tube light    | 54    | 30        | All
     AC            | 27000 | 38        |
     Fridge        |       | 40        | Summer
     Switches      | 58    | 35        |
     2 mm Wire     | 520   |           | All
     Backup Light  | 790   | 48        | Winter
     Fan Regulator | 83    | 50        | All
     Bulb          | 87    | 37        | Rainy Session
  15. What do you mean by data Pre-processing? • It is cleaning and exploring data for analysis. • Prepping data for modeling. • Modeling in Python requires numerical input. • Data preprocessing is a technique that involves transforming raw data into an understandable format. • Data preprocessing is a proven method of resolving such issues.
  16. Data Understanding: Relevance of data • What data is available for the task? • Is this data relevant? • Is additional relevant data available? • How much historical data is available?
  17. Data Understanding: Quantity of data • Number of instances (records, objects). Rule of thumb: 5,000 or more desired; if less, results are less reliable, so use special methods (boosting, …). • Number of attributes (fields). Rule of thumb: for each attribute, 10 or more instances; if there are more fields, use feature reduction and selection. • If very unbalanced, use sampling.
  18. Data Pre-processing Steps
  19. Data Pre-processing Steps
  20. Data Pre-processing Steps • Data Cleaning: the process of filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. • Data Integration: integration of multiple databases, data cubes, or files. • Data Transformation: the task of data normalization and aggregation.
  21. Data Pre-processing Steps • Data Reduction: obtaining a reduced representation that is smaller in volume but produces the same or similar analytical results. • Data Discretization: part of data reduction, but of particular importance, especially for numerical data.
  22. Data Pre-processing Steps
  23. Data Cleaning • Importance: data cleaning is the number one problem when working with large data sets. Data Cleaning Tasks: • Fill in missing values • Identify outliers and smooth out noisy data • Correct inconsistent data • Resolve redundancy caused by data integration
  24. Data Cleaning: Missing Data • Data is not always available. E.g., a student filling in an admission form may not know the local guardian's contact number. • Missing data may be due to: equipment malfunction; data inconsistent with other recorded data and thus deleted; data not entered due to misunderstanding; certain data not considered important at the time of entry; no registered history or changes of the data; expansion of the data schema.
  25. How to Handle Missing Data? • Ignore the tuple (loss of information). • Fill in missing values manually: tedious, often infeasible. • Fill it in automatically with a global constant, e.g., "unknown" (which may itself form a new class!). • Imputation: use the attribute mean to fill in the missing value, or use the most probable value to fill in the missing value.
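Mean imputation, the simplest of the options above, can be sketched in a few lines of pandas; the toy table here is hypothetical, echoing the All Electronics example with a missing Fridge price:

```python
import pandas as pd

# Hypothetical sales table with a missing Price for "Fridge".
df = pd.DataFrame({
    "Item": ["TV", "Fan", "Fridge"],
    "Price": [7200.0, 480.0, None],
})

# Imputation: fill the missing value with the attribute mean.
df["Price"] = df["Price"].fillna(df["Price"].mean())
print(df["Price"].tolist())  # [7200.0, 480.0, 3840.0]
```

The mean of the observed prices (7200 and 480) is 3840, which replaces the missing entry.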
  26. Noisy Data • Noise: random error or variance in a measured variable. • Incorrect attribute values may be due to: faulty data collection instruments; data entry problems; data transmission problems; technology limitations; inconsistency in naming conventions. • Other data problems which require data cleaning: duplicate records; incomplete data; inconsistent data.
  27. How to handle noisy data? • Binning method: first sort the data and partition it into (equi-depth) bins; then one can smooth by bin means, by bin medians, by bin boundaries, etc. • Combined computer and human inspection: detect suspicious values automatically and have a human check them.
  28. Binning Methods for Data Smoothing • Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34. * Partition into (equi-depth) bins: Bin 1: 4, 8, 9, 15; Bin 2: 21, 21, 24, 25; Bin 3: 26, 28, 29, 34. * Smoothing by bin means: Bin 1: 9, 9, 9, 9 ((4+8+9+15)/4 = 9); Bin 2: 23, 23, 23, 23 ((21+21+24+25)/4 ≈ 23); Bin 3: 29, 29, 29, 29 ((26+28+29+34)/4 ≈ 29). * Smoothing by bin boundaries: Bin 1: 4, 4, 4, 15; Bin 2: 21, 21, 25, 25; Bin 3: 26, 26, 26, 34.
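The worked example above can be reproduced directly in Python, using the slide's price data and equi-depth bins of four values each:

```python
# Equi-depth binning of the slide's price data, depth 4 per bin.
prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
depth = 4
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smooth by bin means: every value becomes its bin's (rounded) mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smooth by bin boundaries: snap each value to the nearer bin edge.
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
             for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```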
  29. Data Integration • Data integration combines data from multiple sources. • Schema integration: integrate metadata from different sources. Entity identification problem: identify real-world entities across multiple data sources, e.g., A.cust-id ≡ B.cust-#. • Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources differ, e.g., different scales, metric vs. British units. • Removing duplicates and redundant data.
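A minimal sketch of schema integration in pandas, with hypothetical customer tables mirroring the A.cust-id ≡ B.cust-# example (all column and customer names here are illustrative):

```python
import pandas as pd

# Two hypothetical sources describing the same customers under
# different schemas: A keys on cust_id, B keys on cust_num.
a = pd.DataFrame({"cust_id": [1, 2], "name": ["Asha", "Ravi"]})
b = pd.DataFrame({"cust_num": [2, 3], "city": ["Pune", "Nagpur"]})

# Schema integration: rename B's key to match A's, then merge on it.
b = b.rename(columns={"cust_num": "cust_id"})
merged = a.merge(b, on="cust_id", how="outer")
print(merged.shape)  # (3, 3): cust_id 2 carries both name and city
```

An outer merge keeps rows seen in only one source, which is where missing-value handling from the earlier slides comes back into play.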
  30. Data Transformation • Smoothing: remove noise from data. • Normalization: scale values to fall within a small, specified range. • Attribute/feature construction: new attributes constructed from the given ones. • Aggregation: summarization; integrate data from different sources (tables).
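Min-max normalization, one common way to scale an attribute into a small, specified range, can be sketched on the price values from the earlier table:

```python
# Min-max normalization: rescale the price attribute into [0, 1].
prices = [54, 480, 7200, 27000]
lo, hi = min(prices), max(prices)
normalized = [(p - lo) / (hi - lo) for p in prices]
print(normalized[0], normalized[-1])  # 0.0 1.0
```

After scaling, the minimum maps to 0, the maximum to 1, and everything else falls in between.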
  31. Data Reduction • Data is too big to work with: too many instances, too many features (attributes). • Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results (easily said but difficult to do). • Data reduction strategies: dimensionality reduction (remove unimportant attributes); aggregation and clustering (remove redundant or closely associated attributes); sampling.
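One crude form of dimensionality reduction is dropping attributes with (near-)zero variance, since a constant column cannot inform any model; the matrix and threshold below are illustrative, not from the slides:

```python
import numpy as np

# Toy feature matrix: column 1 is constant and carries no information.
X = np.array([[1.0, 5.0, 10.0],
              [2.0, 5.0, 20.0],
              [3.0, 5.0, 30.0]])

# Keep only attributes whose variance exceeds a small threshold.
keep = X.var(axis=0) > 1e-8
X_reduced = X[:, keep]
print(X_reduced.shape)  # (3, 2): the constant column was dropped
```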
  32. Data Reduction: Clustering • Partition the data set into clusters, and store only a cluster representation. • Can be very effective if the data is clustered, but not if the data is dirty. • There are many choices of clusterings and clustering algorithms.
  33. Data Reduction: Sampling • Choose a representative subset of the data. Simple random sampling may perform poorly when the class distribution is skewed. • Develop adaptive sampling methods. Stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database.
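A sketch of stratified sampling with pandas (`DataFrameGroupBy.sample`, available in pandas 1.1+): drawing 50% from each Session stratum keeps each class near its original proportion. The toy data is illustrative:

```python
import pandas as pd

# Toy data: 4 "All" rows and 2 "Summer" rows.
df = pd.DataFrame({
    "Session": ["All", "All", "All", "All", "Summer", "Summer"],
    "UnitSold": [44, 30, 50, 37, 27, 40],
})

# Stratified sampling: 50% from each stratum, preserving proportions.
sample = df.groupby("Session").sample(frac=0.5, random_state=0)
print(sample["Session"].value_counts().to_dict())  # {'All': 2, 'Summer': 1}
```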
  34. Data Reduction: Sampling
  35. Data Discretization • Discretization is a process that transforms quantitative data into qualitative data. • It significantly improves the quality of discovered knowledge. • It reduces the running time of various machine learning tasks such as association rule discovery, classification, clustering, and prediction. • It reduces the number of values for a given continuous attribute by dividing the range of the attribute into intervals. • Interval labels can then be used to replace actual data values.
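Discretization of a continuous attribute into labelled intervals can be sketched with `pandas.cut`; the bin edges and the "low"/"mid"/"high" labels here are illustrative choices, not from the slides:

```python
import pandas as pd

# Discretize the continuous price attribute into labelled intervals.
prices = pd.Series([54, 480, 790, 7200, 27000])
labels = pd.cut(prices, bins=[0, 100, 1000, 30000],
                labels=["low", "mid", "high"])
print(labels.tolist())  # ['low', 'mid', 'mid', 'high', 'high']
```

The interval labels can then stand in for the raw values in downstream tasks such as association rule discovery.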
  36. Data Discretization
  37. Email: gopal.sakarkar@raisoni.net
  38. Part 2: Implementation of Data Cleaning Services Using Python Programming