Pre-Processing and Data Preparation

  1. Pre-Processing and Data Preparation Big Data & IoT Umair Shafique (03246441789) Scholar MS Information Technology - University of Gujrat
  2. Data Pre-Processing Learning Objectives • Understand what data pre-processing is. • Understand why we pre-process data. • Understand how to clean data. • Understand how to integrate and transform data.
  3. Data pre-processing • Data takes many forms: text, images, audio, and video. • Data pre-processing is the process of converting such raw data into meaningful data using different techniques.
  4. Terms used in pre-processing • Incomplete: missing data, lacking attribute values, lacking certain attributes of interest, or containing only aggregate data • Noisy: containing errors, such as measurement errors, or outliers • Inconsistent: containing discrepancies in codes or names • Distorted: sampling distortion (a change for the worse)
  5. Steps of Pre-processing • Data cleaning Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies and errors • Data integration Integration of multiple databases, data cubes, or files • Data transformation Normalization and aggregation • Data reduction Obtains reduced representation in volume but produces the same or similar analytical results • Data discretization Part of data reduction but with particular importance, especially for numerical data
  6. Steps of data pre-processing
  7. Data Cleaning • Data cleaning tasks i. Fill in missing values ii. Identify outliers and smooth out noisy data iii. Correct inconsistent data
  8. Data Cleaning 1. Missing Data • Data is not always available, e.g., many tuples have no recorded value for several attributes, such as customer income in sales data • Missing data may be due to a. equipment malfunction b. data deleted because it was inconsistent with other recorded data c. data not entered due to misunderstanding d. certain data not considered important at the time of entry e. history or changes of the data not being recorded • Missing data may need to be inferred.
  9. How to Handle Missing Data? • Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably • Fill in the missing value manually • Use a global constant to fill in the missing value: e.g., "unknown", a new class • Use the attribute mean to fill in the missing value • Use the attribute mean of all samples belonging to the same class to fill in the missing value: smarter • Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or a decision tree
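The attribute-mean strategy above can be sketched in a few lines of pure Python (the income figures are made-up illustration data; `None` stands in for a missing value):

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

incomes = [30000, None, 45000, None, 60000]
print(fill_missing_with_mean(incomes))  # missing entries become 45000.0
```

A class-conditional version would simply compute one mean per class label before filling, which is why the slide calls it "smarter": it preserves between-class differences.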
  10. 2. Noisy Data • Noise: random error or variance in a measured variable • Incorrect attribute values may be due to – faulty data collection instruments – data entry problems – data transmission problems – technology limitations – inconsistency in naming conventions • Other data problems which require data cleaning – duplicate records – inconsistent data
  11. How to Handle Noisy Data? • Binning method: - first sort the data and partition it into (equi-depth) bins - then smooth by bin means, bin medians, bin boundaries, etc. • Clustering - detect and remove outliers • Regression - smooth by fitting the data to regression functions • Combined computer and human inspection - detect suspicious values and have them checked by a human
  12. Binning Methods for Data Smoothing * Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 * Partition into (equi-depth) bins: - Bin 1: 4, 8, 9, 15 - Bin 2: 21, 21, 24, 25 - Bin 3: 26, 28, 29, 34 1. Smoothing by bin means: - Bin 1: 9, 9, 9, 9 - Bin 2: 23, 23, 23, 23 - Bin 3: 29, 29, 29, 29 2. Smoothing by bin boundaries (each value replaced by the closer boundary): - Bin 1: 4, 4, 4, 15 - Bin 2: 21, 21, 25, 25 - Bin 3: 26, 26, 26, 34
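The worked example above can be reproduced directly; this is a minimal pure-Python sketch of equi-depth binning with both smoothing rules:

```python
def equi_depth_bins(sorted_data, n_bins):
    """Partition already-sorted data into bins of equal size (equi-depth)."""
    size = len(sorted_data) // n_bins
    return [sorted_data[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by that bin's (rounded) mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace each value by the closer of the bin's min and max boundary."""
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
            for b in bins]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 3)
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```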
  13. Cluster Analysis
  14. Regression
  15. Data Integration • Data integration: • combines data from multiple sources into a coherent store • Schema integration • integrate metadata from different sources • entity identification problem: identify real-world entities across multiple data sources, e.g., A.cust-id ≡ B.cust-# • Detecting and resolving data value conflicts • for the same real-world entity, attribute values from different sources differ • possible reasons: different representations, different scales, e.g., metric vs. British units
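The entity-identification problem above (A.cust-id vs. B.cust-#) can be illustrated with a small join; the sources, field names, and customer records here are all hypothetical:

```python
# Two hypothetical sources describing the same customers under
# different key names (cust_id in source A, cust_no in source B).
source_a = [{"cust_id": 1, "name": "Ali"}, {"cust_id": 2, "name": "Sara"}]
source_b = [{"cust_no": 1, "city": "Gujrat"}, {"cust_no": 2, "city": "Lahore"}]

def integrate(a_rows, b_rows):
    """Merge records that refer to the same real-world entity."""
    b_by_key = {row["cust_no"]: row for row in b_rows}  # reconcile key names
    merged = []
    for row in a_rows:
        b = b_by_key.get(row["cust_id"], {})
        merged.append({"cust_id": row["cust_id"],
                       "name": row["name"],
                       "city": b.get("city")})
    return merged

print(integrate(source_a, source_b))
```

In practice the hard part is discovering that the two key columns mean the same thing (schema integration via metadata), not the join itself.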
  16. Handling Redundant Data in Data Integration • Redundant data occur often when integrating multiple databases – The same attribute may have different names in different databases – One attribute may be a "derived" attribute in another table, e.g., annual revenue • Redundant data may be detectable by correlation analysis • Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
  17. Data Transformation • Aggregation: summarization, data cube construction • Generalization: concept hierarchy climbing • Normalization: scaled to fall within a small, specified range • min-max normalization • z-score normalization • normalization by decimal scaling • Attribute/feature construction • New attributes constructed from the given ones
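The three normalization schemes named above can be sketched in pure Python (the sample values are made up for illustration):

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: linearly rescale into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Z-score normalization: subtract the mean, divide by the std. deviation."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

def decimal_scaling(values):
    """Divide by the smallest power of 10 that maps all values into (-1, 1)."""
    j = len(str(int(max(abs(v) for v in values))))
    return [v / 10 ** j for v in values]

data = [200, 300, 400, 600, 1000]
print(min_max(data))          # [0.0, 0.125, 0.25, 0.5, 1.0]
print(decimal_scaling(data))  # [0.02, 0.03, 0.04, 0.06, 0.1]
```

Min-max is sensitive to outliers (they define the range); z-score is the usual choice when the minimum and maximum are unknown or unstable.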
  18. Data Reduction • Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data. That is, mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results.
  19. Data Reduction Strategies 1. Data cube aggregation 2. Dimension reduction 3. Data compression 4. Numerosity reduction 5. Discretization and concept hierarchies
  20. 1- Data Cube Aggregation Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube.
  21. 2- Dimension Reduction • Dimension reduction, where irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed. • The goal is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes. • Mining on a reduced set of attributes has an additional benefit. It reduces the number of attributes appearing in the discovered patterns, helping to make the patterns easier to understand.
  22. 2.1- Methods of Dimension Reduction Basic heuristic methods of attribute dimension reduction include the following techniques: o Stepwise forward selection The procedure starts with an empty set of attributes as the reduced set. The best of the original attributes is determined and added to the reduced set. At each subsequent iteration or step, the best of the remaining original attributes is added to the set. o Stepwise backward elimination The procedure starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set. o Combination of forward selection and backward elimination The stepwise forward selection and backward elimination methods can be combined so that, at each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.
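Stepwise forward selection, as described above, is a greedy loop. In this sketch the evaluation function `score` is a hypothetical stand-in (a toy per-attribute relevance table); a real implementation would train and evaluate a model on the candidate attribute set:

```python
# Hypothetical relevance values for four made-up attributes.
RELEVANCE = {"age": 0.6, "income": 0.8, "zip": 0.1, "height": 0.05}

def score(attrs):
    """Toy stand-in for model quality using only the given attributes."""
    return sum(RELEVANCE[a] for a in attrs)

def forward_selection(attributes, k):
    """Greedily add, k times, the attribute that most improves the score."""
    selected = []
    remaining = list(attributes)
    for _ in range(k):
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

print(forward_selection(RELEVANCE, 2))  # ['income', 'age']
```

Backward elimination is the mirror image: start from the full set and repeatedly drop the attribute whose removal hurts the score least.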
  23. 3- Data Compression • In data compression, encoding mechanisms are used to reduce the data set size. • Types of compression: • Lossless compression Encoding techniques such as run-length encoding allow a simple, modest data size reduction. Lossless compression algorithms restore the precise original data from the compressed data. • Lossy compression Methods such as the discrete wavelet transform and PCA (principal component analysis) are examples of this compression. E.g., the JPEG image format uses lossy compression, yet the decompressed image remains meaningfully equivalent to the original. In lossy compression, the decompressed data may differ from the original data but are useful enough to retrieve information from.
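Run-length encoding, the lossless example named above, is small enough to show end to end; the round-trip property `decode(encode(x)) == x` is exactly what "lossless" means:

```python
def rle_encode(s):
    """Collapse runs of repeated characters into [char, count] pairs."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1
        else:
            runs.append([ch, 1])
    return runs

def rle_decode(runs):
    """Expand [char, count] pairs back into the original string."""
    return "".join(ch * n for ch, n in runs)

data = "aaaabbbcca"
encoded = rle_encode(data)
print(encoded)                      # [['a', 4], ['b', 3], ['c', 2], ['a', 1]]
assert rle_decode(encoded) == data  # lossless round-trip
```

RLE only helps when the data actually contains long runs; on data without repetition the encoded form can be larger than the input.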
  24. 4- Numerosity Reduction • Here, the data are replaced or estimated by alternative, smaller data representations • Techniques: • Parametric methods A model is used to estimate the data, so that typically only the model parameters need to be stored instead of the actual data (outliers may also be stored). Log-linear models, which estimate discrete multidimensional probability distributions, are an example • Non-parametric methods These store reduced representations of the data and include histograms, clustering, and sampling.
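Sampling, one of the non-parametric techniques listed above, is the simplest numerosity reduction: keep a small random subset in place of the full data. A minimal sketch using the standard library (the population here is synthetic):

```python
import random

def srswor(data, n, seed=0):
    """Simple random sample of n items without replacement (seeded for
    reproducibility)."""
    rng = random.Random(seed)
    return rng.sample(data, n)

population = list(range(1000))
sample = srswor(population, 10)
print(len(sample))  # 10 -- a far smaller representation of the data
```

Analyses run on the sample approximate results on the full data, which is the whole point of numerosity reduction.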
  25. 5- Discretization and Concept Hierarchy Discretization: • Data discretization techniques are used to divide continuous attributes into intervals • Types of discretization: • Top-down discretization If you first choose one or a few points (so-called breakpoints or split points) to divide the whole range of values and then repeat this recursively on the resulting intervals, the process is known as top-down discretization, also called splitting. • Bottom-up discretization If you first consider all the continuous values as potential split points and then discard some by merging neighbouring values into intervals, the process is called bottom-up discretization, also called merging.
  26. 5- Discretization and Concept Hierarchy Concept Hierarchy: • It reduces the data size by collecting and then replacing low-level concepts with higher-level ones • Techniques: 1. Binning Binning is the process of converting numerical variables into categorical counterparts. The number of categorical counterparts depends on the number of bins specified by the user. 2. Histogram Analysis Used to partition the values of an attribute X into disjoint ranges called brackets (buckets)
  27. 5- Discretization and Concept Hierarchy • Histogram partitioning rules: • Equal-frequency partitioning Partitioning the values so that each bucket holds roughly the same number of occurrences from the data set • Equal-width partitioning Partitioning the values into ranges of fixed width determined by the number of bins, e.g., 0-20, 21-40, and so on • Clustering Grouping similar data together
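The difference between the two partitioning rules above shows up clearly on skewed data; a pure-Python sketch with toy values:

```python
def equal_width_bins(values, n_bins):
    """Split the value range into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in values:
        i = min(int((v - lo) / width), n_bins - 1)  # clamp the max value
        bins[i].append(v)
    return bins

def equal_frequency_bins(values, n_bins):
    """Split sorted values into n_bins groups of (roughly) equal count."""
    data = sorted(values)
    size = len(data) // n_bins
    return [data[i * size:(i + 1) * size] if i < n_bins - 1
            else data[i * size:] for i in range(n_bins)]

values = [1, 2, 2, 3, 10, 11, 12, 50, 90]
print(equal_width_bins(values, 3))      # skew: most values land in the first bin
print(equal_frequency_bins(values, 3))  # each bin holds 3 values
```

Equal-width bins are easy to interpret but can end up nearly empty under skew; equal-frequency bins balance the counts at the cost of irregular interval widths.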
  28. Benefits of data preparation 1. Ensure the data used in analytics applications produces reliable results; 2. Identify and fix data issues that otherwise might not be detected; 3. Enable more informed decision-making by business executives and operational workers; 4. Reduce data management and analytics costs; 5. Avoid duplication of effort in preparing data for use in multiple applications; and 6. Get a higher ROI from BI and analytics initiatives.
  29. Summary • Data preparation, or preprocessing, is a big issue for both data warehousing and data mining • Descriptive data summarization is needed for quality data preprocessing • Data preparation includes data cleaning and data integration, data transformation and reduction, and discretization • Many methods have been developed, but data preprocessing is still an active area of research