CD404- Introduction to Data Science
Data Collection strategies
Data Preprocessing
ETL (Extract, Transform, and Load)
Extract, Transform, and Load is the technique of
extracting records from source systems (on-premises,
external, etc.) into a staging area, then transforming
or reformatting them with the business rules needed
for operational use or data analysis, and finally
loading them into the destination database or
data warehouse.
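The three stages above can be sketched in a few lines of Python. The source rows, field names, and the in-memory list standing in for a warehouse table are all hypothetical, chosen only to illustrate the extract → transform → load flow:

```python
def extract():
    # Extract: pull raw records from a source system (hard-coded sample rows here)
    return [{"name": " alice ", "sales": "1200"},
            {"name": "BOB", "sales": "950"}]

def transform(rows):
    # Transform: clean and reformat the records to fit analysis needs
    return [{"name": r["name"].strip().title(), "sales": int(r["sales"])}
            for r in rows]

def load(rows, warehouse):
    # Load: append the cleaned rows into the destination store
    warehouse.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse)
# [{'name': 'Alice', 'sales': 1200}, {'name': 'Bob', 'sales': 950}]
```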
ETL v/s ELT
Types of Analytics
Descriptive Analytics – What happened?
Diagnostic Analytics – Why did it happen?
Predictive Analytics – What will happen?
Prescriptive Analytics – What should we do?
Data Collection
To analyze and make decisions about a business, its
sales, etc., data must first be collected. The collected data
helps in drawing conclusions about the performance of a
particular business.
Thus, data collection is essential for analyzing the performance
of a business unit, solving a problem, and making
assumptions about specific things when required.
Data Science Process Model
Frame the problem – Objective to be identified
Collect the raw data needed for your problem
Process the data for analysis -EDA
Data Visualisation
Dimensionality Reduction
Model Building
Definition: In statistics, data collection is the process of
gathering information from all relevant sources to find a
solution to the research problem. Most organizations
use data collection methods to make assumptions about
future probabilities and trends.
Primary Data Collection methods
Secondary Data Collection methods
Primary data, or raw data, is information obtained directly
from a first-hand source through experiments, surveys, or
observations.
Quantitative Data Collection Methods
It is based on mathematical calculations using various formats and
statistical methods, such as measures of mean, median, or mode.
Qualitative Data Collection Methods
It does not involve any mathematical calculations and is
closely associated with elements that are not quantifiable.
Qualitative data collection methods include interviews,
questionnaires, observations, case studies, etc.
Secondary data is data collected by someone other than the
actual user. The information is already available and has been
analysed by someone else. Secondary data sources include
magazines, newspapers, books, journals, etc. It may be either
published or unpublished data.
Published data are available in various resources including
Government publications
Public records
Historical and statistical documents
Business documents
Technical and trade journals
Data Repositories
Unpublished Data: a raw copy before publication
Outline
• Why data preprocessing?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary
Why Data Preprocessing?
• Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain attributes
of interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
– Data warehouse needs consistent integration of quality data
• A multi-dimensional measure of data quality:
– A well-accepted multi-dimensional view:
• accuracy, completeness, consistency,
timeliness, believability, value added,
interpretability, accessibility
– Broad categories:
• intrinsic, contextual, representational, and
accessibility.
Dirty Data
• incomplete
• noisy
• inconsistent
• No Quality Data
Multidimensional measure of data quality
• Accuracy
• Completeness
• Consistency
• Timeliness
• Reliability
• Accessibility
• Interpretability
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, files, or notes
• Data transformation
– Normalization (scaling to a specific range)
– Aggregation
• Data reduction
– Obtains a reduced representation of the data that is much
smaller in volume, yet produces the same or similar
analytical results
– Data discretization: of particular importance,
especially for numerical data
– Data aggregation, dimensionality reduction,
data compression, generalization
Forms of data preprocessing: data cleaning or
transformation
(diagrammatic representation on next slide)
For example, the values 2, 32, 100 (single-digit/2-digit/3-digit)
transformed onto a 0-to-1 scale become 0.02, 0.32, 1.0
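The example above divides each value by the maximum of the set, which maps 2, 32, 100 onto the 0-to-1 scale:

```python
values = [2, 32, 100]
# scale each value by the maximum, mapping the set onto (0, 1]
scaled = [v / max(values) for v in values]
print(scaled)  # [0.02, 0.32, 1.0]
```

Min-max normalization, (v − min)/(max − min), is the other common variant; it would map the smallest value to exactly 0.0 instead.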
Outline
• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary
Data Cleaning
• Data cleaning tasks
– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
Missing Data
• Data is not always available
– E.g., many tuples have no recorded value for several attributes, such
as customer income in sales data
• Missing data may be due to
– equipment malfunction
– data inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data not considered important at the time of entry
– history or changes of the data not registered
• Missing data may need to be inferred
Missing data
• Non-availability of data
• Equipment malfunctioning
• Inconsistent data, thus deleted
• Data not entered
• Certain data may not be important at the time of entry
How to handle missing data?
• Manual entry
• Attribute mean
• Standardization
• Normalization
DataFrame: an object useful for representing data in the form of
rows and columns.
Once data is stored in a DataFrame, we can perform operations to
analyse and understand it.
import pandas as pd
df = pd.read_excel(path, sheet_name="Sheet1")
df
Sample Dataset
import numpy as np
countrydata = [
    ['India', 38.0, 68000.0],
    ['France', 43.0, 45000.0],
    ['Germany', 30.0, 54000.0],
    ['France', 48.0, np.nan],
]
# countrydata may be a list, tuple, or dictionary
df = pd.DataFrame(countrydata,
                  columns=["country", "no_states", "area"])
# importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# importing the dataset
data_set = pd.read_csv('Dataset.csv')

# viewing the DataFrame by position index
x = data_set.iloc[:, 0:2]

# using column names
y = data_set.loc[:, ['country', 'area']]
Input data:
'India', 38.0, 68000.0
'France', 43.0, 45000.0
'Germany', 30.0, 54000.0
'France', 48.0, NaN

Resulting DataFrame:
   country  no_states     area
0    India       38.0  68000.0
1   France       43.0  45000.0
2  Germany       30.0  54000.0
3   France       48.0      NaN
Operations
df.shape                 → (rows, columns)
df.head(), df.head(2)    → first rows (default: first 5)
df.tail(), df.tail(4)    → last rows (default: last 5)
df[2:5], df[0::2]        → row slices (start, stop, step values)
df.columns               → Index of column names
df.empid or df['empid']  → select a column (pass a list of names for several)
df['area'].min()
df['area'].max()
df.describe()            → count, mean, std, min, 25%, 50%, 75%, max of all
columns
df1 = df.sort_values('country')
Variance measures variability from the average or
mean. It is calculated by taking the difference
between each number in the data set and the mean,
squaring the differences to make them positive,
and dividing the sum of the squares by the
number of values in the data set.
Standard deviation is the square root of the
variance.
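The definition above translates directly into code; the eight sample values below are hypothetical:

```python
data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = sum(data) / len(data)            # 40 / 8 = 5.0
# average of squared differences from the mean (population variance)
variance = sum((x - mean) ** 2 for x in data) / len(data)
std_dev = variance ** 0.5               # square root of the variance
print(mean, variance, std_dev)          # 5.0 4.0 2.0
```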
Missing data handling
df1 = df.fillna(0)
df1 = df.fillna({'columnname': value})
df1 = df.dropna()
df.isnull().sum()     # shows the count of missing values per column
df['column'].mean()
df['column'].fillna(df['column'].mean(), inplace=True)
df['column'].fillna(df['column'].mode()[0], inplace=True)
df['column'].fillna(df['column'].median(), inplace=True)
df.isnull().sum()     # now zero
How to Handle Noisy Data?
• Binning method:
– first sort data and partition into (equi-depth) bins
– then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
– used also for discretization
• Clustering
– detect and remove outliers
• Semi-automated method: combined computer and
human inspection
– detect suspicious values and check manually
• Regression
– smooth by fitting the data into regression functions
Simple Discretization Methods: Binning
• Equal-width (distance) partitioning:
– It divides the range into N intervals of equal size: uniform
grid
– if A and B are the lowest and highest values of the attribute,
the width of intervals will be: W = (B-A)/N.
– The most straightforward
– But outliers may dominate presentation
– Skewed data is not handled well.
• Equal-depth (frequency) partitioning:
– It divides the range into N intervals, each containing
approximately the same number of samples
– Good data scaling
– Managing categorical attributes can be tricky.
• Sorted data for price (in dollars): 4, 8, 9, 15,
21, 21, 24, 25, 26, 28, 29, 34
• Equal-width partitioning with number of bins = 3
• W = (34 − 4)/3 = 10
• Bin 1: 4 to 4+10 = 4..14 → [4, 8, 9]
• Bin 2: 15 to 15+10 = 15..25 → [15, 21, 21, 24, 25]
• Bin 3: 26 to 26+10 = 26..36 → [26, 28, 29, 34]
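The bin assignment above can be reproduced in a short sketch using the slide's own ranges (4–14, 15–25, 26–36):

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
width = (max(prices) - min(prices)) // 3        # (34 - 4) / 3 = 10
# the slide's bin ranges: each bin spans `width` values after its start
bins = {"4-14": [], "15-25": [], "26-36": []}
for p in prices:
    if p <= 14:
        bins["4-14"].append(p)
    elif p <= 25:
        bins["15-25"].append(p)
    else:
        bins["26-36"].append(p)
print(bins)
```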
Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25,
26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
• Smoothing by bin median:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries: (closest boundary)
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
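The mean and boundary smoothing variants above can be checked with a short sketch over equi-depth bins of size 4:

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted
depth = 4
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# smoothing by bin means (rounded to the nearest integer, as on the slide)
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# smoothing by bin boundaries: replace each value with the closer of min/max
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```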
Question
• Data: 11, 13, 13, 15, 15, 16, 19, 20, 20, 20, 21, 21, 22,
23, 24, 30, 40, 45, 45, 45, 71, 72, 73, 75
• Data: 5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215
• a) Smoothing by bin mean
• b) Smoothing by bin median
• c) Smoothing by bin boundaries
• Perform equal-width/equal-depth binning
• For both methods, the best way to determine k is
to look at the histogram and try different
intervals or groups.
Discretization
– reduce the number of values for a given continuous
attribute by dividing the range of the attribute into
intervals. Interval labels can then be used to replace
actual data values
Histograms
• Approximate data distributions: the frequency
distribution of continuous values
• Divide data into buckets
• A bucket represents an attribute-value/frequency
pair: the range of values is the bin, and the height
of the bar represents the frequency of data points
in the bin
[Histogram figure: price buckets from 10000 to 90000 on
the x-axis, frequencies 0 to 40 on the y-axis]
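Bucket counts can also be computed without plotting, e.g. with numpy.histogram; the ten price values below are hypothetical:

```python
import numpy as np

prices = [12000, 18000, 25000, 31000, 47000,
          52000, 58000, 63000, 71000, 88000]
# 4 equal-width buckets between min and max
counts, edges = np.histogram(prices, bins=4)
print(counts)   # one frequency per bucket
print(edges)    # 5 bucket boundaries from min to max
```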
import numpy as np
from sklearn.datasets import load_iris

# load the iris data set
dataset = load_iris()
a = dataset.data
b = np.zeros(150)

# take the 2nd column (index 1) among the 4 columns of the data set
for i in range(150):
    b[i] = a[i, 1]
b = np.sort(b)  # sort the array

# create bins (30 bins of depth 5)
bin1 = np.zeros((30, 5))
bin2 = np.zeros((30, 5))
bin3 = np.zeros((30, 5))

# smoothing by bin mean
for i in range(0, 150, 5):
    k = int(i / 5)
    mean = (b[i] + b[i+1] + b[i+2] + b[i+3] + b[i+4]) / 5
    for j in range(5):
        bin1[k, j] = mean
print("Bin Mean:\n", bin1)
Cluster Analysis
Select a seed point randomly.
Calculate the distance of each point from the seed (called the
centroid) and form clusters by minimum distance.
Check the density and select new centroids.
Form new clusters until optimality is reached.
Outlier points will be separated.
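One round of the procedure above can be sketched with plain numpy; the 1-D data set and the hand-picked seed centroids are hypothetical:

```python
import numpy as np

data = np.array([1.0, 1.2, 0.9, 5.0, 5.1, 4.8, 40.0])  # 40.0 lies far away
centroids = np.array([1.0, 5.0])                        # seed points for the sketch

# assign each point to its nearest centroid (minimum distance)
labels = np.argmin(np.abs(data[:, None] - centroids[None, :]), axis=1)

# distance of each point from its assigned centroid
dist = np.abs(data - centroids[labels])

# points far from every centroid are separated as outliers
outliers = data[dist > 3 * dist.mean()]
print(outliers)  # [40.]
```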
Clustering
• Partition data set into clusters, and store cluster representation only
• Quality of clusters measured by their diameter (max distance
between any two objects in the cluster) or centroid distance (avg.
distance of each cluster object from its centroid)
• Can be very effective if data is clustered but not if data is “smeared”
• Can have hierarchical clustering (possibly stored in multi-
dimensional index tree structures (B+-tree, R-tree, quad-tree, etc))
• There are many choices of clustering definitions and clustering
algorithms
Outlier Treatment
Q1 = df['area'].quantile(0.05)
Q2 = df['area'].quantile(0.95)
df['area'] = np.where(df['area'] < Q1, Q1, df['area'])
df['area'] = np.where(df['area'] > Q2, Q2, df['area'])
Univariate outliers can be found by looking at the
distribution of values in a single feature space.
Multivariate outliers can be found in an n-
dimensional space (of n features).
Point outliers are single data points that lie far from
the rest of the distribution.
Contextual outliers can be noise in data, such as
punctuation symbols when performing text analysis.
Collective outliers can be subsets of novelties in
data.
[1,35,20,32,40,46,45,4500]
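Applying the 5th/95th-percentile capping shown earlier to this sample list (using pandas `clip` in place of the two `np.where` calls):

```python
import pandas as pd

s = pd.Series([1, 35, 20, 32, 40, 46, 45, 4500])
low, high = s.quantile(0.05), s.quantile(0.95)
# values outside [low, high] are pulled in to the boundaries
capped = s.clip(lower=low, upper=high)
print(capped.max())   # far below the original outlier 4500
```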
Regression
[Figure: scatter plot of (x, y) points with the fitted line
y = x + 1; for an observed X1, Y1 is the actual value and
Y1' the value predicted by the line]
• Linear regression (best line to fit
two variables)
• Multiple linear regression (more
than two variables, fit to a
multidimensional surface)
Regression and Log-Linear Models
• Linear regression: Data are modeled to fit a straight
line:
– Often uses the least-square method to fit the line
• Multiple regression: allows a response variable y to
be modeled as a linear function of multidimensional
feature vector (predictor variables)
• Log-linear model: approximates discrete
multidimensional joint probability distributions
Regression Analysis and Log-Linear Models
• Linear regression: Y = α + β X
– The two parameters α and β specify the line and are to be
estimated by using the data at hand,
– using the least-squares criterion on the known values of
Y1, Y2, …, X1, X2, ….
• Multiple regression: Y = b0 + b1 X1 + b2 X2.
– Many nonlinear functions can be transformed into the above.
• Log-linear models:
– The multi-way table of joint probabilities is approximated by
a product of lower-order tables.
– Probability: p(a, b, c, d) = α_ab β_ac χ_ad δ_bcd
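The least-squares estimates for Y = α + βX can be obtained with numpy; the four sample points below are hypothetical, chosen to lie roughly on y = 1 + 2x:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])

# np.polyfit with degree 1 fits a line under the least-squares criterion,
# returning [slope, intercept]
beta, alpha = np.polyfit(x, y, 1)
print(alpha, beta)   # intercept ≈ 1.15, slope ≈ 1.94
```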
Summary
• Data preparation is a big issue for both warehousing
and mining
• Data preparation includes
– Data cleaning and data integration
– Data reduction and feature selection
– Discretization
• Many methods have been developed, but this is still
an active area of research
Numericals
1. Calculate the variance and standard deviation for the
following data:
x: 2, 4, 6, 8, 10
f: 3, 5, 9, 5, 3
Ans: mean 6, variance 5.44, std. dev. 2.33
2. Marks obtained by 5 students are 15, 18, 12, 19 and 11.
Calculate the standard deviation and variance.
3. Calculate the median of: 6, 2, 7, 9, 4, 1
4. Calculate the median of: 89, 65, 11, 54, 11, 90, 56, 34
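Problem 1 can be verified in a few lines with a frequency-weighted mean and population variance:

```python
x = [2, 4, 6, 8, 10]
f = [3, 5, 9, 5, 3]

n = sum(f)                                              # 25 observations
mean = sum(xi * fi for xi, fi in zip(x, f)) / n         # 150 / 25 = 6.0
# frequency-weighted squared deviations, averaged over n
var = sum(fi * (xi - mean) ** 2 for xi, fi in zip(x, f)) / n
std = var ** 0.5
print(mean, var, round(std, 2))  # 6.0 5.44 2.33
```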
References
Data Preprocessing in Data Mining
Salvador García, Julián Luengo, Francisco Herrera (Springer)
MCQs
To remove noise and inconsistent data ____ is needed.
(a)Data Cleaning
(b)Data Transformation
(c)Data Reduction
(d)Data Integration
Combining multiple data sources is called _____
(a)Data Reduction
(b)Data Cleaning
(c)Data Integration
(d)Data Transformation
A _____ is a collection of tables, each assigned a
unique name, and is modeled with the entity-relationship
(ER) data model.
(a)Relational database
(b)Transactional database
(c)Data Warehouse
(d)Spatial database
_____ studies the collection, analysis, interpretation or
explanation, and presentation of data.
(a)Statistics
(b)Visualization
(c)Data Mining
(d)Clustering
_____ investigates how computers can learn (or improve
their performance) based on data.
(a)Machine Learning
(b)Artificial Intelligence
(c)Statistics
(d)Visualization
_____ is the science of searching for documents or
information in documents.
(a)Data Mining
(b)Information Retrieval
(c)Text Mining
(d)Web Mining
Data often contain _____
(a)Target Class
(b)Uncertainty
(c)Methods
(d)Keywords
In the real-world multidimensional view of data mining, the
major dimensions are data, knowledge, technologies, and
_____
(a)Methods
(b)Applications
(c)Tools
(d)Files
An _____ is a data field, representing a characteristic or
feature of a data object.
(a)Method
(b)Variable
(c)Task
(d)Attribute
The values of a _____ attribute are symbols or names of
things.
(a)Ordinal
(b)Nominal
(c)Ratio
(d)Interval
“Data about data” is referred to as _____
(a)Information
(b)Database
(c)Metadata
(d)File
______ partitions the objects into different groups.
(a)Mapping
(b)Clustering
(c)Classification
(d)Prediction
In _____, the attribute data are scaled so as to fall
within a smaller range, such as -1.0 to 1.0, or 0.0 to 1.0.
(a)Aggregation
(b)Binning
(c)Clustering
(d)Normalization
Normalization by ______ normalizes by moving the
decimal point of values of attributes.
(a)Z-Score
(b)Z-Index
(c)Decimal Scaling
(d)Min-Max Normalization
_____ is used to transform raw data into a useful and
efficient format.
a)Data Preparation
(b)Data Transformation
(c)Clustering
(d)Normalization
_______ is a top-down splitting technique based on a
specified number of bins.
(a)Normalization
(b)Binning
(c)Clustering
(d)Classification
Cluster Is
(a) A cluster is a subset of similar objects
(b) A subset of objects such that the distance between any
two objects in the cluster is less than the distance
between any object in the cluster and any object not
located inside it.
(c) A connected region of a multidimensional space with a
comparatively high density of objects.
(d) All of these
Data Preprocessing
Preprocessing in Data Mining: Data preprocessing is a
data mining technique used to transform raw data into a
useful and efficient format.
1. Data Cleaning:
The data can have many irrelevant and missing parts. To
handle this, data cleaning is done. It involves handling
missing data, noisy data, etc.
(a). Missing Data:
This situation arises when some values are missing from the
data. It can be handled in various ways.
Some of them are:
Ignore the tuples:
This approach is suitable only when the dataset we have is
quite large and multiple values are missing within a tuple.
Fill the Missing values:
There are various ways to do this task. You can choose to fill
the missing values manually, by attribute mean or the most
probable value.
(b). Noisy Data:
Noisy data is meaningless data that cannot be interpreted by
machines. It can be generated due to faulty data collection, data
entry errors, etc. It can be handled in the following ways:
Binning Method:
This method works on sorted data in order to smooth it. The
whole data is divided into segments of equal size, and then
various methods are performed to complete the task. Each
segment is handled separately. One can replace all data in a
segment by its mean, or boundary values can be used to complete
the task.
Regression:
Here data can be made smooth by fitting it to a regression function.
The regression used may be linear (having one independent variable) or
multiple (having multiple independent variables).
Clustering:
This approach groups similar data into clusters. Outliers may go
undetected, or they will fall outside the clusters.
2. Data Transformation:
This step is taken in order to transform the data in appropriate forms
suitable for mining process. This involves following ways:
Normalization:
It is done in order to scale the data values in a specified range (-1.0 to
1.0 or 0.0 to 1.0)
Attribute Selection:
In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.
Discretization:
This is done to replace the raw values of numeric attribute by interval
levels or conceptual levels.
Concept Hierarchy Generation:
Here attributes are converted from lower level to higher level in
hierarchy. For Example-The attribute “city” can be converted to
“country”.
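A concept-hierarchy step can be sketched as a simple mapping; the city-to-country table below is hypothetical:

```python
# hypothetical lower-level ("city") to higher-level ("country") mapping
city_to_country = {"Mumbai": "India", "Pune": "India", "Paris": "France"}

cities = ["Mumbai", "Paris", "Pune"]
# raise each attribute value one level in the hierarchy
countries = [city_to_country[c] for c in cities]
print(countries)  # ['India', 'France', 'India']
```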
3. Data Reduction:
Data mining handles huge amounts of data, and analysis
becomes harder when working with such volumes. To address
this, we use data reduction techniques, which aim to increase
storage efficiency and reduce data storage and analysis costs.
The various steps to data reduction are:
Data Cube Aggregation:
Aggregation operation is applied to data for the construction of the
data cube.
Attribute Subset Selection:
Only the highly relevant attributes should be used; the rest can
be discarded. For performing attribute selection, one can use the
level of significance and the p-value of the attribute: an attribute
having a p-value greater than the significance level can be discarded.
Numerosity Reduction:
This enables storing a model of the data instead of the whole data,
for example: regression models.
Dimensionality Reduction:
This reduces the size of the data through encoding mechanisms.
It can be lossy or lossless. If the original data can be retrieved
after reconstruction from the compressed data, the reduction is
called lossless; otherwise it is lossy. Two effective methods of
dimensionality reduction are wavelet transforms and PCA
(Principal Component Analysis).
Wavelet Transforms
The general procedure for applying a discrete wavelet
transform uses a hierarchical pyramid algorithm that halves
the data at each iteration, resulting in fast computational
speed. The method is as follows:
Take an input data vector of length L (an integer power of 2).
Two functions, a sum or weighted average and a weighted
difference, are applied to pairs of input data, resulting in two
sets of data of length L/2.
The two functions are recursively applied to the sets of data
obtained in the previous loop, until the resulting data sets
are of length 2.
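The pyramid procedure above can be sketched for the Haar case; the length-8 input vector is a hypothetical example, and, following the slide, the recursion stops when the average set has length 2:

```python
def haar_step(v):
    # one pyramid iteration: pairwise averages and pairwise differences,
    # each of length len(v) / 2
    avgs = [(v[i] + v[i + 1]) / 2 for i in range(0, len(v), 2)]
    diffs = [(v[i] - v[i + 1]) / 2 for i in range(0, len(v), 2)]
    return avgs, diffs

v = [2, 2, 0, 2, 3, 5, 4, 4]   # length L = 8, an integer power of 2
detail = []
while len(v) > 2:              # recurse until the data set has length 2
    v, d = haar_step(v)
    detail = d + detail        # keep detail coefficients, finest last

coeffs = v + detail
print(coeffs)  # [1.5, 4.0, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```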
Sampling
Sampling can be used as a data reduction technique since it allows a large
data set to be represented by a much smaller random sample (or subset) of
the data. Suppose a large data set D contains N tuples; some of the possible
samples of D are:
• Simple random sample without replacement of size n: created by
drawing n of the N tuples from D (n < N), where the probability of drawing
any tuple in D is 1/N, that is, all tuples are equally likely.
• Simple random sample with replacement of size n: similar to the
above, except that each time a tuple is drawn from D, it is recorded and
then replaced. That is, after a tuple is drawn it is placed back in D, so it
may be drawn again.
• Cluster sample: if the tuples in D are grouped into M mutually
disjoint "clusters", then a simple random sample of m clusters can be
obtained, where m < M.
• Stratified sample: if D is divided into mutually disjoint parts called strata,
a stratified random sample is obtained by taking a simple random sample
at each stratum.
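The four sampling schemes can be sketched with the standard library; the data set D, the cluster grouping, and the two-strata split below are all hypothetical:

```python
import random

random.seed(0)                       # reproducible sketch
D = list(range(1, 101))              # a data set of N = 100 "tuples"

# simple random sample without replacement (SRSWOR), n = 10
srswor = random.sample(D, 10)

# simple random sample with replacement (SRSWR): a drawn tuple goes back in
srswr = [random.choice(D) for _ in range(10)]

# cluster sample: group D into M = 10 disjoint clusters, sample m = 2 of them
clusters = [D[i:i + 10] for i in range(0, 100, 10)]
cluster_sample = random.sample(clusters, 2)

# stratified sample: split D into disjoint strata, then SRS within each stratum
strata = [D[:50], D[50:]]
stratified = [x for s in strata for x in random.sample(s, 5)]
```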
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 

Preprocessing_new.ppt

  • 1. CD404- Introduction to Data Science Data Collection strategies Data Preprocessing
  • 2. ETL (Extract, Transform, and Load) Extract, Transform, and Load is the technique of extracting records from source systems (on-premises, external, etc.) into a staging area, transforming or reformatting them with the required business rules so that they fit operational needs or data analysis, and finally loading them into the destination database or data warehouse.
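The three ETL stages can be sketched in a few lines of pandas. This is a minimal illustration, not a production pipeline: the source records, the column names (`order_id`, `amount`, `region`), the 150 threshold, and the in-memory `warehouse` dictionary are all made-up placeholders; a real pipeline would load into a database or data warehouse.

```python
# A minimal ETL sketch using pandas; source data and target are hypothetical.
import io
import pandas as pd

# Extract: read raw records from a source (here, an in-memory CSV).
raw = io.StringIO("order_id,amount,region\n1,100,north\n2,250,south\n3,80,north\n")
staging = pd.read_csv(raw)

# Transform: apply a business rule (flag high-value orders)
# and reformat to fit the target schema.
staging["high_value"] = staging["amount"] > 150
summary = staging.groupby("region", as_index=False)["amount"].sum()

# Load: write the transformed data to the destination
# (here an in-memory dict standing in for a warehouse).
warehouse = {"orders": staging, "sales_by_region": summary}
print(warehouse["sales_by_region"])
```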
  • 4. Types of Analytics Descriptive Analytics – What happened? Diagnostic Analytics – Why did it happen? Predictive Analytics – What will happen? Prescriptive Analytics – What should we do?
  • 5. Data Collection To analyze and make decisions about a business, its sales, etc., data must be collected. The collected data helps in drawing conclusions about the performance of a particular business. Thus, data collection is essential for analyzing the performance of a business unit, solving a problem, and making assumptions about specific things when required.
  • 6. Data Science Process Model Frame the problem – Objective to be identified Collect the raw data needed for your problem Process the data for analysis -EDA Data Visualisation Dimensionality Reduction Model Building
  • 7. Definition: In statistics, data collection is the process of gathering information from all the relevant sources to find a solution to the research problem. Most organizations use data collection methods to make assumptions about future probabilities and trends. Primary Data Collection methods Secondary Data Collection methods
  • 8. Primary data or raw data is a type of information that is obtained directly from the first-hand source through experiments, surveys or observations. Quantitative Data Collection Methods These are based on mathematical calculations, using various formats and statistical measures such as the mean, median or mode. Qualitative Data Collection Methods These do not involve any mathematical calculations. This method is closely associated with elements that are not quantifiable. Qualitative data collection methods include interviews, questionnaires, observations, case studies, etc.
  • 9. Secondary data is data collected by someone other than the actual user; the information is already available and has been analysed by someone else. Secondary data includes magazines, newspapers, books, journals, etc. It may be either published or unpublished data. Published data are available in various resources including Government publications Public records Historical and statistical documents Business documents Technical and trade journals Data repositories Unpublished data: the raw copy before publication
  • 10. Outline • Why data preprocessing? • Data cleaning • Data integration and transformation • Data reduction • Discretization and concept hierarchy generation • Summary
  • 11. Why Data Preprocessing? • Data in the real world is dirty – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data – noisy: containing errors or outliers – inconsistent: containing discrepancies in codes or names • No quality data, no quality mining results! – Quality decisions must be based on quality data – Data warehouse needs consistent integration of quality data
  • 12. • A multi-dimensional measure of data quality: – A well-accepted multi-dimensional view: • accuracy, completeness, consistency, timeliness, believability, value added, interpretability, accessibility – Broad categories: • intrinsic, contextual, representational, and accessibility.
  • 13. Dirty Data • incomplete • noisy • inconsistent → no quality data Multidimensional measures of data quality:  Accuracy  Completeness  Consistency  Timeliness  Reliability  Accessibility  Interpretability
  • 14. Major Tasks in Data Preprocessing • Data cleaning – Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies • Data integration – Integration of multiple databases, data cubes, files, or notes • Data transformation – Normalization (scaling to a specific range) – Aggregation
  • 15. • Data reduction – Obtains a reduced representation in volume but produces the same or similar analytical results – Data discretization: of particular importance, especially for numerical data – Data aggregation, dimensionality reduction, data compression, generalization Forms of data preprocessing: data cleaning or transformation, diagrammatic representation on the next slide For example, -2, 32, 100 (one-, two- and three-digit values) can be scaled into the -1 to 1 range by decimal scaling: -0.02, 0.32, 1.0
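The decimal-scaling transformation mentioned above can be written out directly: divide every value by 10^j, where j is the smallest integer such that the largest absolute value becomes at most 1. A small sketch on the slide's example values:

```python
# Decimal scaling of the example values -2, 32, 100: divide by 10^j,
# where j is the smallest integer with max(|v|) / 10^j <= 1.
values = [-2, 32, 100]
j = 0
while max(abs(v) for v in values) / (10 ** j) > 1:
    j += 1
scaled = [v / (10 ** j) for v in values]
print(scaled)  # [-0.02, 0.32, 1.0]
```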
  • 16.
  • 17. Outline • Why preprocess the data? • Data cleaning • Data integration and transformation • Data reduction • Discretization and concept hierarchy generation • Summary
  • 18. Data Cleaning • Data cleaning tasks – Fill in missing values – Identify outliers and smooth out noisy data – Correct inconsistent data
  • 19. Missing Data • Data is not always available – E.g., many tuples have no recorded value for several attributes, such as customer income in sales data • Missing data may be due to – equipment malfunction – being inconsistent with other recorded data and thus deleted – data not entered due to misunderstanding – certain data not being considered important at the time of entry – the history or changes of the data not being registered • Missing data may need to be inferred
  • 20. Missing data •Data not available •Equipment malfunctioning •Inconsistent, thus deleted •Data not entered •Certain data may not be important at the time of entry How to handle missing data? •Manual entry •Attribute mean •Standardization •Normalization
  • 21. DataFrame: an object useful for representing data in the form of rows and columns. Once data is stored in a DataFrame, we can perform operations to analyse and understand it. import pandas as pd df = pd.read_excel(path, sheet_name='Sheet1') df
  • 22. Sample Dataset countrydata = [['India', 38.0, 68000.0], ['France', 43.0, 45000.0], ['Germany', 30.0, 54000.0], ['France', 48.0, np.nan]] (a list; a tuple or dictionary also works) df = pd.DataFrame(countrydata, columns=['country', 'no_states', 'area'])
  • 23. # importing libraries import numpy as np import matplotlib.pyplot as plt import pandas as pd # importing the dataset data_set = pd.read_csv('Dataset.csv') # Viewing the DataFrame, by position index x = data_set.iloc[:, 0:2] # Using column names y = data_set.loc[:, ['country', 'area']] Resulting DataFrame: country no_states area 0 India 38.0 68000.0 1 France 43.0 45000.0 2 Germany 30.0 54000.0 3 France 48.0 NaN
  • 24. Operations df.shape (rows, columns) df.head(), df.head(2) first 5 rows by default df.tail(), df.tail(4) last 5 rows by default df[2:5], df[0::2] initial, final, step values for rows df.columns Index([...]) column names df.empid or df['empid'] a list of columns can also be passed
  • 26. Variance measures variability from the average or mean. It is calculated by taking the differences between each number in the data set and the mean, squaring the differences to make them positive, and finally dividing the sum of the squares by the number of values in the data set. Standard deviation is the square root of the variance.
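The definition above translates directly into code. As a sketch, here is the population variance and standard deviation for the marks example that appears later in the deck (15, 18, 12, 19, 11):

```python
# Population variance and standard deviation, computed exactly as defined:
# mean, squared deviations from the mean, then their average.
import math

marks = [15, 18, 12, 19, 11]
mean = sum(marks) / len(marks)
variance = sum((x - mean) ** 2 for x in marks) / len(marks)
std_dev = math.sqrt(variance)
print(mean, variance, std_dev)  # 15.0 10.0 ~3.162
```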
  • 28. df.isnull().sum() # count of missing values per column df['column'].mean() df['column'].fillna(df['column'].mean(), inplace=True) df['column'].fillna(df['column'].mode()[0], inplace=True) df['column'].fillna(df['column'].median(), inplace=True) df.isnull().sum() # ==> zero after filling
  • 29. How to Handle Noisy Data? • Binning method: – first sort data and partition into (equi-depth) bins – then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc. – used also for discretization • Clustering – detect and remove outliers • Semi-automated method: combined computer and human inspection – detect suspicious values and check manually • Regression – smooth by fitting the data into regression functions
  • 30. Simple Discretization Methods: Binning • Equal-width (distance) partitioning: – It divides the range into N intervals of equal size: uniform grid – if A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B-A)/N. – The most straightforward – But outliers may dominate presentation – Skewed data is not handled well. • Equal-depth (frequency) partitioning: – It divides the range into N intervals, each containing approximately same number of samples – Good data scaling – Managing categorical attributes can be tricky.
  • 31. • Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 • Equal-width, number of bins: 3 • W = (34-4)/3 = 10 • Bin 1: 4..4+10 = 4..14 → [4, 8, 9] • Bin 2: 15..15+10 = 15..25 → [15, 21, 21, 24, 25] • Bin 3: 26..26+10 = 26..36 → [26, 28, 29, 34]
  • 32. Binning Methods for Data Smoothing * Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 * Partition into (equi-depth) bins: - Bin 1: 4, 8, 9, 15 - Bin 2: 21, 21, 24, 25 - Bin 3: 26, 28, 29, 34 * Smoothing by bin means: - Bin 1: 9, 9, 9, 9 - Bin 2: 23, 23, 23, 23 - Bin 3: 29, 29, 29, 29
  • 33. • Smoothing by bin median: - Bin 1: 9, 9, 9, 9 - Bin 2: 23, 23, 23, 23 - Bin 3: 29, 29, 29, 29 * Smoothing by bin boundaries: (closest boundary) - Bin 1: 4, 4, 4, 15 - Bin 2: 21, 21, 25, 25 - Bin 3: 26, 26, 26, 34
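The two smoothing schemes on the slides above can be sketched as follows, using the same sorted price data and a bin depth of 4. Means are rounded to whole dollars, matching the slide's results; for boundary smoothing, each value moves to whichever bin boundary is closer.

```python
# Equal-depth binning (depth 4) and smoothing for the sorted price data.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
depth = 4
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means (rounded, as on the slide).
by_mean = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: each value moves to the closer boundary.
def by_boundary(b):
    lo, hi = b[0], b[-1]
    return [lo if v - lo <= hi - v else hi for v in b]

print(by_mean)                       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print([by_boundary(b) for b in bins])
```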
  • 34. Question • Data: 11, 13, 13, 15, 15, 16, 19, 20, 20, 20, 21, 21, 22, 23, 24, 30, 40, 45, 45, 45, 71, 72, 73, 75 • Data: 5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215 • a) Smoothing by bin mean • b) Smoothing by bin median • c) Smoothing by bin boundaries • Perform equal-width/equal-depth binning
  • 35. • For both methods, the best way of determining k is to look at the histogram and try different intervals or groups. Discretization – reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.
  • 36. Histograms • Approximate data distributions – frequency distribution of continuous values • Divide data into buckets • A bucket represents an attribute-value/frequency pair – a range of values is a bin – the height of a bar represents the frequency of data points in the bin (Histogram figure: price buckets 10000–90000 on the x-axis, frequency 0–40 on the y-axis.)
  • 37. import numpy as np from sklearn.datasets import load_iris # load the iris data set dataset = load_iris() a = dataset.data b = np.zeros(150)
  • 38. # take the 2nd column (index 1) of the data set for i in range(150): b[i] = a[i, 1] b = np.sort(b) # sort the array • # create bins • bin1 = np.zeros((30,5)) • bin2 = np.zeros((30,5)) • bin3 = np.zeros((30,5))
  • 39. # Bin mean for i in range(0, 150, 5): k = int(i/5) mean = (b[i] + b[i+1] + b[i+2] + b[i+3] + b[i+4])/5 for j in range(5): bin1[k, j] = mean print("Bin Mean: \n", bin1)
  • 41. Select a seed point randomly Calculate the distance of each point from the seed (called the centroid) and form clusters by minimum distance Check the density and select new centroids Formulate new clusters until optimality is reached Outlier points will be separated
  • 42. Clustering • Partition data set into clusters, and store cluster representation only • Quality of clusters measured by their diameter (max distance between any two objects in the cluster) or centroid distance (avg. distance of each cluster object from its centroid) • Can be very effective if data is clustered but not if data is “smeared” • Can have hierarchical clustering (possibly stored in multi- dimensional index tree structures (B+-tree, R-tree, quad-tree, etc)) • There are many choices of clustering definitions and clustering algorithms
  • 43. Outlier Treatment Q1 = df['area'].quantile(0.05) Q2 = df['area'].quantile(0.95) df['area'] = np.where(df['area'] < Q1, Q1, df['area']) df['area'] = np.where(df['area'] > Q2, Q2, df['area'])
  • 44. Univariate outliers can be found when looking at a distribution of values in a single feature space. Multivariate outliers can be found in an n-dimensional space (of n features). Point outliers are single data points that lie far from the rest of the distribution. Contextual outliers can be noise in data, such as punctuation symbols when performing text analysis. Collective outliers can be subsets of novelties in data [1, 35, 20, 32, 40, 46, 45, 4500]
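One common way to flag such points, sketched here on the slide's example data, is the 1.5 × IQR rule: values beyond Q1 − 1.5·IQR or Q3 + 1.5·IQR are treated as outliers. This is an assumption of one reasonable test, not the only one (z-scores or clustering, as on the earlier slides, also work).

```python
# Flagging outliers in the example [1, 35, 20, 32, 40, 46, 45, 4500]
# with the 1.5 * IQR rule.
import statistics

data = [1, 35, 20, 32, 40, 46, 45, 4500]
q1, _, q3 = statistics.quantiles(data, n=4)   # quartiles
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in data if v < lower or v > upper]
print(outliers)  # the extreme value 4500 is flagged
```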
  • 45. Regression x y y = x + 1 X1 Y1 Y1’ •Linear regression (best line to fit two variables) •Multiple linear regression (more than two variables, fit to a multidimensional surface
  • 46. Regression and Log-Linear Models • Linear regression: Data are modeled to fit a straight line: – Often uses the least-square method to fit the line • Multiple regression: allows a response variable y to be modeled as a linear function of multidimensional feature vector (predictor variables) • Log-linear model: approximates discrete multidimensional joint probability distributions
  • 47. Regression Analysis and Log-Linear Models • Linear regression: Y = α + β X – Two parameters, α and β, specify the line and are to be estimated by using the data at hand – using the least-squares criterion on the known values Y1, Y2, ..., X1, X2, .... • Multiple regression: Y = b0 + b1 X1 + b2 X2 – Many nonlinear functions can be transformed into the above. • Log-linear models: – The multi-way table of joint probabilities is approximated by a product of lower-order tables – Probability: p(a, b, c, d) ≈ αab βac χad δbcd
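The least-squares estimates of α and β have a simple closed form (β = Σ(x−x̄)(y−ȳ) / Σ(x−x̄)², α = ȳ − βx̄). A sketch on synthetic, noiseless data generated from y = 1 + 2x, so the fitted line recovers the true parameters exactly:

```python
# Closed-form least-squares fit of Y = alpha + beta * X.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 1.0 + 2.0 * x  # noiseless example, so the fit is exact

# beta = covariance(x, y) / variance(x); alpha from the means.
beta = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha = y.mean() - beta * x.mean()
print(alpha, beta)  # 1.0 2.0
```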
  • 48. Summary • Data preparation is a big issue for both warehousing and mining • Data preparation includes – Data cleaning and data integration – Data reduction and feature selection – Discretization • A lot of methods have been developed, but this is still an active area of research
  • 49. Numericals 1. Calculate the variance and standard deviation for the following data: x: 2, 4, 6, 8, 10 f: 3, 5, 9, 5, 3 Ans: mean 6, var 5.44, std dev 2.33 2. Marks obtained by 5 students are 15, 18, 12, 19 and 11. Calculate the standard deviation and variance. 3. Calculate the median of 6, 2, 7, 9, 4, 1 and of 4, 89, 65, 11, 54, 11, 90, 56, 34
  • 50. References Data Preprocessing in Data Mining Salvador García, Julián Luengo, Francisco Herrera (Springer)
  • 51. MCQs To remove noise and inconsistent data ____ is needed. (a)Data Cleaning (b)Data Transformation (c)Data Reduction (d)Data Integration Multiple data sources may be combined is called as _____ (a)Data Reduction (b)Data Cleaning (c)Data Integration (d)Data Transformation
  • 52. A _____ is a collection of tables, each of which is assigned a unique name which uses the entity-relationship (ER) data model. (a)Relational database (b)Transactional database (c)Data Warehouse (d)Spatial database _____ studies the collection, analysis, interpretation or explanation, and presentation of data. (a)Statistics (b)Visualization (c)Data Mining (d)Clustering
  • 53. _____ investigates how computers can learn (or improve their performance) based on data. (a)Machine Learning (b)Artificial Intelligence (c)Statistics (d)Visualization _____ is the science of searching for documents or information in documents. (a)Data Mining (b)Information Retrieval (c)Text Mining (d)Web Mining Data often contain _____ (a)Target Class (b)Uncertainty (c)Methods (d)Keywords
  • 54. In real world multidimensional view of data mining, The major dimensions are data, knowledge, technologies, and _____ (a)Methods (b)Applications (c)Tools (d)Files An _____ is a data field, representing a characteristic or feature of a data object. (a)Method (b)Variable (c)Task (d)Attribute
  • 55. The values of a _____ attribute are symbols or names of things. (a)Ordinal (b)Nominal (c)Ratio (d)Interval “Data about data” is referred to as _____ (a)Information (b)Database (c)Metadata (d)File ______ partitions the objects into different groups. (a)Mapping (b)Clustering (c)Classification (d)Prediction
  • 56. In _____, the attribute data are scaled so as to fall within a smaller range, such as -1.0 to 1.0, or 0.0 to 1.0. (a)Aggregation (b)Binning (c)Clustering (d)Normalization Normalization by ______ normalizes by moving the decimal point of values of attributes. (a)Z-Score (b)Z-Index (c)Decimal Scaling (d)Min-Max Normalization Used to transform the raw data in a useful and efficient format. a)Data Preparation (b)Data Transformation (c)Clustering (d)Normalization
  • 57. _______ is a top-down splitting technique based on a specified number of bins. (a)Normalization (b)Binning (c)Clustering (d)Classification Cluster Is (a) A cluster is a subset of similar objects (b) A subset of objects such that the distance between any of the two objects in the cluster is less than the distance between any object in the cluster and any object that is not located inside it. (c) A connected region of a multidimensional space with a comparatively high density of objects. (d) All of these
  • 58. Data Preprocessing Preprocessing in Data Mining: Data preprocessing is a data mining technique which is used to transform the raw data in a useful and efficient format.
  • 59. 1. Data Cleaning: The data can have many irrelevant and missing parts. To handle this, data cleaning is done. It involves handling of missing data, noisy data, etc. (a) Missing Data: This situation arises when some values are missing from the data. It can be handled in various ways. Some of them are: Ignore the tuples: This approach is suitable only when the dataset is quite large and multiple values are missing within a tuple. Fill the missing values: There are various ways to do this task. You can choose to fill the missing values manually, by the attribute mean, or with the most probable value.
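The two strategies above (ignore the tuples vs. fill the missing values) look like this in pandas; the toy DataFrame and its column names (`age`, `city`) are made up for illustration.

```python
# Dropping vs. filling missing values on a toy DataFrame.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 35, 45],
                   "city": ["A", "B", None, "D"]})

dropped = df.dropna()  # ignore tuples that contain any missing value

filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())        # attribute mean
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])  # most probable value
print(len(dropped), filled["age"].tolist())
```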
  • 60. (b) Noisy Data: Noisy data is meaningless data that can't be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc. It can be handled in the following ways: Binning Method: This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size and then various methods are performed to complete the task. Each segment is handled separately. One can replace all data in a segment by its mean, or boundary values can be used to complete the task. Regression: Here data can be made smooth by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables). Clustering: This approach groups the similar data in a cluster. The outliers may go undetected or they will fall outside the clusters.
  • 61. 2. Data Transformation: This step is taken in order to transform the data in appropriate forms suitable for mining process. This involves following ways: Normalization: It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0 to 1.0) Attribute Selection: In this strategy, new attributes are constructed from the given set of attributes to help the mining process. Discretization: This is done to replace the raw values of numeric attribute by interval levels or conceptual levels. Concept Hierarchy Generation: Here attributes are converted from lower level to higher level in hierarchy. For Example-The attribute “city” can be converted to “country”.
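The normalization step above, scaling attribute values into a specified range, can be sketched for the two most common schemes: min-max scaling into [0, 1] and z-score standardization (mean 0, standard deviation 1). The values are a toy attribute chosen for illustration.

```python
# Min-max and z-score normalization of one numeric attribute.
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

min_max = (x - x.min()) / (x.max() - x.min())  # scaled into [0, 1]
z_score = (x - x.mean()) / x.std()             # mean 0, std 1
print(min_max.tolist())
```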
  • 62. 3. Data Reduction: Data mining is a technique used to handle huge amounts of data, and analysis becomes harder when working with such volumes. To get around this, we use data reduction techniques, which aim to increase storage efficiency and reduce data storage and analysis costs. The various steps of data reduction are: Data Cube Aggregation: An aggregation operation is applied to the data for the construction of the data cube. Attribute Subset Selection: Only the highly relevant attributes should be used; the rest can be discarded. For performing attribute selection, one can use the level of significance and the p-value of the attribute; attributes having a p-value greater than the significance level can be discarded.
  • 63. Numerosity Reduction: This enables storing a model of the data instead of the whole data, for example: regression models. Dimensionality Reduction: This reduces the size of data by encoding mechanisms. It can be lossy or lossless. If the original data can be retrieved after reconstruction from the compressed data, the reduction is called lossless; otherwise it is called lossy. Two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis).
  • 64. Wavelet Transforms The general procedure for applying a discrete wavelet transform uses a hierarchical pyramid algorithm that halves the data in each iteration, resulting in fast computational speed. The method is as follows: Take an input data vector of length L (an integer power of 2). The two functions – sum or weighted average, and weighted difference – are applied to pairs of input data, resulting in two sets of data of length L/2. The two functions are recursively applied to the data sets obtained in the previous loop, until the resulting data sets are of length 2.
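The pyramid algorithm above can be sketched with plain averages and half-differences standing in for the "sum/weighted average" and "weighted difference" functions (normalization factors are omitted, so this is an unnormalized Haar-style sketch, not a full DWT):

```python
# Pyramid step: replace each pair with its average (approximation) and
# half-difference (detail); recurse until the approximation has length 2.
def wavelet_pyramid(data):
    approx, details = list(data), []
    while len(approx) > 2:
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx), 2)]
        details.append([(a - b) / 2 for a, b in pairs])
        approx = [(a + b) / 2 for a, b in pairs]
    return approx, details

approx, details = wavelet_pyramid([2, 2, 0, 2, 3, 5, 4, 4])  # L = 8
print(approx)  # length-2 approximation after two halvings
```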
  • 65. Sampling Sampling can be used as a data reduction technique since it allows a large data set to be represented by a much smaller random sample (or subset) of the data. Suppose a large data set D contains N tuples; some of the possible samples of D are: • Simple random sample without replacement of size n: created by drawing n of the N tuples from D (n < N), where the probability of drawing any tuple in D is 1/N, that is, all tuples are equally likely. • Simple random sample with replacement of size n: similar to the above, except that each time a tuple is drawn from D, it is recorded and then replaced. That is, after a tuple is drawn, it is placed back in D so that it can be drawn again. • Cluster sample: if the tuples in D are grouped into M mutually disjoint "clusters", then a simple random sample of m clusters can be obtained, where m < M. • Stratified sample: if D is divided into mutually disjoint parts called strata, a stratified random sample is obtained by taking a simple random sample from each stratum.
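Three of the sampling schemes above can be sketched with the standard library on a toy "table" of 10 tuples; the even/odd split used as strata is an arbitrary illustration, and the seed is fixed only for reproducibility.

```python
# Simple random sampling with/without replacement, and a stratified sample.
import random

random.seed(42)
D = list(range(10))  # a toy data set of N = 10 tuples

srswor = random.sample(D, 4)                  # without replacement: 4 distinct tuples
srswr = [random.choice(D) for _ in range(4)]  # with replacement: repeats possible

# Stratified: two strata (even / odd tuples), simple random sample in each.
strata = [[t for t in D if t % 2 == 0], [t for t in D if t % 2 == 1]]
stratified = [random.choice(s) for s in strata]
print(len(srswor), len(srswr), len(stratified))
```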