SlideShare une entreprise Scribd logo
1  sur  35
Télécharger pour lire hors ligne
Exploratory
Data Analysis
.
01
.
02
.
03
04
Agenda
What is Exploratory Data Analysis?
Why EDA is important?
Visualisation
▪ Important Charts for visualisation
Steps Involved in EDA:-
▪ Data Sourcing
▪ Data Cleaning
▪ Univariate analysis with visualisation
▪ Bivariate analysis with visualisation
▪ Derived metrics
Uses Cases
05
Data Science Process
Exploratory
Data Analysis
.
❑ Exploratory Data Analysis is an approach to
analyse the datasets to summarize their
main characteristics in form of visual
methods.
❑ EDA is nothing but an data exploration
technique to understand various aspects of
the data.
❑ The main aim of EDA is to obtain confidence
in a data to an extent where we are ready to
engage a machine learning model.
What is
Exploratory
Data
Analysis?
Why EDA is
Important?
❖ EDA is important to analyse the data it’s a
first steps in data analysis process.
❖ EDA give a basic idea to understand the data
and make sense of the data to figure out the
question you need to ask and find out the
best way to manipulate the dataset to get
the answer of your question.
❖ Exploratory data analysis help us to finding
the errors, discovering data, mapping out
data structure, finding out anomalies.
❖ Exploratory data analysis is important for
business process because we are preparing
dataset for deep through analysis that will
detect you business problem.
❖ EDA help to build a quick and dirty model,
or a baseline model, which can serve as a
comparison against later models that you
will build.
Visualization
.
.
Easily analyse the data and summarize it.
Easily understand the features of the data.
Help to find the trend or pattern of the
data.
Help to get meaningful insights from the
data.
Visualisation is the presentation of the data in the graphical
or visual form to understand the data more clearly.
Visualisation is easy to understand the data.
Histogram represent the frequency distribution of
the data
Histogram
Bar graph represent the total observation in the data for a
particular category.
Bar Chart
Important Charts for Visualisation
Pie chart represent the percentage of the data by each
category.
Pie Chart
Boxplot display the distribution of the data based on five number
summary(minimum, first quartile, median, third quartile, maximum)
Box Plot
Scatter plot represent the relationship between two
numerical variable. It show the correlation between two
variable.
Scatter Plot
Pair plot show the bivariate distribution of the datasets. It show the
pairwise relationship between the variable
Pair-plot
Line chart are used to track change over line and short
period of time. Line chart are used in time series data.
Line Chart
Violin chart are used to plot numeric data
Violin Plot
Box Plot vs ViolinPlot
● Violin Graphs are better than Box Plots
● Violin graph is visually intuitive & attractive.
Image Courtesy: https://blog.bioturing.com/
A heat map is data analysis software that uses color
the way a bar graph uses height and width.
Heatmaps
Strip plots are used to draw a scatter plot for on the
categorical basis
Strip Plot
Steps Involved in EDA
01
03
05
02
04
.
Data Sourcing
Data Cleaning
Univariate Analysis
with Visualisation
Bivariate Analysis with
Visualisation
Derived Metrics
Data
Sourcing
Data Sourcing is the process of gathering data
from multiple sources as external or internal
data collection.
There are two major kind of data which can
be classified according to the source:
1. Public data 2. Private data
Public Data:- The data which is easy to access
without taking any permission from the agencies
is called public data. The agencies made the data
public for the purpose of the research. Like
government and other public sector or
ecommerce sites made there data public.
Private Data:- The data which is not available
on public platform and to access the data we
have to take the permission of organisation is
called private data. Like Banking ,telecom ,retail
sector are there which not made their data
publicly available.
Data
Cleaning
• After collecting the data , the next step is
data cleaning. Data cleaning means that you
get rid of any information that doesn’t need
to be there and clean up by mistake.
• Data Cleaning is the process of clean the data
to improve the quality of the data for further
data analysis and building a machine
learning model.
• The benefit of data cleaning is that all the
incorrect and irrelevant data is gone and we
get the good quality of data which will help
in improving the accuracy of our machine
learning model.
The following are some steps involve in Data
Cleaning:
Handle Missing Values
Standardisation of the data
Outlier Treatment
Handle Invalid values
Some Questions you need to ask yourself about the data before
cleaning the data
.
.
.
.
.
what you are reading, does it
make sense?
Does this data make sense?
Does this data you’re looking at
match the column labels?
Does the data follow the rules for this
field?
After compute the summarize
statistics for numerical data, does it
make sense?
Handle Missing Value
Option
A
Option
B
Option
C
Option
D
Prediction model is one of the advanced method to handle missing values. In this
method dataset with no missing value become training set and dataset with
missing value become the test set and the missing values is treated as target
variable.
Predicting the missing values
This method can be used on independent variable when it has numerical
variables. On categorical feature we apply mode method to fill the missing value.
.
Replacing with mean/median/mode
This method we commonly used to handle missing values. Rows can be deleted if it has
insignificant number of missing value Columns can be delete if it has more than 75% of missing
value
Some machine learning algorithm supports to handle missing value in the
datasets. Like KNN, Naïve Bayes, Random forest.
.
Algorithm Imputation
Delete Rows/Columns
Airlines Ticket Price
Indigo 3887
Air Asia 7662
Jet Airways -
Air India 5221
SpiceJet 4321
Suppose we have Airlines ticket price data in which there is missing value.
Steps to fill the numeric missing value:-
1) Compute the mean/median of the data
(3887+7662+5221+4321)/4 = 5272.75
2) Substitute the Mean of the value in missing place.
Airlines Ticket price
Indigo 3887
Indigo 7675
Air Asia 4236
- 6524
Jet Airways 4321
Suppose we have Missing values in the categorical data:
Then we take the mode of the dataset a to fill the missing values:
Here :
Mode = Indigo
We substitute the Indigo in place of missing value in Airline column.
For categorical Data
For Numerical Data Example
Standardization/
Feature Scaling
Feature scaling is the method to rescale the values
present in the features. In feature scaling we convert
the scale of different measurement into a single
scale. It standardise the whole dataset in one range.
Importance of Feature Scaling:- When we are
dealing with independent variable or features that
differ from each other in terms of range of values or
units of the features, then we have to
normalise/standardise the data so that the difference
in range of values doesn’t affect the outcome of the
data.
Feature Scaling Technique
Standard scaler ensures that for each feature, the mean is
zero and the standard deviation is 1,bringing all feature to the
same magnitude. In simple words Standardization helps you
to scale down your feature based on the standard normal
distribution
Standard Scaler
Normalization helps you to scale down your features
between a range 0 to 1
Min-Max Scaler
.
Normalization
Standardization
Age Income
(£)
New value
24 15000 (15000 – 12000)/18000 = 0.16667
30 12000 (12000 – 12000)/18000 =0
28 30000 (30000 – 12000)/18000 =1
Average = (15000 + 12000 + 30000)/3 = 19000
Standard deviation = ??
Income Minimum = 12000
Income Maximum = 30000
(Max – min) = (30000 – 12000) = 18000
Hence, we have converted the income values between 0 and 1
Please note, the new values have
Minimum = 0
Maximum = 1
Age Income
(£)
New value of Income
24 15000 (15000 - 19000)/std = ??
30 12000 (12000 - 19000)/std = ??
28 30000 (30000-19000)/std = ??
Standardization
Normalisation
Example
.
Outlier Treatment
Outliers are the most extremes values in the data. It is a
abnormal observations that deviate from the norm.
Outliers do not fit in the normal behaviour of the data.
Detect Outliers using following methods:
1.Boxplot 2. Histogram 3. Scatter plot
4.Z-score
5. Interquartile range(values out of 1.5 time of IQR)
Handle Outlier using following methods:-
1.Remove the outliers.
2.Replace outlier with suitable values by using following
methods:-
• Quantile method
• Interquartile range
3.Use that ML model which are not sensitive to outliers
Like:-KNN,Decision Tree,SVM,NaïveBayes,Ensemble
methods
.
Handle Invalid Value
• Encode Unicode properly:-In case the data is being read as junk
characters, try to change encoding, E.g. CP1252 instead of UTF-8.
• Convert incorrect data types:- Correct the incorrect data types
to the correct data types for ease of analysis. E.g. if numeric values
are stored as strings, it would not be possible to calculate metrics
such as mean, median, etc. Some of the common data type
corrections are — string to number: "12,300" to “12300”; string
to date: "2013-Aug" to “2013/08”; number to string: “PIN Code
110001” to "110001"; etc.
• Correct values that go beyond range:- If some of the values are
beyond logical range, e.g. temperature less than -273° C (0° K),
you would need to correct them as required. A close look would
help you check if there is scope for correction, or if the value
needs to be removed.
• Correct wrong structure:- Values that don’t follow a defined
structure can be removed. E.g. In a data set containing pin codes
of Indian cities, a pin code of 12 digits would be an invalid value
and needs to be removed. Similarly, a phone number of 12 digits
would be an invalid value
Univariate Analysis
Frequency
Central Tendency
.
Dispersion
.
Univariate analysis deal with analysing one
feature at a time. Univariate analysis is
important to understand the single variable at
a time. The main purpose of univariate
analysis is to make data easier to interpret.
Mainly, Histogram is used to visualize
univariate
Measure of frequency include frequency table, where
you tabulate how often a particular category or
particular value appear in the dataset.
It is a typical value for a probability distribution for
univariate measure of central tendency we have
Mean, Median, Mode.
It tries to capture how your data points jump
around.
Univariate Types:
Types of Data
Qualitative Quantitative
A variable which able to describe quality of
the population.(Numerical values)
A variable which quantify the
population(distinct categories)
Ordinal Discrete Continuous
Nominal
Nominal
Ordinal
.
Discrete
.
Continuous
It represent qualitative information
without order. Value represent a
discrete units.
Like Gender: Male/Female ,Eye colour.
It represent qualitative information with order.
It indicate the measurement classification are
different and can be ranked. Lets say Economic
status: high/medium/low which can ordered as
low,medium,high.
A number within a range of a value is
usually measured, such as height.
It has a discrete value that means it
take only counted value not a
decimal values. Like count of
student in class
Types of Data
.
Segmented
Univariate Analysis
Segmented Univariate Analysis allow you to compare
subset of data it help us to understand how the
relevant metric varies across the different segment.
The Standard process of segmented univariate
analysis is as follow:
• Take a raw data
• Group by dimensions
• Summarise using a relevant metric like mean
,median.
• Compare the aggregate metric across the
categories.
BivariateAnalysis
Correlation
.
.
Data which has two variables ,you often
want to measure the relationship that
exists between these two variables.
Correlation measure the strength as well as the
direction of the linear relationship between the two
variables. Its range is from -1 to +1.
Bi-variate Types:
Covariance measure how much two random
variable vary together. Its range is from -∞ to +∞.
Covariance
● If one increases as the other increases, the correlation is positive
● If one decreases as the other increases, the correlation is negative
● If one stays constant as the other varies, the correlation is zero
To see the distribution of two categorical variables. For example, if you want to compare the number of boys and girls who
play games, you can make a ‘cross table’ as given below:
To see the distribution of two categorical variables with one continuous variable. For example, you saw how a student’s
percentage in science is distributed based on the father’s occupation (categorical variable 1) and the poverty level
(categorical variable 2).
Example
Derived Metrics
Feature Binning
.
.
Derived metrics create a new variable from the
existing variable to get a insightful information
from the data by analysing the data.
Feature Encoding
From Domain Knowledge
Calculated from Data
.
Feature Binning
Feature binning converts or transform
continuous/numeric variable to categorical
variable. It can also be used to identify missing
values or outliers.
Type of Binning:-
1. Unsupervised Binning
a)Equal width binning
b)Equal frequency binning
2. Supervised Binning
a) Entropy based binning
.
Supervised Binning
.
Un Supervised Binning
It transform continuous or numeric
variable into categorical value without
taking dependent variable into
consideration
It transform continuous or numeric
variable into categorical value taking
dependent variable into consideration.
Equal Width
Equal width separate the continuous variable to several
categories having same range of width.
Equal Frequency
Equal frequency separate the continuous variable into
several categories having approximately same number of
values.
Entropy Based
Entropy based binning separate the continuous
or numeric variable majority of values in a
category belong to same label of class.
Feature Encoding
Feature encoding help us to transform categorical data into numeric data.
Label encoding
Label encoding is technique to transform
categorical variables into numerical variables
by assigning a numerical value to each of the
categories.
One-Hot encoding
This technique is used when independent
variables are nominal. It creates k different
columns each for a category and replaces one
column with 1 rest of the columns is 0.
Here, 0 represents the absence,
and 1 represents the presence of
that category.
Target Encoding
In target encoding, we calculate the average of
the dependent variable for each category and
replace the category variable with the mean
value
TheHashencoder represents categorical independent
variable using the new dimensions. Here, the user
can fix the number of dimensions after
transformation using component argument.
Hash Encoder
Uses Cases
Uses
Cases
Basically EDA is important in every business problem,
it’s a first crucial step in data analysis process.
Some of the use cases where we use EDA is:-
• Cancer Data Analysis :- In this data set we have to
predict who are suffering from cancer and who’s
not.
• Fraud Data Analysis in E-commerce
Transactions :- In this dataset we have to detect
the fraud in a E-commerce transaction.
THANK YOU

Contenu connexe

Tendances

Data Visualization in Exploratory Data Analysis
Data Visualization in Exploratory Data AnalysisData Visualization in Exploratory Data Analysis
Data Visualization in Exploratory Data AnalysisEva Durall
 
Exploratory data analysis data visualization
Exploratory data analysis data visualizationExploratory data analysis data visualization
Exploratory data analysis data visualizationDr. Hamdan Al-Sabri
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statisticsAttaullah Khan
 
Exploratory data analysis
Exploratory data analysisExploratory data analysis
Exploratory data analysisYabebal Ayalew
 
Exploratory Data Analysis - A Comprehensive Guide to EDA.pdf
Exploratory Data Analysis - A Comprehensive Guide to EDA.pdfExploratory Data Analysis - A Comprehensive Guide to EDA.pdf
Exploratory Data Analysis - A Comprehensive Guide to EDA.pdfJamieDornan2
 
3.5 Exploratory Data Analysis
3.5 Exploratory Data Analysis3.5 Exploratory Data Analysis
3.5 Exploratory Data Analysismlong24
 
The Importance of Data Visualization
The Importance of Data VisualizationThe Importance of Data Visualization
The Importance of Data VisualizationCenterline Digital
 
Descriptive Statistics and Data Visualization
Descriptive Statistics and Data VisualizationDescriptive Statistics and Data Visualization
Descriptive Statistics and Data VisualizationDouglas Joubert
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data scienceMahir Haque
 
Introduction to Statistics
Introduction to StatisticsIntroduction to Statistics
Introduction to StatisticsRobert Tinaro
 
Data analysis
Data analysisData analysis
Data analysisLizzyL1
 
Data Visualization
Data VisualizationData Visualization
Data Visualizationgzargary
 
Principles of data visualisation 2021
Principles of data visualisation 2021Principles of data visualisation 2021
Principles of data visualisation 2021Marié Roux
 
data analysis techniques and statistical softwares
data analysis techniques and statistical softwaresdata analysis techniques and statistical softwares
data analysis techniques and statistical softwaresDr.ammara khakwani
 

Tendances (20)

Data Visualization in Exploratory Data Analysis
Data Visualization in Exploratory Data AnalysisData Visualization in Exploratory Data Analysis
Data Visualization in Exploratory Data Analysis
 
Exploratory data analysis data visualization
Exploratory data analysis data visualizationExploratory data analysis data visualization
Exploratory data analysis data visualization
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statistics
 
Exploratory data analysis
Exploratory data analysisExploratory data analysis
Exploratory data analysis
 
Exploratory Data Analysis - A Comprehensive Guide to EDA.pdf
Exploratory Data Analysis - A Comprehensive Guide to EDA.pdfExploratory Data Analysis - A Comprehensive Guide to EDA.pdf
Exploratory Data Analysis - A Comprehensive Guide to EDA.pdf
 
3.5 Exploratory Data Analysis
3.5 Exploratory Data Analysis3.5 Exploratory Data Analysis
3.5 Exploratory Data Analysis
 
The Importance of Data Visualization
The Importance of Data VisualizationThe Importance of Data Visualization
The Importance of Data Visualization
 
Data Analytics
Data AnalyticsData Analytics
Data Analytics
 
Descriptive Statistics and Data Visualization
Descriptive Statistics and Data VisualizationDescriptive Statistics and Data Visualization
Descriptive Statistics and Data Visualization
 
3 data visualization
3 data visualization3 data visualization
3 data visualization
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
Introduction to Statistics
Introduction to StatisticsIntroduction to Statistics
Introduction to Statistics
 
Data analysis
Data analysisData analysis
Data analysis
 
Data Visualization
Data VisualizationData Visualization
Data Visualization
 
Principles of data visualisation 2021
Principles of data visualisation 2021Principles of data visualisation 2021
Principles of data visualisation 2021
 
data analysis techniques and statistical softwares
data analysis techniques and statistical softwaresdata analysis techniques and statistical softwares
data analysis techniques and statistical softwares
 
Data analysis
Data analysisData analysis
Data analysis
 
DATA ANALYSIS
DATA ANALYSISDATA ANALYSIS
DATA ANALYSIS
 
Data analytics
Data analyticsData analytics
Data analytics
 
Statistics and data science
Statistics and data scienceStatistics and data science
Statistics and data science
 

Similaire à Exploratory Data Analysis - Satyajit.pdf

Machine learning module 2
Machine learning module 2Machine learning module 2
Machine learning module 2Gokulks007
 
UNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
UNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHONUNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
UNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHONNandakumar P
 
KNOLX_Data_preprocessing
KNOLX_Data_preprocessingKNOLX_Data_preprocessing
KNOLX_Data_preprocessingKnoldus Inc.
 
Introduction of data science
Introduction of data scienceIntroduction of data science
Introduction of data scienceTanujaSomvanshi1
 
Data Science & AI Road Map by Python & Computer science tutor in Malaysia
Data Science  & AI Road Map by Python & Computer science tutor in MalaysiaData Science  & AI Road Map by Python & Computer science tutor in Malaysia
Data Science & AI Road Map by Python & Computer science tutor in MalaysiaAhmed Elmalla
 
Data Analytics Using R - Report
Data Analytics Using R - ReportData Analytics Using R - Report
Data Analytics Using R - ReportAkanksha Gohil
 
dimension reduction.ppt
dimension reduction.pptdimension reduction.ppt
dimension reduction.pptDeadpool120050
 
Data Processing & Explain each term in details.pptx
Data Processing & Explain each term in details.pptxData Processing & Explain each term in details.pptx
Data Processing & Explain each term in details.pptxPratikshaSurve4
 
Data preparation and processing chapter 2
Data preparation and processing chapter  2Data preparation and processing chapter  2
Data preparation and processing chapter 2Mahmoud Alfarra
 
Data Science Job ready #DataScienceInterview Question and Answers 2022 | #Dat...
Data Science Job ready #DataScienceInterview Question and Answers 2022 | #Dat...Data Science Job ready #DataScienceInterview Question and Answers 2022 | #Dat...
Data Science Job ready #DataScienceInterview Question and Answers 2022 | #Dat...Rohit Dubey
 
EDA_Revision_Session_1cdeba87-6912-4236-ba3b-079a5463bf00.pptx
EDA_Revision_Session_1cdeba87-6912-4236-ba3b-079a5463bf00.pptxEDA_Revision_Session_1cdeba87-6912-4236-ba3b-079a5463bf00.pptx
EDA_Revision_Session_1cdeba87-6912-4236-ba3b-079a5463bf00.pptxsaurav3107pandey
 
Data Science - Part III - EDA & Model Selection
Data Science - Part III - EDA & Model SelectionData Science - Part III - EDA & Model Selection
Data Science - Part III - EDA & Model SelectionDerek Kane
 

Similaire à Exploratory Data Analysis - Satyajit.pdf (20)

ML-Unit-4.pdf
ML-Unit-4.pdfML-Unit-4.pdf
ML-Unit-4.pdf
 
1234
12341234
1234
 
Chapter 3.pdf
Chapter 3.pdfChapter 3.pdf
Chapter 3.pdf
 
Machine learning module 2
Machine learning module 2Machine learning module 2
Machine learning module 2
 
UNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
UNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHONUNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
UNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
 
KNOLX_Data_preprocessing
KNOLX_Data_preprocessingKNOLX_Data_preprocessing
KNOLX_Data_preprocessing
 
Introduction of data science
Introduction of data scienceIntroduction of data science
Introduction of data science
 
Data Science & AI Road Map by Python & Computer science tutor in Malaysia
Data Science  & AI Road Map by Python & Computer science tutor in MalaysiaData Science  & AI Road Map by Python & Computer science tutor in Malaysia
Data Science & AI Road Map by Python & Computer science tutor in Malaysia
 
Pelatihan Data Analitik
Pelatihan Data AnalitikPelatihan Data Analitik
Pelatihan Data Analitik
 
Data Analytics Using R - Report
Data Analytics Using R - ReportData Analytics Using R - Report
Data Analytics Using R - Report
 
dimension reduction.ppt
dimension reduction.pptdimension reduction.ppt
dimension reduction.ppt
 
Data Processing & Explain each term in details.pptx
Data Processing & Explain each term in details.pptxData Processing & Explain each term in details.pptx
Data Processing & Explain each term in details.pptx
 
Data preparation and processing chapter 2
Data preparation and processing chapter  2Data preparation and processing chapter  2
Data preparation and processing chapter 2
 
Data Science Job ready #DataScienceInterview Question and Answers 2022 | #Dat...
Data Science Job ready #DataScienceInterview Question and Answers 2022 | #Dat...Data Science Job ready #DataScienceInterview Question and Answers 2022 | #Dat...
Data Science Job ready #DataScienceInterview Question and Answers 2022 | #Dat...
 
QQ Plot.pptx
QQ Plot.pptxQQ Plot.pptx
QQ Plot.pptx
 
Data processing
Data processingData processing
Data processing
 
Krupa rm
Krupa rmKrupa rm
Krupa rm
 
EDA-Unit 1.pdf
EDA-Unit 1.pdfEDA-Unit 1.pdf
EDA-Unit 1.pdf
 
EDA_Revision_Session_1cdeba87-6912-4236-ba3b-079a5463bf00.pptx
EDA_Revision_Session_1cdeba87-6912-4236-ba3b-079a5463bf00.pptxEDA_Revision_Session_1cdeba87-6912-4236-ba3b-079a5463bf00.pptx
EDA_Revision_Session_1cdeba87-6912-4236-ba3b-079a5463bf00.pptx
 
Data Science - Part III - EDA & Model Selection
Data Science - Part III - EDA & Model SelectionData Science - Part III - EDA & Model Selection
Data Science - Part III - EDA & Model Selection
 

Dernier

Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 

Dernier (20)

Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 

Exploratory Data Analysis - Satyajit.pdf

  • 2. . 01 . 02 . 03 04 Agenda What is Exploratory Data Analysis? Why EDA is important? Visualisation ▪ Important Charts for visualisation Steps Involved in EDA:- ▪ Data Sourcing ▪ Data Cleaning ▪ Univariate analysis with visualisation ▪ Bivariate analysis with visualisation ▪ Derived metrics Uses Cases 05
  • 4. Exploratory Data Analysis . ❑ Exploratory Data Analysis is an approach to analyse the datasets to summarize their main characteristics in form of visual methods. ❑ EDA is nothing but an data exploration technique to understand various aspects of the data. ❑ The main aim of EDA is to obtain confidence in a data to an extent where we are ready to engage a machine learning model. What is Exploratory Data Analysis?
  • 5. Why EDA is Important? ❖ EDA is important to analyse the data it’s a first steps in data analysis process. ❖ EDA give a basic idea to understand the data and make sense of the data to figure out the question you need to ask and find out the best way to manipulate the dataset to get the answer of your question. ❖ Exploratory data analysis help us to finding the errors, discovering data, mapping out data structure, finding out anomalies. ❖ Exploratory data analysis is important for business process because we are preparing dataset for deep through analysis that will detect you business problem. ❖ EDA help to build a quick and dirty model, or a baseline model, which can serve as a comparison against later models that you will build.
  • 6. Visualization . . Easily analyse the data and summarize it. Easily understand the features of the data. Help to find the trend or pattern of the data. Help to get meaningful insights from the data. Visualisation is the presentation of the data in the graphical or visual form to understand the data more clearly. Visualisation is easy to understand the data.
  • 7. Histogram represent the frequency distribution of the data Histogram Bar graph represent the total observation in the data for a particular category. Bar Chart Important Charts for Visualisation
  • 8. Pie chart represent the percentage of the data by each category. Pie Chart Boxplot display the distribution of the data based on five number summary(minimum, first quartile, median, third quartile, maximum) Box Plot
  • 9. Scatter plot represent the relationship between two numerical variable. It show the correlation between two variable. Scatter Plot Pair plot show the bivariate distribution of the datasets. It show the pairwise relationship between the variable Pair-plot
  • 10. Line chart are used to track change over line and short period of time. Line chart are used in time series data. Line Chart Violin chart are used to plot numeric data Violin Plot
  • 11. Box Plot vs ViolinPlot ● Violin Graphs are better than Box Plots ● Violin graph is visually intuitive & attractive. Image Courtesy: https://blog.bioturing.com/
  • 12. A heat map is data analysis software that uses color the way a bar graph uses height and width. Heatmaps Strip plots are used to draw a scatter plot for on the categorical basis Strip Plot
  • 13. Steps Involved in EDA 01 03 05 02 04 . Data Sourcing Data Cleaning Univariate Analysis with Visualisation Bivariate Analysis with Visualisation Derived Metrics
  • 14. Data Sourcing Data Sourcing is the process of gathering data from multiple sources as external or internal data collection. There are two major kind of data which can be classified according to the source: 1. Public data 2. Private data Public Data:- The data which is easy to access without taking any permission from the agencies is called public data. The agencies made the data public for the purpose of the research. Like government and other public sector or ecommerce sites made there data public. Private Data:- The data which is not available on public platform and to access the data we have to take the permission of organisation is called private data. Like Banking ,telecom ,retail sector are there which not made their data publicly available.
  • 15. Data Cleaning • After collecting the data , the next step is data cleaning. Data cleaning means that you get rid of any information that doesn’t need to be there and clean up by mistake. • Data Cleaning is the process of clean the data to improve the quality of the data for further data analysis and building a machine learning model. • The benefit of data cleaning is that all the incorrect and irrelevant data is gone and we get the good quality of data which will help in improving the accuracy of our machine learning model. The following are some steps involve in Data Cleaning: Handle Missing Values Standardisation of the data Outlier Treatment Handle Invalid values
  • 16. Some Questions you need to ask yourself about the data before cleaning the data . . . . . what you are reading, does it make sense? Does this data make sense? Does this data you’re looking at match the column labels? Does the data follow the rules for this field? After compute the summarize statistics for numerical data, does it make sense?
  • 17. Handle Missing Value Option A Option B Option C Option D Prediction model is one of the advanced method to handle missing values. In this method dataset with no missing value become training set and dataset with missing value become the test set and the missing values is treated as target variable. Predicting the missing values This method can be used on independent variable when it has numerical variables. On categorical feature we apply mode method to fill the missing value. . Replacing with mean/median/mode This method we commonly used to handle missing values. Rows can be deleted if it has insignificant number of missing value Columns can be delete if it has more than 75% of missing value Some machine learning algorithm supports to handle missing value in the datasets. Like KNN, Naïve Bayes, Random forest. . Algorithm Imputation Delete Rows/Columns
  • 18. Airlines Ticket Price Indigo 3887 Air Asia 7662 Jet Airways - Air India 5221 SpiceJet 4321 Suppose we have Airlines ticket price data in which there is missing value. Steps to fill the numeric missing value:- 1) Compute the mean/median of the data (3887+7662+5221+4321)/4 = 5272.75 2) Substitute the Mean of the value in missing place. Airlines Ticket price Indigo 3887 Indigo 7675 Air Asia 4236 - 6524 Jet Airways 4321 Suppose we have Missing values in the categorical data: Then we take the mode of the dataset a to fill the missing values: Here : Mode = Indigo We substitute the Indigo in place of missing value in Airline column. For categorical Data For Numerical Data Example
  • 19. Standardization/ Feature Scaling Feature scaling is the method to rescale the values present in the features. In feature scaling we convert the scale of different measurement into a single scale. It standardise the whole dataset in one range. Importance of Feature Scaling:- When we are dealing with independent variable or features that differ from each other in terms of range of values or units of the features, then we have to normalise/standardise the data so that the difference in range of values doesn’t affect the outcome of the data.
  • 20. Feature Scaling Technique Standard scaler ensures that for each feature, the mean is zero and the standard deviation is 1,bringing all feature to the same magnitude. In simple words Standardization helps you to scale down your feature based on the standard normal distribution Standard Scaler Normalization helps you to scale down your features between a range 0 to 1 Min-Max Scaler . Normalization Standardization
  • 21. Age Income (£) New value 24 15000 (15000 – 12000)/18000 = 0.16667 30 12000 (12000 – 12000)/18000 =0 28 30000 (30000 – 12000)/18000 =1 Average = (15000 + 12000 + 30000)/3 = 19000 Standard deviation = ?? Income Minimum = 12000 Income Maximum = 30000 (Max – min) = (30000 – 12000) = 18000 Hence, we have converted the income values between 0 and 1 Please note, the new values have Minimum = 0 Maximum = 1 Age Income (£) New value of Income 24 15000 (15000 - 19000)/std = ?? 30 12000 (12000 - 19000)/std = ?? 28 30000 (30000-19000)/std = ?? Standardization Normalisation Example
  • 22. . Outlier Treatment Outliers are the most extremes values in the data. It is a abnormal observations that deviate from the norm. Outliers do not fit in the normal behaviour of the data. Detect Outliers using following methods: 1.Boxplot 2. Histogram 3. Scatter plot 4.Z-score 5. Interquartile range(values out of 1.5 time of IQR) Handle Outlier using following methods:- 1.Remove the outliers. 2.Replace outlier with suitable values by using following methods:- • Quantile method • Interquartile range 3.Use that ML model which are not sensitive to outliers Like:-KNN,Decision Tree,SVM,NaïveBayes,Ensemble methods
  • 23. . Handle Invalid Value • Encode Unicode properly:-In case the data is being read as junk characters, try to change encoding, E.g. CP1252 instead of UTF-8. • Convert incorrect data types:- Correct the incorrect data types to the correct data types for ease of analysis. E.g. if numeric values are stored as strings, it would not be possible to calculate metrics such as mean, median, etc. Some of the common data type corrections are — string to number: "12,300" to “12300”; string to date: "2013-Aug" to “2013/08”; number to string: “PIN Code 110001” to "110001"; etc. • Correct values that go beyond range:- If some of the values are beyond logical range, e.g. temperature less than -273° C (0° K), you would need to correct them as required. A close look would help you check if there is scope for correction, or if the value needs to be removed. • Correct wrong structure:- Values that don’t follow a defined structure can be removed. E.g. In a data set containing pin codes of Indian cities, a pin code of 12 digits would be an invalid value and needs to be removed. Similarly, a phone number of 12 digits would be an invalid value
  • 24. Univariate Analysis Frequency Central Tendency . Dispersion . Univariate analysis deal with analysing one feature at a time. Univariate analysis is important to understand the single variable at a time. The main purpose of univariate analysis is to make data easier to interpret. Mainly, Histogram is used to visualize univariate Measure of frequency include frequency table, where you tabulate how often a particular category or particular value appear in the dataset. It is a typical value for a probability distribution for univariate measure of central tendency we have Mean, Median, Mode. It tries to capture how your data points jump around. Univariate Types:
  • 25. Types of Data Qualitative Quantitative A variable which able to describe quality of the population.(Numerical values) A variable which quantify the population(distinct categories) Ordinal Discrete Continuous Nominal
  • 26. Nominal Ordinal . Discrete . Continuous It represent qualitative information without order. Value represent a discrete units. Like Gender: Male/Female ,Eye colour. It represent qualitative information with order. It indicate the measurement classification are different and can be ranked. Lets say Economic status: high/medium/low which can ordered as low,medium,high. A number within a range of a value is usually measured, such as height. It has a discrete value that means it take only counted value not a decimal values. Like count of student in class Types of Data
  • 27. . Segmented Univariate Analysis Segmented Univariate Analysis allow you to compare subset of data it help us to understand how the relevant metric varies across the different segment. The Standard process of segmented univariate analysis is as follow: • Take a raw data • Group by dimensions • Summarise using a relevant metric like mean ,median. • Compare the aggregate metric across the categories.
  • 28. BivariateAnalysis Correlation . . Data which has two variables ,you often want to measure the relationship that exists between these two variables. Correlation measure the strength as well as the direction of the linear relationship between the two variables. Its range is from -1 to +1. Bi-variate Types: Covariance measure how much two random variable vary together. Its range is from -∞ to +∞. Covariance ● If one increases as the other increases, the correlation is positive ● If one decreases as the other increases, the correlation is negative ● If one stays constant as the other varies, the correlation is zero
  • 29. To see the distribution of two categorical variables. For example, if you want to compare the number of boys and girls who play games, you can make a ‘cross table’ as given below: To see the distribution of two categorical variables with one continuous variable. For example, you saw how a student’s percentage in science is distributed based on the father’s occupation (categorical variable 1) and the poverty level (categorical variable 2). Example
  • 30. Derived Metrics Feature Binning . . Derived metrics create a new variable from the existing variable to get a insightful information from the data by analysing the data. Feature Encoding From Domain Knowledge Calculated from Data
  • 31. . Feature Binning Feature binning converts or transform continuous/numeric variable to categorical variable. It can also be used to identify missing values or outliers. Type of Binning:- 1. Unsupervised Binning a)Equal width binning b)Equal frequency binning 2. Supervised Binning a) Entropy based binning
  • 32. . Supervised Binning . Un Supervised Binning It transform continuous or numeric variable into categorical value without taking dependent variable into consideration It transform continuous or numeric variable into categorical value taking dependent variable into consideration. Equal Width Equal width separate the continuous variable to several categories having same range of width. Equal Frequency Equal frequency separate the continuous variable into several categories having approximately same number of values. Entropy Based Entropy based binning separate the continuous or numeric variable majority of values in a category belong to same label of class.
  • 33. Feature Encoding Feature encoding help us to transform categorical data into numeric data. Label encoding Label encoding is technique to transform categorical variables into numerical variables by assigning a numerical value to each of the categories. One-Hot encoding This technique is used when independent variables are nominal. It creates k different columns each for a category and replaces one column with 1 rest of the columns is 0. Here, 0 represents the absence, and 1 represents the presence of that category. Target Encoding In target encoding, we calculate the average of the dependent variable for each category and replace the category variable with the mean value TheHashencoder represents categorical independent variable using the new dimensions. Here, the user can fix the number of dimensions after transformation using component argument. Hash Encoder
  • 34. Uses Cases Uses Cases Basically EDA is important in every business problem, it’s a first crucial step in data analysis process. Some of the use cases where we use EDA is:- • Cancer Data Analysis :- In this data set we have to predict who are suffering from cancer and who’s not. • Fraud Data Analysis in E-commerce Transactions :- In this dataset we have to detect the fraud in a E-commerce transaction.