Exploratory Data Analysis
(EDA) in R
Big Data & IoT
Umair Shafique (03246441789)
Scholar MS Information Technology - University of Gujrat
Exploratory Data Analysis (EDA)
• Exploratory Data Analysis (EDA) is the process of examining data, largely with visual
techniques. It is used to discover trends and patterns or to check assumptions
with the help of statistical summaries and graphical representations.
• Why do we use EDA?
• Detecting mistakes.
• Checking assumptions.
• Making a preliminary selection of appropriate models.
• Determining the relationships among the explanatory variables.
• Assessing the rough size and direction of relationships between explanatory
and outcome variables.
Techniques:
Most EDA techniques are graphical in nature, with a few quantitative
techniques. The graphical techniques employed in EDA are
often quite simple, consisting of:
• Plotting the raw data, for example with histograms and probability plots.
• Plotting simple statistics, for example with mean plots, standard deviation plots, and
box plots.
Tools:
Some of the most common data science tools used for EDA include:
• Python: An interpreted, object-oriented programming language with dynamic
semantics and high-level, built-in data structures. Python can be used in EDA
to identify missing values in a data set, which is important so you can
decide how to handle them for machine learning.
• R: An open-source programming language and free software environment for
statistical computing and graphics, supported by the R Foundation for Statistical
Computing. The R language is widely used by statisticians in data science for
statistical work and data analysis.
Exploratory Data Analysis In R
In R, we perform EDA under two broad classifications:
• Descriptive statistics, which include the mean, median, mode, inter-quartile range,
and so on.
• Graphical methods, which include histograms, box plots, and so on.
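As a minimal sketch of the descriptive statistics named above, the following uses the Sepal.Length column of R's built-in iris data. The choice of column is only for illustration, and stat_mode() is a hypothetical helper, since base R has no function for the statistical mode.

```r
# Descriptive statistics for one numeric column of the built-in iris data.
x <- iris$Sepal.Length

mean_x   <- mean(x)     # arithmetic mean
median_x <- median(x)   # middle value
iqr_x    <- IQR(x)      # inter-quartile range (Q3 - Q1)

# Base R's mode() returns the storage type, not the most frequent value,
# so stat_mode() is a hypothetical helper written for this sketch.
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}
mode_x <- stat_mode(x)

# summary() reports min, quartiles, median, mean, and max in one call.
summary(x)
```

If several values tie for the highest count, this helper returns only the first of them.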
How to perform Exploratory Data Analysis in R
This involves exploring a dataset in three ways:
• 1. Summarizing the dataset using descriptive statistics.
• 2. Visualizing the dataset using charts.
• 3. Normalizing the dataset.
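The three steps might look like the sketch below on the built-in iris data. The slides do not say which normalization is meant, so min-max scaling and z-score standardization are shown as two common options; min_max() is a hypothetical helper, not a base R function.

```r
# Step 1: summarize every column (quartiles for numbers, counts for factors).
summary(iris)

# Step 2: visualize the four numeric columns side by side as box plots.
numeric_cols <- iris[, 1:4]
boxplot(numeric_cols, main = "Iris measurements")

# Step 3a: min-max scaling to the range [0, 1] (hypothetical helper).
min_max <- function(v) (v - min(v)) / (max(v) - min(v))
iris_minmax <- as.data.frame(lapply(numeric_cols, min_max))

# Step 3b: z-score standardization with base R's scale():
# each column ends up with mean 0 and standard deviation 1.
iris_z <- as.data.frame(scale(numeric_cols))
```

Min-max scaling keeps the shape of each distribution but is sensitive to outliers, while z-scores put columns with different spreads on a comparable scale.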
Factor in R
• What is a factor in R?
• Factors are variables in R that take on a limited number of different values; such variables are often
referred to as categorical variables.
• In a dataset we can distinguish two types of variables: categorical and continuous.
• In a categorical variable, the value is limited and usually based on a particular finite group. For example, a
categorical variable can be country, year, or gender.
• A continuous variable, however, can take any numeric value, whole or fractional. Examples are
revenue and the price of a share.
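A small sketch of the distinction, using made-up values for gender and revenue (both vectors are illustrative assumptions, not data from the slides):

```r
# A character vector with a small set of repeated values.
gender <- c("male", "female", "female", "male", "female")

# factor() stores the distinct values as levels.
gender_f <- factor(gender)
levels(gender_f)   # the distinct categories, sorted alphabetically
table(gender_f)    # number of observations per category

# A continuous variable stays numeric rather than becoming a factor.
revenue <- c(1200.50, 980.00, 1430.75, 1100.00, 1250.25)
is.factor(gender_f)
is.numeric(revenue)

# The Species column of the built-in iris data is already a factor.
nlevels(iris$Species)
```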
R charts and graphs:
A pie chart is a representation of values as slices of a circle with different colors. The slices are labeled, and the
numbers corresponding to each slice can also be shown in the chart.
In R, the pie chart is created using the pie() function, which takes a vector of non-negative numbers as input. The additional
parameters are used to control labels, colors, titles, etc.
Syntax:
pie(x, labels, radius, main, col, clockwise)
Following is the description of the parameters used:
• x is a vector containing the numeric values used in the pie chart.
• labels gives a description of each slice.
• radius indicates the radius of the circle of the pie chart.
• main indicates the title of the pie chart.
• col indicates the color palette.
• clockwise is a logical value indicating whether the slices are drawn clockwise or anti-clockwise.
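A minimal sketch of a pie() call using the parameters described above. The values and category names are made up for illustration, and the percentage labels are built by hand because pie() does not add them on its own.

```r
# Illustrative values and category names (assumptions for this sketch).
x <- c(21, 62, 10, 53)
categories <- c("A", "B", "C", "D")

# Percentage share of each slice, appended to its label.
pct <- round(100 * x / sum(x))
slice_labels <- paste0(categories, " (", pct, "%)")

pie(x,
    labels = slice_labels,
    radius = 0.8,
    main = "Share by category",
    col = rainbow(length(x)),
    clockwise = TRUE)
```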
Bar charts:
A bar chart presents data in rectangular bars with the length of each bar proportional to the value of the variable. R uses the
function barplot() to create bar charts. R can draw both vertical and horizontal bars in the bar chart. In the bar
chart, each of the bars can be given a different color.
Syntax:
The basic syntax to create a bar chart in R is:
barplot(H, xlab, ylab, main, names.arg, col)
Following is the description of the parameters used:
• H is a vector or matrix containing numeric values used in the chart.
• xlab is the label for the x-axis.
• ylab is the label for the y-axis.
• main is the title of the bar chart.
• names.arg is a vector of names appearing under each bar.
• col is used to give colors to the bars in the graph.
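The following sketch draws a vertical and a horizontal bar chart with barplot(). The revenue figures and month names are invented for illustration; horiz is an additional barplot() argument not listed in the syntax above.

```r
# Illustrative data (assumptions for this sketch).
H <- c(7, 12, 28, 3, 41)
months <- c("Mar", "Apr", "May", "Jun", "Jul")

# Vertical bars; barplot() returns the x positions of the bar midpoints.
mids <- barplot(H,
                names.arg = months,
                xlab = "Month",
                ylab = "Revenue",
                main = "Revenue chart",
                col = "steelblue")

# Horizontal bars, each with its own color.
barplot(H,
        names.arg = months,
        horiz = TRUE,
        main = "Revenue chart (horizontal)",
        col = rainbow(length(H)))
```

The returned midpoints are useful when adding text or points on top of the bars afterwards.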
Histograms:
A histogram represents the frequencies of the values of a variable, grouped into ranges. The height of each bar represents the number of values falling in that range.
R creates a histogram using the hist() function. This function takes a vector as input and uses some parameters to plot the histogram.
Syntax:
hist(v, main, xlab, ylab, xlim, ylim, breaks, col, border)
• v is a vector containing the numeric values used in the histogram.
• main indicates the title of the chart.
• col is used to set the fill color of the bars.
• border is used to set the border color of each bar.
• xlab is used to give a description of the x-axis.
• ylab is used to give a description of the y-axis.
• xlim is used to specify the range of values on the x-axis.
• ylim is used to specify the range of values on the y-axis.
• breaks controls how the values are split into bars (bins), and therefore the width of each bar.
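As a sketch, the call below plots the Sepal.Length column of the built-in iris data with bins of width 0.5. The column, colors, and bin edges are choices made for this example; here breaks is given as a vector of bin boundaries, which is one of the forms hist() accepts.

```r
v <- iris$Sepal.Length

# breaks as explicit bin boundaries from 4 to 8 in steps of 0.5.
h <- hist(v,
          main = "Sepal length",
          xlab = "Sepal length (cm)",
          ylab = "Frequency",
          xlim = c(4, 8),
          breaks = seq(4, 8, by = 0.5),
          col = "lightgreen",
          border = "darkgreen")

# hist() also returns the bin counts it plotted.
h$counts
```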
Line graphs:
A line chart is a graph that connects a series of points by drawing segments between them. Line charts are usually
used to identify trends in data.
The plot() function in R is used to create the line graph.
Syntax:
plot(v, type, col, xlab, ylab, main)
Following is the description of the parameters used:
• v is a vector containing the numeric values.
• type takes the value "p" to draw only the points, "l" to draw only the lines, and "o" to draw both points and lines.
• xlab is used to give a description of the x-axis.
• ylab is used to give a description of the y-axis.
• main indicates the title of the chart.
• col is used to set the color.
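A minimal line chart sketch with invented rainfall values. The second series and the use of lines() to overlay it go beyond the syntax listed above and are included only to show how two trends can be compared.

```r
# Illustrative data (assumptions for this sketch).
v <- c(7, 12, 28, 3, 41)
w <- c(14, 7, 6, 19, 3)

# type = "o" draws both the points and the connecting lines.
plot(v,
     type = "o",
     col = "red",
     xlab = "Month",
     ylab = "Rainfall",
     main = "Rainfall chart")

# lines() adds a second series to the plot that is already open.
lines(w, type = "o", col = "blue")
```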
Scatter plot
A scatter plot shows many points plotted in the Cartesian plane. Each point represents the values of two variables.
One variable is plotted on the horizontal axis and the other on the vertical axis.
A simple scatter plot is created using the plot() function.
Syntax:
plot(x, y, main, xlab, ylab, xlim, ylim)
Following is the description of the parameters used:
• x is the data set whose values are the horizontal coordinates.
• y is the data set whose values are the vertical coordinates.
• xlab is used to give a description of the x-axis.
• ylab is used to give a description of the y-axis.
• xlim is used to specify the range of values on the x-axis.
• ylim is used to specify the range of values on the y-axis.
• main indicates the title of the chart.
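The sketch below plots petal length against petal width from the built-in iris data. The choice of columns and axis limits is an assumption for illustration, and cor() is added as a quick numeric check on the direction and strength of the relationship the plot suggests.

```r
x <- iris$Petal.Length
y <- iris$Petal.Width

plot(x, y,
     main = "Petal length vs. petal width",
     xlab = "Petal length (cm)",
     ylab = "Petal width (cm)",
     xlim = c(0, 7),
     ylim = c(0, 3))

# Pearson correlation between the two variables.
r <- cor(x, y)
r
```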
IRIS dataset:
• The dataset contains four features (sepal length, sepal width, petal length, and petal width) for each of the
three species (setosa, versicolor, virginica) of the iris flower.
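The iris data ships with R, so it can be inspected directly. This sketch shows a few common first-look commands; the per-species mean table is one possible summary, not something taken from the slides.

```r
# Load the built-in dataset into the workspace.
data(iris)

dim(iris)     # number of rows and columns
str(iris)     # column names and types
head(iris)    # first six rows

levels(iris$Species)   # the species names
table(iris$Species)    # number of flowers per species

# Mean of each of the four features within each species.
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
species_means
```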
RStudio:
How to use RStudio:
