1. Exploratory Data Analysis (EDA) in R
Big Data & IoT
Umair Shafique (03246441789)
Scholar MS Information Technology - University of Gujrat
2. Exploratory Data Analysis (EDA)
• Exploratory Data Analysis (EDA) is the process of examining data, mainly with
visual techniques. It is used to discover trends and patterns, or to check
assumptions, with the help of statistical summaries and graphical representations.
• Why do we use EDA?
• Detection of mistakes.
• Checking of assumptions.
• Preliminary selection of appropriate models.
• Determining the relationships among the explanatory variables.
• Assessing the rough size and direction of relationships between explanatory
and outcome variables.
3. Techniques:
Most EDA techniques are graphical in nature, with a few quantitative
techniques. The graphical techniques employed in EDA are often quite
simple, consisting of:
• Plotting the raw data, for example with histograms and probability plots.
• Plotting simple statistics, for example with mean plots, standard deviation
plots, and box plots.
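As a minimal sketch of two of these techniques, the following R code draws a normal probability (Q-Q) plot and a box plot. It assumes the built-in iris dataset and its Sepal.Width column; the choice of column is only for illustration.

```r
# One numeric column from the built-in iris data (an assumed example).
x <- iris$Sepal.Width

# Raw data: normal probability (Q-Q) plot with a reference line.
qqnorm(x, main = "Normal Q-Q plot of sepal width")
qqline(x)

# Simple statistics: a box plot summarizes the median, quartiles,
# and whisker ends of the same column.
b <- boxplot(x, main = "Box plot of sepal width", ylab = "Sepal width (cm)")

# b$stats holds the five numbers that the box plot draws.
print(b$stats)
```

Points that fall far from the reference line in the Q-Q plot suggest a departure from a normal distribution.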
4. Tools:
Some of the most common data science tools used to perform EDA include:
• Python: An interpreted, object-oriented programming language with dynamic
semantics and high-level, built-in data structures. Python can be used in EDA
to identify missing values in a data set, which is important so you can
decide how to handle missing values for machine learning.
• R: An open-source programming language and free software environment for
statistical computing and graphics supported by the R Foundation for Statistical
Computing. The R language is widely used by statisticians and data scientists
for statistical computing and data analysis.
5. Exploratory Data Analysis In R
In R, we perform EDA under two broad classifications:
• Descriptive Statistics, which include mean, median, mode, inter-quartile range,
and so on.
• Graphical Methods, which include histograms, box plots, and so on.
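The descriptive statistics named above can be sketched as follows. This assumes the built-in iris dataset and its Sepal.Length column. Base R has no function for the statistical mode, so stat_mode is a hypothetical helper written for this example.

```r
# One numeric column from the built-in iris data (an assumed example).
x <- iris$Sepal.Length

# Hypothetical helper: base R's mode() reports the storage type, not the
# most frequent value, so the mode is taken from a frequency table.
# Ties resolve to the smallest value.
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}

stats <- c(
  mean   = mean(x),
  median = median(x),
  mode   = stat_mode(x),
  iqr    = IQR(x)
)
print(stats)
```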
6. How to perform Exploratory Data Analysis in R
This involves exploring a dataset in three ways:
• 1. Summarizing a dataset using descriptive statistics.
• 2. Visualizing a dataset using charts.
• 3. Normalizing a dataset.
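The three steps can be sketched on the built-in iris dataset as shown below. The slides do not say which normalization method is meant, so min-max scaling is an assumption here, and min_max is a hypothetical helper.

```r
# Step 1: summarize every column with descriptive statistics.
print(summary(iris))

# Step 2: visualize one column, split by species.
boxplot(Sepal.Length ~ Species, data = iris,
        main = "Sepal length by species", ylab = "Sepal length (cm)")

# Step 3: min-max normalization (an assumed choice of method) rescales
# each numeric column to the range [0, 1].
min_max <- function(v) (v - min(v)) / (max(v) - min(v))
iris_norm <- iris
iris_norm[1:4] <- lapply(iris[1:4], min_max)
print(summary(iris_norm))
```

The Species column is left unchanged because it is categorical, not numeric.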
7. Factor in R
• What is a factor in R?
• Factors are variables in R which take on a limited number of different values; such variables are often
referred to as categorical variables.
• In a dataset we can distinguish two types of variables: categorical and continuous.
• In a categorical variable, the value is limited and usually based on a particular finite group. For example, a
categorical variable can be countries, years, or gender.
• A continuous variable, however, can take any numeric value, whole or decimal. For example, revenue or
the price of a share.
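A short sketch of the difference between the two variable types follows. The gender and price values are made up for illustration.

```r
# A categorical variable: a limited set of values (hypothetical data).
gender <- c("male", "female", "female", "male", "female")

# factor() stores the distinct values as levels.
gender_f <- factor(gender)
print(levels(gender_f))   # the distinct categories
print(table(gender_f))    # number of observations per category

# A continuous variable stays numeric (hypothetical share prices).
price <- c(101.5, 99.25, 103.0)
print(is.factor(gender_f))
print(is.numeric(price))
```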
8. R charts and graphs:
A pie chart is a representation of values as slices of a circle with different colors. The slices are labeled and the
numbers corresponding to each slice are also represented in the chart.
In R the pie chart is created using the pie() function, which takes a vector of non-negative numbers as input. The
additional parameters are used to control labels, colors, titles, etc.
Syntax:
pie(x, labels, radius, main, col, clockwise)
Following is the description of the parameters used:
• x is a vector containing the numeric values used in the pie chart.
• labels is used to give a description of the slices.
• radius indicates the radius of the circle of the pie chart.
• main indicates the title of the pie chart.
• col indicates the color palette.
• clockwise is a logical value indicating if the slices are drawn clockwise or anti-clockwise.
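As a minimal sketch, the call below uses each of these parameters. It assumes the built-in iris dataset as the data source; the colors and the radius value are arbitrary choices.

```r
# Number of flowers per species in the built-in iris data.
counts <- table(iris$Species)

pie(
  x = as.vector(counts),
  labels = paste0(names(counts), " (", counts, ")"),  # name plus count
  radius = 0.8,
  main = "Iris species",
  col = c("tomato", "gold", "steelblue"),
  clockwise = TRUE
)
```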
11. Bar charts:
A bar chart presents data in rectangular bars with the length of each bar proportional to the value of the variable.
R uses the function barplot() to create bar charts. R can draw both vertical and horizontal bars in the bar chart. In
the bar chart, each of the bars can be given a different color.
Syntax:
The basic syntax to create a bar chart in R is:
barplot(H, xlab, ylab, main, names.arg, col)
Following is the description of the parameters used:
• H is a vector or matrix containing numeric values used in the chart.
• xlab is the label for the x-axis.
• ylab is the label for the y-axis.
• main is the title of the bar chart.
• names.arg is a vector of names appearing under each bar.
• col is used to give colors to the bars in the graph.
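The following sketch shows one possible barplot() call with these parameters. It assumes the built-in iris dataset and plots the mean petal length per species; the colors are arbitrary.

```r
# Mean petal length per species, computed from the built-in iris data.
H <- tapply(iris$Petal.Length, iris$Species, mean)

# barplot() returns the x positions of the bar midpoints.
mids <- barplot(
  H,
  names.arg = names(H),
  xlab = "Species",
  ylab = "Mean petal length (cm)",
  main = "Mean petal length by species",
  col = c("tomato", "gold", "steelblue")
)
# Adding horiz = TRUE to the call would draw horizontal bars instead.
```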
14. Histograms:
Histograms represent the frequencies of the values of a variable, grouped into ranges. The height of each bar represents the number of values that fall in its range.
R creates a histogram using the hist() function. This function takes a vector as input and uses some parameters to plot histograms.
Syntax:
hist(v, main, xlab, ylab, xlim, ylim, breaks, col, border)
• v is a vector containing the numeric values used in the histogram.
• main indicates the title of the chart.
• col is used to set the color.
• border is used to set the border color of each bar.
• xlab is used to give a description of the x-axis.
• ylab is used to give a description of the y-axis.
• xlim is used to specify the range of values on the x-axis.
• ylim is used to specify the range of values on the y-axis.
• breaks is used to specify the break points between bars (or a suggested number of bars), which determines the width of each bar.
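A minimal sketch of a hist() call with these parameters is shown below. It assumes the built-in iris dataset; the axis limits, break points, and colors are illustrative choices.

```r
# One numeric column from the built-in iris data (an assumed example).
v <- iris$Sepal.Length

# hist() draws the chart and also returns the computed bins.
h <- hist(
  v,
  main = "Distribution of sepal length",
  xlab = "Sepal length (cm)",
  ylab = "Frequency",
  xlim = c(4, 8),
  ylim = c(0, 40),
  breaks = seq(4, 8, by = 0.5),  # explicit break points, bars 0.5 wide
  col = "steelblue",
  border = "white"
)

# Number of values that fall in each bar.
print(h$counts)
```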
16. Line graphs:
The line chart is a graph that connects a series of points by drawing line segments between them. Line charts are
usually used to identify trends in data.
The plot() function in R is used to create the line graph.
Syntax:
plot(v, type, col, xlab, ylab, main)
Following is the description of the parameters used:
• v is a vector containing the numeric values.
• type takes the value "p" to draw only the points, "l" to draw only the lines, and "o" to draw both points and lines.
• xlab is used to give a description of the x-axis.
• ylab is used to give a description of the y-axis.
• main indicates the title of the chart.
• col is used to set the color.
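As a small sketch, the call below draws a line graph with both points and lines. The values in v are made up for illustration.

```r
# Hypothetical monthly readings, used only for illustration.
v <- c(7, 12, 28, 3, 41)

plot(
  v,
  type = "o",        # both points and connecting lines
  col = "red",
  xlab = "Month",
  ylab = "Reading",
  main = "Readings over five months"
)
```

When only one vector is given, plot() uses the positions 1, 2, 3, ... as the x values.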
18. Scatter plot
The scatter plot shows many points plotted in the Cartesian plane. Each point represents the values of two variables.
One variable is plotted on the horizontal axis and the other on the vertical axis.
The simple scatter plot is created using the plot() function.
Syntax:
plot(x, y, main, xlab, ylab, xlim, ylim)
Following is the description of the parameters used:
• x is the data set whose values are the horizontal coordinates.
• y is the data set whose values are the vertical coordinates.
• xlab is used to give a description of the x-axis.
• ylab is used to give a description of the y-axis.
• xlim is used to specify the range of values on the x-axis.
• ylim is used to specify the range of values on the y-axis.
• main indicates the title of the chart.
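The sketch below plots two numeric columns against each other. It assumes the built-in iris dataset; the axis limits are illustrative. The cor() call is an optional extra that gives a rough number for the size and direction of the relationship seen in the plot.

```r
# Two numeric columns from the built-in iris data (an assumed example).
x <- iris$Petal.Length
y <- iris$Petal.Width

plot(
  x, y,
  main = "Petal width vs. petal length",
  xlab = "Petal length (cm)",
  ylab = "Petal width (cm)",
  xlim = c(0, 7.5),
  ylim = c(0, 3)
)

# Pearson correlation between the two variables.
r <- cor(x, y)
print(r)
```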
20. IRIS dataset:
• The dataset contains four features (sepal length, sepal width, petal length, and petal width) for each of
three species (Setosa, Versicolor, Virginica) of the iris flower, with 50 samples per species.
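The dataset ships with R as the data frame iris, so it can be inspected without loading any file. A minimal sketch:

```r
# Structure of the data frame: column names and types.
str(iris)

# First six rows.
print(head(iris))

# Number of samples per species.
print(table(iris$Species))
```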