DATA MANIPULATION WITH R
(DATA VISUALIZATION)
Dr. P. Rambabu, M. Tech., Ph.D., F.I.E.
24-Feb-2022
Topics
1. Introduction to Data Science
2. Prerequisites (tidyverse)
3. Import Data (readr)
4. Data Tidying (tidyr)
a) pivot_longer(), pivot_wider()
b) separate(), unite()
5. Data Transformation (dplyr - Grammar of Manipulation)
a) arrange()
b) filter()
c) select()
d) mutate()
e) summarise()
6. Data Visualization (ggplot - Grammar of Graphics)
a) Column Chart, Stacked Column Graph, Bar Graph
b) Line Graph, Dual Axis Chart, Area Chart
c) Pie Chart, Heat Map
d) Scatter Chart, Bubble Chart
“The simple graph has brought more information to the
data analyst’s mind than any other device.” — John
Tukey
Introduction to Data Science
Data science is an exciting discipline that allows you to turn raw data into understanding,
insight, and knowledge.
A typical data science project follows the process below:
Procedure:
1. Import:
First you must import your data into R. This typically means that you take data stored in a
file, database, or web application programming interface (API), and load it into a data frame
in R. If you can’t get your data into R, you can’t do data science on it!
2. Tidy:
Once you’ve imported your data, it is a good idea to tidy it. Tidying your data means storing
it in a consistent form that matches the semantics of the dataset with the way it is stored. In
brief, when your data is tidy, each column is a variable, and each row is an observation. Tidy
data is important because the consistent structure lets you focus your struggle on questions
about the data, not fighting to get the data into the right form for different functions.
3. Transform:
Once you have tidy data, a common first step is to transform it. Transformation includes
narrowing in on observations of interest (like all people in one city, or all data from the last
year), creating new variables that are functions of existing variables (like computing speed
from distance and time), and calculating a set of summary statistics (like counts or means).
Together, tidying and transforming are called wrangling.
Once you have tidy data with the variables you need, there are two main engines of
knowledge generation:
4. Visualization
Visualization is a fundamentally human activity. A good visualization will show you things that you did not
expect, or raise new questions about the data. A good visualization might also hint that you’re
asking the wrong question, or you need to collect different data. Visualizations can surprise
you, but don’t scale particularly well because they require a human to interpret them.
5. Modelling
Models are complementary tools to visualization. Once you have made your questions
sufficiently precise, you can use a model to answer them. Models are a fundamentally
mathematical or computational tool, so they generally scale well. Even when they don’t, it’s
usually cheaper to buy more computers than it is to buy more brains! But every model makes
assumptions, and by its very nature a model cannot question its own assumptions. That
means a model cannot fundamentally surprise you.
These have complementary strengths and weaknesses so any real analysis will iterate
between them many times.
6. Communication
The last step of data science is communication, an absolutely critical part of any
data analysis project. It doesn’t matter how well your models and visualisation have
led you to understand the data unless you can also communicate your results to
others.
Surrounding all these tools is programming. Programming is a cross-cutting tool
that you use in every part of the project. You don’t need to be an expert programmer
to be a data scientist, but learning more about programming pays off because
becoming a better programmer allows you to automate common tasks, and solve
new problems with greater ease.
Prerequisites
There are four things you need to run the code.
1. R: To download R, go to CRAN, the comprehensive R archive network. CRAN is composed of a set
of mirror servers distributed around the world and is used to distribute R and R packages.
2. RStudio: An integrated development environment (IDE) for R programming. Download and
install it from http://www.rstudio.com/download
3. Tidyverse: A collection of R packages for data science. An R package is a collection of functions, data, and
documentation that extends the capabilities of base R.
4. Other Packages
*Packages are the fundamental units of reproducible R code. They include reusable functions, the
documentation that describes how to use them, and sample data.
# Install the complete tidyverse with
install.packages("tidyverse")
Tidyverse – Collection of R Packages
ggplot2
ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell
ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
dplyr
dplyr provides a grammar of data manipulation, providing a consistent set of verbs that solve the most common data
manipulation challenges.
tidyr
tidyr provides a set of functions that help you get to tidy data. Tidy data is data with a consistent form: in brief, every
variable goes in a column, and every column is a variable.
readr
readr provides a fast and friendly way to read rectangular data (like csv, tsv, and fwf). It is designed to flexibly parse
many types of data found in the wild, while still cleanly failing when data unexpectedly changes.
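As a quick sketch of readr in action (requires the readr package; read_csv() also accepts literal text containing newlines, which makes small examples self-contained):

```r
library(readr)

# read_csv() parses inline text the same way it parses a file;
# column types are guessed and reported
df <- read_csv("name,score
alice,90
bob,85")

df          # a tibble with 2 rows and 2 columns
spec(df)    # show the column types readr guessed
```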
Import Data
Before you can manipulate data with R, you need to import the data into R’s memory, or build a connection to the data
that R can use to access the data remotely. For example, you can build a connection to data that lives in a database.
How you import your data will depend on the format of the data. The most common way to store small data sets is as
a plain text file. Data may also be stored in a proprietary format associated with a specific piece of software, such as
SAS, SPSS, or Microsoft Excel. Data used on the internet is often stored as a JSON or XML file. Large data sets may be
stored in a database or a distributed storage system.
The readr package contains the most common functions in the tidyverse for importing data. The readr package is
loaded when you run library(tidyverse). The tidyverse also includes the following packages for importing specific types
of data. These are not loaded with library(tidyverse). You must load them individually when you need them.
• DBI - connect to databases
• haven - read SPSS, Stata, or SAS data
• httr - access data over web APIs
• jsonlite - read JSON
• readxl - read Excel spreadsheets
• rvest - scrape data from the web
• xml2 - read XML
R Built-in Data Sets
R comes with several built-in data sets, which are generally used as demo data for playing with R functions.
Some of the most commonly used R demo data sets:
• mtcars,
• iris,
• ToothGrowth,
• PlantGrowth and
• USArrests.
To see the list of pre-loaded data sets, type the function data():
data()
Loading a built-in R data
Load and print mtcars data as follow:
# Loading
data(mtcars)
# Print the first 6 rows
head(mtcars, 6)
If you want to learn more about the mtcars data set, type this:
?mtcars
mtcars: Motor Trend Car Road Tests
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of
automobile design and performance for 32 automobiles (1973–74 models).
Data Manipulation
Data Scientists spend most of their time cleaning and manipulating data rather than mining or modeling
them for insights. As such, it becomes important to have tools like dplyr which makes data
manipulation faster and easier.
dplyr is also called the grammar of data manipulation.
Tidyr for Data Transformation
The goal of tidyr is to help you create tidy data. Tidy data is data where:
• Every column is a variable.
• Every row is an observation.
• Every cell is a single value.
key functions in the tidyr package:
1. pivot_longer() lengthens data, increasing the number of rows and decreasing the number of columns (i.e., turning
columns into rows)
2. pivot_wider() widens data, increasing the number of columns and decreasing the number of rows (i.e., turning
rows into columns)
3. separate() separates a character column into multiple columns with a regular expression or numeric locations
4. unite() unites multiple columns into one by pasting strings together
Tidy data describes a standard way of storing data that is used wherever possible throughout the tidyverse. If you ensure
that your data is tidy, you’ll spend less time fighting with the tools and more time working on your analysis.
Installation:
# The easiest way to get tidyr is to install the whole tidyverse:
install.packages("tidyverse")
# Alternatively, install just tidyr:
install.packages("tidyr")
library(tidyr)
1. pivot_longer()
This function "lengthens" data, increasing the number of rows and decreasing the number of columns.
The inverse transformation is pivot_wider().
2. pivot_wider()
This function "widens" data, increasing the number of columns and decreasing the number of rows. The inverse
transformation is pivot_longer().
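A minimal sketch of the two pivots (the country/year data frame here is illustrative, not from the slides):

```r
library(tidyr)

# wide data: one row per country, one column per year
wide <- data.frame(country = c("A", "B"),
                   `2020` = c(10, 20),
                   `2021` = c(12, 24),
                   check.names = FALSE)

# pivot_longer(): turn the year columns into rows
long <- pivot_longer(wide, cols = c(`2020`, `2021`),
                     names_to = "year", values_to = "cases")
long

# pivot_wider(): the inverse, turning rows back into columns
pivot_wider(long, names_from = year, values_from = cases)
```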
#create data frame
player <- c('A', 'A', 'B', 'B', 'C', 'C')
df <- data.frame(player,
year=c(1, 2, 1, 2, 1, 2),
stats=c('22-2', '29-3', '18-6', '11-8', '12-5', '19-2'),
date = c("27/01/2015","23/02/2015", "31/03/2015","20/01/2015", "23/02/2015", "31/01/2015"))
#view data frame
df
3. separate()
The separate() function turns a single character column into multiple columns, splitting with a regular expression or at numeric locations.
#separate the date column into date, month and year columns
a <- separate(df, col=date, into=c('date', 'month','year'), sep='/')
print(a)
4. unite()
It merges multiple columns into one. The unite() function is a convenience function to paste together multiple
variable values into one. In essence, it combines several variables of a single observation into one variable.
b <-unite(a,Date, c(date, month, year), sep = ".")
print(b)
Data Manipulation using dplyr:
library(tidyverse)
# OR
library(dplyr)
dplyr verbs
dplyr provides a set of verbs that help us solve the most
common data manipulation challenges while working with
tabular data (data frames, tibbles):
1. select() – used to select columns of interest from a data set
2. arrange() – used to arrange data set values in ascending or
descending order
3. filter() – used to select rows by filtering the data based
on a condition
4. mutate() – used to create new variables or columns whose
values are based on existing columns
5. summarise() – used to perform analysis with commonly used
operations such as min, max, mean, count, etc.
# load flights dataset from nyflights13 which is a tibble
flights <- nycflights13::flights
head(flights)
Tibbles are the core data structure of the
tidyverse and are used to facilitate the display and
analysis of information in a tidy format.
A tibble is a modern form of data frame; data
frames are the most common data structure
used to store data sets in R.
Different ways to create tibbles
1. as_tibble(): creates a tibble from an
existing data frame.
2. tibble(): creates a tibble from
scratch, column by column.
3. Import functions: tidyverse readers such
as readr's read_csv() return tibbles when
importing external data sources such as
CSV files or databases.
4. library(): loads a package's namespace,
making any tibbles it contains available.
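A short sketch of the first two approaches (using the built-in mtcars data):

```r
library(tibble)

# tibble(): build a tibble from scratch;
# later columns can refer to earlier ones
t1 <- tibble(x = 1:3, y = x * 2)
t1

# as_tibble(): convert an existing data frame
t2 <- as_tibble(mtcars)
t2   # prints only the first rows, plus each column's type
```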
1. Select
Select columns with select()
It’s not uncommon to get datasets with hundreds or even thousands of variables. In this case, the first challenge is often
narrowing in on the variables you’re actually interested in. select() allows you to rapidly zoom in on a useful subset using
operations based on the names of the variables.
# Select columns by name
head(select(flights, year, month, day))
# Select all columns except those from year to day (inclusive)
head(select(flights, -(year:day)))
There are a number of helper functions you can use within select():
• starts_with("abc"): matches names that begin with “abc”.
• ends_with("xyz"): matches names that end with “xyz”.
• contains("ijk"): matches names that contain “ijk”.
• matches("(.)\\1"): selects variables that match a regular expression. This one matches any variables that contain
repeated characters.
• num_range("x", 1:3): matches x1, x2 and x3.
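For example, applied to the flights data (a sketch; requires the nycflights13 package):

```r
library(dplyr)
flights <- nycflights13::flights

# all columns whose names start with "dep"
head(select(flights, starts_with("dep")))

# all columns whose names end with "time"
head(select(flights, ends_with("time")))

# all columns containing "arr" anywhere in the name
head(select(flights, contains("arr")))
```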
rename(): it is useful to rename variables.
#Use rename() to rename ‘tailnum’ as ‘tail_num’
head(rename(flights, tail_num = tailnum))
everything(): it is useful if you have a handful of
variables you’d like to move to the start of the data
frame.
#move ‘time_hour’ and ‘air_time’ to the beginning
head(select(flights, time_hour, air_time, everything()))
2. Arrange
Reorder the rows (arrange()).
arrange() works similarly to filter() except that instead of selecting rows, it changes their order. It takes a data frame and
a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each
additional column will be used to break ties in the values of preceding columns:
# Sort data by year, month, day
head(arrange(flights, year, month, day))
#Use desc() to re-order by a column in descending order:
head(arrange(flights, desc(dep_delay)))
#Missing values are always sorted at the end:
df <- tibble(x = c(5, NA, 2))
arrange(df, x)
#> # A tibble: 3 × 1
#>       x
#>   <dbl>
#> 1     2
#> 2     5
#> 3    NA
3. Filtering
Filtering provides a way to help reduce the number of rows in your tibble. When performing filtering, we can
specify conditions or specific criteria that are used to reduce the number of rows in the dataset.
# create flights dataset
flights <- nycflights13::flights
head(flights)
# filter flights data by month
jan_data <- filter(flights, month == 1)
tail(jan_data)
# filter flights data by Date and Month.
Jan1 <- filter(flights, month == 1, day == 1)
head(Jan1)
# Filter all flights that departed in November or December:
nov_dec <- filter(flights, month == 11 | month == 12)
head(nov_dec)
#alternatively
nov_dec <- filter(flights, month %in% c(11, 12))
head(nov_dec)
4. Mutate
Add new variables with mutate()
Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns.
mutate() always adds new columns at the end of your dataset.
flights_sml <- select(flights,
year:day,
ends_with("delay"),
distance,
air_time
)
head(flights_sml)
head(mutate(flights_sml, gain = dep_delay - arr_delay, speed = distance / air_time * 60))
Note that you can refer to columns that you’ve just created:
head(mutate(flights_sml,
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
))
If you only want to keep the new variables, use transmute():
head(transmute(flights,
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
))
Modular arithmetic is a handy tool because it allows you to break integers up into pieces. For example, in the flights
dataset, dep_time is stored in HHMM format, so you can compute hour and minute from it with:
head(transmute(flights,
dep_time,
hour = dep_time %/% 100,
minute = dep_time %% 100
))
5. Summarise
# summarise() collapses a data frame to a single row
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
by_day <- group_by(flights, year, month, day)
head(summarise(by_day, delay = mean(dep_delay, na.rm = TRUE)))
Grouped summaries with summarise()
This changes the unit of analysis from the complete dataset to individual groups. Then,
when you use the dplyr verbs on a grouped data frame they’ll be automatically applied
“by group”
Imagine that we want to explore the relationship between the distance and average delay for each location. Using what
you know about dplyr, you might write code like this:
by_dest <- group_by(flights, dest)
head(delay <- summarise(by_dest,
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
))
Combining multiple operations with the pipe
delays <- flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(count > 20, dest != "HNL")
head(delays)
Data Visualization
Grammar of Graphics
The ggplot2 package, also termed the Grammar of Graphics, is a free, open-source, and
easy-to-use visualization package widely used in R. Written by Hadley Wickham, it is among the most powerful
visualization packages available.
A ggplot2 graphic is governed by several layers:
Building Blocks of layers with the grammar of graphics
1. Data: The element is the data set itself
2. Aesthetics: The data is to map onto the Aesthetics attributes such as x-axis, y-axis, color, fill, size, labels,
alpha, shape, line width, line type
3. Geometrics: How our data is displayed, using point, line, histogram, bar, boxplot
4. Facets: It displays the subset of the data using Columns and rows
5. Statistics: Binning, smoothing, descriptive, intermediate
6. Coordinates: the space between data and display using Cartesian, fixed, polar, limits
7. Themes: Non-data ink
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>),
stat = <STAT>,
position = <POSITION>
) +
<COORDINATE_FUNCTION> +
<FACET_FUNCTION>
The Layered Grammar of Graphics
Data Visualization with R
Data Visualization
mpg data frame found in ggplot2 (aka ggplot2::mpg).
Note:
1. A data frame is a rectangular collection of variables (in the columns) and observations (in the rows).
2. mpg contains observations collected by the US Environmental Protection Agency on 38 models of car.
# to get know about mpg data (Fuel economy data from 1999 to 2008 for 38 popular models of cars)
?mpg
This dataset contains a subset of the fuel economy data that the EPA makes available on <URL: https://fueleconomy.gov/>. It contains only models which
had a new release every year between 1999 and 2008 - this was used as a proxy for the popularity of the car.
A data frame with 234 rows and 11 variables:
1. manufacturer - manufacturer name
2. model - model name
3. displ - engine displacement, in litres
4. year - year of manufacture
5. cyl - number of cylinders
6. trans - type of transmission
7. drv - the type of drive train, where f = front-wheel drive, r = rear wheel drive, 4 = 4wd
8. cty - city miles per gallon
9. hwy - highway miles per gallon
10. fl - fuel type
11. class - "type" of car
1. Column Chart
A column chart is used to show a comparison among different items, or it can show a comparison of items
over time. You could use this format to see the revenue per landing page or customers by close date.
Design Best Practices for Column
Charts:
1. Use consistent colors throughout the
chart, selecting accent colors to
highlight meaningful data points or
changes over time.
2. Use horizontal labels to improve
readability.
3. Start the y-axis at 0 to appropriately
reflect the values in your graph.
A histogram represents the frequencies of values of a variable bucketed into ranges.
A histogram is similar to a bar chart, but it groups the values into continuous ranges. Each bar in a histogram
represents the number of values present in that range.
2. Bar Graph
A bar graph, basically a horizontal column chart, should be used to avoid clutter when one data label is long or
if you have more than 10 items to compare. This type of visualization can also be used to display negative
numbers.
Design Best Practices for Bar
Graphs:
1. Use consistent colors throughout
the chart, selecting accent colors to
highlight meaningful data points or
changes over time.
2. Use horizontal labels to improve
readability.
3. Start the y-axis at 0 to appropriately
reflect the values in your graph.
3. Line Graph
A line graph reveals trends or progress over time and can be used to show many different categories of data.
You should use it when you chart a continuous data set.
Design Best Practices for Line
Graphs:
1. Use solid lines only.
2. Don't plot more than four lines to
avoid visual distractions.
3. Use the right height so the lines
take up roughly 2/3 of the y-axis'
height.
4. Dual Axis Chart
A dual axis chart allows you to plot data using two y-axes and a shared x-axis. It's used with three data sets,
one of which is based on a continuous set of data and another which is better suited to being grouped by
category. This should be used to visualize a correlation or the lack thereof between these three data sets.
Design Best Practices for Dual Axis
Charts:
1. Use the y-axis on the left side for the
primary variable because brains are
naturally inclined to look left first.
2. Use different graphing styles to
illustrate the two data sets, as
illustrated above.
3. Choose contrasting colors for the two
data sets
5. Area Chart
An area chart is basically a line chart, but the space between the x-axis and the line is filled with a color or
pattern. It is useful for showing part-to-whole relations, such as showing individual sales reps' contribution to
total sales for a year. It helps you analyze both overall and individual trend information.
Design Best Practices for Area Charts:
1. Use transparent colors so information
isn't obscured in the background.
2. Don't display more than four
categories to avoid clutter.
3. Organize highly variable data at the
top of the chart to make it easy to
read.
6. Stacked Bar Chart
This should be used to compare many different items and show the composition of each item being compared.
Design Best Practices for
Stacked Bar Graphs:
1. Best used to illustrate part-to-
whole relationships.
2. Use contrasting colors for
greater clarity.
3. Make chart scale large
enough to view group sizes in
relation to one another.
7. Pie Chart
A pie chart shows a static number and how categories represent part of a whole -- the composition of
something. A pie chart represents numbers in percentages, and the total sum of all segments needs to equal
100%.
Design Best Practices for Pie Charts:
1. Don't illustrate too many categories to
ensure differentiation between slices.
2. Ensure that the slice values add up to
100%.
3. Order slices according to their size.
8. Scatter Plot Chart
A scatter plot or scattergram chart will show the relationship between two different variables or it can reveal the
distribution trends. It should be used when there are many different data points, and you want to highlight
similarities in the data set. This is useful when looking for outliers or for understanding the distribution of your
data.
Design Best Practices for Scatter Plots:
1. Include more variables, such as different
sizes, to incorporate more data.
2. Start y-axis at 0 to represent data
accurately.
3. If you use trend lines, only use a
maximum of two to make your plot easy
to understand.
9. Bubble Chart
A bubble chart is similar to a scatter plot in that it can show distribution or relationship. There is a third data set,
which is indicated by the size of the bubble or circle.
Design Best Practices for Bubble
Charts:
1. Scale bubbles according to area,
not diameter.
2. Make sure labels are clear and
visible.
3. Use circular shapes only.
10. Heat Map
A heat map shows the relationship between two items and provides rating information, such as high to low or
poor to excellent. The rating information is displayed using varying colors or saturation.
Design Best Practices for Heat Map:
1. Use a basic and clear map outline to
avoid distracting from the data.
2. Use a single color in varying shades to
show changes in data.
3. Avoid using multiple patterns.
Types of Charts and their Suitability
Creating a ggplot (Scatter Plot)
To plot mpg, run this code to put displ on the x-axis and hwy
on the y-axis:
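The plot referred to here is the canonical mpg scatter plot from R for Data Science:

```r
library(ggplot2)

# displ on the x-axis, hwy on the y-axis; one point per car
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))
```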
Aesthetic
To map an aesthetic to a variable, associate the name of
the aesthetic to the name of the variable inside aes().
ggplot2 will automatically assign a unique level of the
aesthetic (here a unique color) to each unique value of
the variable, a process known as scaling.
ggplot2 will also add a legend that explains which levels
correspond to which values.
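For example, mapping the class variable of mpg to the colour aesthetic:

```r
library(ggplot2)

# ggplot2 assigns one colour per car class (scaling)
# and adds a legend automatically
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))
```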
Facets
One way to add additional variables is with aesthetics.
Another way, particularly useful for categorical
variables, is to split your plot into facets, subplots that
each display one subset of the data.
To facet your plot by a single variable,
use facet_wrap().
The first argument of facet_wrap() should be a
formula, which you create with ~ followed by a variable
name (here “formula” is the name of a data structure
in R, not a synonym for “equation”). The variable that
you pass to facet_wrap() should be discrete.
To facet your plot on the combination of two variables,
add facet_grid() to your plot call. The first argument of
facet_grid() is also a formula. This time the formula should
contain two variable names separated by a ~.
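The two faceting approaches, sketched on the mpg data:

```r
library(ggplot2)

# facet_wrap(): one subplot per value of a single discrete variable
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)

# facet_grid(): facet on the combination of two variables
# (rows by drive train, columns by cylinder count)
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)
```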
Statistical transformations
Bar charts seem simple, but they are interesting because they
reveal something subtle about plots. Consider a basic bar chart,
as drawn with geom_bar().
The following chart displays the total number of diamonds in the
diamonds dataset, grouped by cut.
The diamonds dataset comes in ggplot2 and contains
information about ~54,000 diamonds, including the price, carat,
color, clarity, and cut of each diamond. The chart shows that more
diamonds are available with high quality cuts than with low quality
cuts.
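The chart described above can be drawn with:

```r
library(ggplot2)

# geom_bar() counts the rows in each cut group by default
# (stat = "count"), so no y aesthetic is needed
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
```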
Position adjustments
There’s one more piece of magic associated with bar charts. You can colour a bar chart using either the colour aesthetic, or, more
usefully, fill:
The bars are automatically stacked; each colored rectangle represents a combination of cut and clarity.
The identity position adjustment is more useful for 2d geoms, like points, where it is the default.
position = "fill" works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions
across groups.
position = "dodge" places overlapping objects directly beside one another. This makes it easier to compare individual values.
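The three position adjustments, sketched on the diamonds data:

```r
library(ggplot2)

# stacked (the default): one coloured segment per clarity level
ggplot(diamonds) +
  geom_bar(aes(x = cut, fill = clarity))

# position = "fill": stacked bars scaled to the same height,
# so each bar shows proportions
ggplot(diamonds) +
  geom_bar(aes(x = cut, fill = clarity), position = "fill")

# position = "dodge": bars placed side by side for direct comparison
ggplot(diamonds) +
  geom_bar(aes(x = cut, fill = clarity), position = "dodge")
```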
Histogram
A histogram displays rectangular bars whose heights are proportional to the frequency of a variable within
successive numerical intervals; it is a graphical representation that organizes a group of data points into
specified ranges. A distinguishing feature is that it shows no gaps between the bars, while otherwise
resembling a vertical bar graph.
library(ggplot2)
# Change colors
ggplot(mpg, aes(x=displ)) +
geom_histogram(color="black", fill="white")
Coordinate systems
Coordinate systems are probably the most
complicated part of ggplot2.
The default coordinate system is the Cartesian
coordinate system where the x and y positions act
independently to determine the location of each
point. There are a number of other coordinate
systems that are occasionally helpful.
coord_flip() switches the x and y axes. This is useful
(for example), if you want horizontal boxplots. It’s
also useful for long labels: it’s hard to get them to
fit without overlapping on the x-axis.
coord_polar() uses polar coordinates. Polar coordinates reveal an interesting connection between a bar chart and a
Coxcomb chart.
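The two coordinate systems mentioned above, sketched briefly:

```r
library(ggplot2)

# coord_flip(): horizontal boxplots of highway mileage by class,
# which keeps long category labels readable
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot() +
  coord_flip()

# coord_polar(): a bar chart drawn in polar coordinates
# becomes a Coxcomb chart
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut), width = 1) +
  coord_polar()
```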
Theme Layer
This layer controls the finer points of display like the font size and background color properties.
Dr. Rambabu Palaka
Professor
School of Engineering
Malla Reddy University, Hyderabad
Mobile: +91-9652665840
Email: drrambabu@mallareddyuniversity.ac.in
Reference:
R for Data Science (https://r4ds.had.co.nz)

Contenu connexe

Tendances

Introduction to data analysis using R
Introduction to data analysis using RIntroduction to data analysis using R
Introduction to data analysis using RVictoria López
 
2. R-basics, Vectors, Arrays, Matrices, Factors
2. R-basics, Vectors, Arrays, Matrices, Factors2. R-basics, Vectors, Arrays, Matrices, Factors
2. R-basics, Vectors, Arrays, Matrices, Factorskrishna singh
 
Linear Regression With R
Linear Regression With RLinear Regression With R
Linear Regression With REdureka!
 
Introduction to R and R Studio
Introduction to R and R StudioIntroduction to R and R Studio
Introduction to R and R StudioRupak Roy
 
Statistics For Data Science | Statistics Using R Programming Language | Hypot...
Statistics For Data Science | Statistics Using R Programming Language | Hypot...Statistics For Data Science | Statistics Using R Programming Language | Hypot...
Statistics For Data Science | Statistics Using R Programming Language | Hypot...Edureka!
 
Python Pandas
Python PandasPython Pandas
Python PandasSunil OS
 
R Programming Language
R Programming LanguageR Programming Language
R Programming LanguageNareshKarela1
 
Introduction to R programming
Introduction to R programmingIntroduction to R programming
Introduction to R programmingVictor Ordu
 
Statistics for data scientists
Statistics for  data scientistsStatistics for  data scientists
Statistics for data scientistsAjay Ohri
 
Introduction to R Graphics with ggplot2
Introduction to R Graphics with ggplot2Introduction to R Graphics with ggplot2
Introduction to R Graphics with ggplot2izahn
 
Exploratory data analysis in R - Data Science Club
Exploratory data analysis in R - Data Science ClubExploratory data analysis in R - Data Science Club
Exploratory data analysis in R - Data Science ClubMartin Bago
 
Data tidying with tidyr meetup
Data tidying with tidyr  meetupData tidying with tidyr  meetup
Data tidying with tidyr meetupMatthew Samelson
 
Introduction to Rstudio
Introduction to RstudioIntroduction to Rstudio
Introduction to RstudioOlga Scrivner
 
Introduction to R for data science
Introduction to R for data scienceIntroduction to R for data science
Introduction to R for data scienceLong Nguyen
 

Tendances (20)

R studio
R studio R studio
R studio
 
Introduction to R
Introduction to RIntroduction to R
Introduction to R
 
Introduction to data analysis using R
Introduction to data analysis using RIntroduction to data analysis using R
Introduction to data analysis using R
 
2. R-basics, Vectors, Arrays, Matrices, Factors
2. R-basics, Vectors, Arrays, Matrices, Factors2. R-basics, Vectors, Arrays, Matrices, Factors
Unit 2 - Data Manipulation with R.pptx

  • 5. Procedure: 1. Import: First, you must import your data into R. This typically means that you take data stored in a file, database, or web application programming interface (API), and load it into a data frame in R. If you can’t get your data into R, you can’t do data science on it! 2. Tidy: Once you’ve imported your data, it is a good idea to tidy it. Tidying your data means storing it in a consistent form that matches the semantics of the dataset with the way it is stored. In brief, when your data is tidy, each column is a variable, and each row is an observation. Tidy data is important because the consistent structure lets you focus your struggle on questions about the data, not fighting to get the data into the right form for different functions. 3. Transform: Once you have tidy data, a common first step is to transform it. Transformation includes narrowing in on observations of interest (like all people in one city, or all data from the last year), creating new variables that are functions of existing variables (like computing speed from distance and time), and calculating a set of summary statistics (like counts or means). Together, tidying and transforming are called wrangling.
  • 6. Once you have tidy data with the variables you need, there are two main engines of knowledge generation: 4. Visualization It is a fundamentally human activity. A good visualization will show you things that you did not expect, or raise new questions about the data. A good visualization might also hint that you’re asking the wrong question, or you need to collect different data. Visualizations can surprise you, but don’t scale particularly well because they require a human to interpret them. 5. Modelling Models are complementary tools to visualization. Once you have made your questions sufficiently precise, you can use a model to answer them. Models are a fundamentally mathematical or computational tool, so they generally scale well. Even when they don’t, it’s usually cheaper to buy more computers than it is to buy more brains! But every model makes assumptions, and by its very nature a model cannot question its own assumptions. That means a model cannot fundamentally surprise you. These have complementary strengths and weaknesses so any real analysis will iterate between them many times.
  • 7. 6. Communication The last step of data science is communication, an absolutely critical part of any data analysis project. It doesn’t matter how well your models and visualisation have led you to understand the data unless you can also communicate your results to others. Surrounding all these tools is programming. Programming is a cross-cutting tool that you use in every part of the project. You don’t need to be an expert programmer to be a data scientist, but learning more about programming pays off because becoming a better programmer allows you to automate common tasks, and solve new problems with greater ease.
  • 8. Prerequisites There are four things you need to run the code. 1. R: To download R, go to CRAN, the comprehensive R archive network. CRAN is composed of a set of mirror servers distributed around the world and is used to distribute R and R packages. 2. Rstudio: It is an integrated development environment, or IDE, for R programming. Download and install it from http://www.rstudio.com/download 3. Tidyverse: An R package is a collection of functions, data, and documentation that extends the capabilities of base R. 4. Other Packages *Packages are the fundamental units of reproducible R code. They include reusable functions, the documentation that describes how to use them, and sample data. # Install the complete tidyverse with install.packages("tidyverse")
  • 9. Tidyverse – Collection of R Packages
  • 10. ggplot2 ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details. dplyr dplyr provides a grammar of data manipulation, providing a consistent set of verbs that solve the most common data manipulation challenges. tidyr tidyr provides a set of functions that help you get to tidy data. Tidy data is data with a consistent form: in brief, every variable goes in a column, and every column is a variable. readr readr provides a fast and friendly way to read rectangular data (like csv, tsv, and fwf). It is designed to flexibly parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes.
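The ggplot2 description above can be made concrete with a minimal sketch on the built-in mtcars data set (the plot title and axis labels are invented for the example):

```r
# Minimal ggplot2 example: scatter plot of fuel economy vs. weight
# using the built-in mtcars data set.
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +   # map wt to x, mpg to y
  geom_point() +                              # graphical primitive: points
  labs(title = "Fuel economy vs. weight",
       x = "Weight (1000 lbs)", y = "Miles per gallon")

print(p)   # in RStudio this renders the plot in the Plots pane
```

You provide the data and the aesthetic mappings; ggplot2 takes care of scales, axes, and rendering.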
  • 11. Import Data Before you can manipulate data with R, you need to import the data into R’s memory, or build a connection to the data that R can use to access the data remotely. For example, you can build a connection to data that lives in a database. How you import your data will depend on the format of the data. The most common way to store small data sets is as a plain text file. Data may also be stored in a proprietary format associated with a specific piece of software, such as SAS, SPSS, or Microsoft Excel. Data used on the internet is often stored as a JSON or XML file. Large data sets may be stored in a database or a distributed storage system. The readr package contains the most common functions in the tidyverse for importing data. The readr package is loaded when you run library(tidyverse). The tidyverse also includes the following packages for importing specific types of data. These are not loaded with library(tidyverse). You must load them individually when you need them. • DBI - connect to databases • haven - read SPSS, Stata, or SAS data • httr - access data over web APIs • jsonlite - read JSON • readxl - read Excel spreadsheets • rvest - scrape data from the web • xml2 - read XML
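As a quick illustration of readr, the sketch below parses a small CSV supplied as an in-memory string rather than a file (the I() wrapper marking literal data requires readr 2.0 or later; the column names and values are invented for the example):

```r
library(readr)

# A small CSV held in a string; in practice you would pass a file path.
csv_text <- "name,score\nAnu,90\nRavi,85\n"

# I() tells read_csv() to treat the string as literal data, not a path.
scores <- read_csv(I(csv_text), show_col_types = FALSE)

print(scores)   # a 2-row tibble with columns name and score
```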
  • 12.
  • 13. R Built-in Data Sets R comes with several built-in data sets, which are generally used as demo data for playing with R functions. Some of the most used R demo data sets are: mtcars, iris, ToothGrowth, PlantGrowth and USArrests. To see the list of pre-loaded data sets, type the function data():
  • 14. Loading a built-in R data set Load and print mtcars data as follows: # Loading data(mtcars) # Print the first 6 rows head(mtcars, 6) If you want to learn more about the mtcars data set, type this: ?mtcars mtcars: Motor Trend Car Road Tests The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
  • 15. Data Manipulation Data Scientists spend most of their time cleaning and manipulating data rather than mining or modeling them for insights. As such, it becomes important to have tools like dplyr which makes data manipulation faster and easier. dplyr is also called grammar of data manipulation.
  • 16. Tidyr for Data Transformation The goal of tidyr is to help you create tidy data. Tidy data is data where: • Every column is a variable. • Every row is an observation. • Every cell is a single value.
  • 17. key functions in the tidyr package: 1. pivot_longer() lengthens data, increasing the number of rows and decreasing the number of columns (i.e., turning columns into rows) 2. pivot_wider() widens data, increasing the number of columns and decreasing the number of rows (i.e., turning rows into columns) 3. separate() separates a character column into multiple columns with a regular expression or numeric locations 4. unite() unites multiple columns into one by pasting strings together Tidy data describes a standard way of storing data that is used wherever possible throughout the tidyverse. If you ensure that your data is tidy, you’ll spend less time fighting with the tools and more time working on your analysis. Installation: # The easiest way to get tidyr is to install the whole tidyverse: install.packages("tidyverse") # Alternatively, install just tidyr: install.packages("tidyr") library(tidyr)
  • 18. 1. pivot_longer() This function "lengthens" data, increasing the number of rows and decreasing the number of columns. The inverse transformation is pivot_wider().
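A small sketch of pivot_longer() on an invented grades table (the student names, subjects, and marks are made up for illustration):

```r
library(tidyr)
library(tibble)

# Wide form: one row per student, one column per subject.
grades_wide <- tibble(
  student = c("Anu", "Ravi"),
  maths   = c(88, 92),
  science = c(79, 85)
)

# Lengthen: the subject columns become rows.
grades_long <- pivot_longer(grades_wide,
                            cols      = c(maths, science),
                            names_to  = "subject",
                            values_to = "marks")

print(grades_long)   # 4 rows with columns student, subject, marks
```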
  • 19. 2. pivot_wider() This function "widens" data, increasing the number of columns and decreasing the number of rows. The inverse transformation is pivot_longer().
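Correspondingly, a sketch of pivot_wider() turning long-format grades data (again invented for illustration) into wide form:

```r
library(tidyr)
library(tibble)

# Long form: one row per (student, subject) pair.
grades_long <- tibble(
  student = c("Anu", "Anu", "Ravi", "Ravi"),
  subject = c("maths", "science", "maths", "science"),
  marks   = c(88, 79, 92, 85)
)

# Widen: each distinct subject value becomes its own column.
grades_wide <- pivot_wider(grades_long,
                           names_from  = subject,
                           values_from = marks)

print(grades_wide)   # 2 rows with columns student, maths, science
```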
  • 20. #create data frame player <- c('A', 'A', 'B', 'B', 'C', 'C') df <- data.frame(player, year=c(1, 2, 1, 2, 1, 2), stats=c('22-2', '29-3', '18-6', '11-8', '12-5', '19-2'), date = c("27/01/2015","23/02/2015", "31/03/2015","20/01/2015", "23/02/2015", "31/01/2015")) #view data frame df
  • 21. 3. separate() The separate() function turns a single character column into multiple columns, splitting on a regular expression or at numeric positions. #separate the date column into date, month and year columns a <- separate(df, col=date, into=c('date', 'month','year'), sep='/') print(a) 4. unite() It merges multiple columns into one column. The unite() function is a convenience function to paste together multiple variable values into one. In essence, it combines two or more variables of a single observation into one variable. b <-unite(a, Date, c(date, month, year), sep = ".") print(b)
  • 22. Data Manipulation using dplyr: library(tidyverse) OR library(dplyr) dplyr verbs dplyr provides a set of verbs that help us solve the most common data manipulation challenges while working with tabular data (data frames, tibbles): 1. select( ) – used to select columns of interest from a data set 2. arrange( ) – used to arrange data set values in ascending or descending order 3. filter( ) – used to select rows by filtering the data based on a condition 4. mutate( ) – used to create new variables or columns whose values are based on existing columns 5. summarise( ) – used to perform analysis through commonly used operations such as min, max, mean, count, etc. # load flights dataset from nycflights13 which is a tibble flights <- nycflights13::flights head(flights) Tibbles are the core data structure of the tidyverse and are used to facilitate the display and analysis of information in a tidy format. A tibble is a modern form of data frame, and data frames are the most common data structures used to store data sets in R. Different ways to create tibbles: 1. as_tibble(): used to create a tibble from an existing data frame. 2. tibble(): used to create a tibble from scratch. 3. import(): used to create tibbles from external data sources such as databases or CSV files. 4. library(): used to load the namespace of a package.
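The first two ways of creating tibbles listed above can be sketched as follows (the example columns x and y are invented):

```r
library(tibble)

# 1. tibble(): build a tibble from scratch; later columns may
#    refer to columns defined earlier in the same call.
t1 <- tibble(x = 1:3, y = x * 2)

# 2. as_tibble(): convert an existing data frame (built-in mtcars).
t2 <- as_tibble(mtcars)

print(class(t2))   # a tibble is also a data frame: "tbl_df" "tbl" "data.frame"
```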
  • 23. 1. Select Select columns with select() It’s not uncommon to get datasets with hundreds or even thousands of variables. In this case, the first challenge is often narrowing in on the variables you’re actually interested in. select() allows you to rapidly zoom in on a useful subset using operations based on the names of the variables. # Select columns by name head(select(flights, year, month, day)) # Select all columns except those from year to day (inclusive) head(select(flights, -(year:day))) There are a number of helper functions you can use within select(): • starts_with("abc"): matches names that begin with “abc”. • ends_with("xyz"): matches names that end with “xyz”. • contains("ijk"): matches names that contain “ijk”. • matches("(.)\\1"): selects variables that match a regular expression. This one matches any variables that contain repeated characters. • num_range("x", 1:3): matches x1, x2 and x3.
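The helper functions above work inside select() on any data frame; a small sketch using the built-in iris data (so it runs without the nycflights13 package):

```r
library(dplyr)

# starts_with(): every column whose name begins with "Sepal"
sepal_cols <- select(iris, starts_with("Sepal"))
print(names(sepal_cols))   # "Sepal.Length" "Sepal.Width"

# ends_with(): every column whose name ends with "Width"
width_cols <- select(iris, ends_with("Width"))
print(names(width_cols))   # "Sepal.Width" "Petal.Width"
```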
  • 24. Rename: It is useful to rename variables. #Use rename() to rename ‘tailnum’ as ‘tail_num’ head(rename(flights, tail_num = tailnum)) everything(): it is useful if you have a handful of variables you’d like to move to the start of the data frame. #move ‘time_hour’, ‘air_time’ to the beginning head(select(flights, time_hour, air_time, everything()))
  • 25. 2. Arrange Reorder the rows (arrange()). arrange() works similarly to filter() except that instead of selecting rows, it changes their order. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns: # Sort data by year, month, day head(arrange(flights, year, month, day)) #Use desc() to re-order by a column in descending order: head(arrange(flights, desc(dep_delay))) #Missing values are always sorted at the end: df <- tibble(x = c(5, NA, 2)) arrange(df, x) A tibble: 3 × 1 x <dbl> 2 5 NA
  • 26. 3. Filter
Filtering provides a way to reduce the number of rows in your tibble. When filtering, we specify conditions or criteria that are used to select the rows to keep.
# create flights dataset
flights <- nycflights13::flights
head(flights)
# filter flights data by month
jan_data <- filter(flights, month == 1)
tail(jan_data)
# filter flights data by day and month
Jan1 <- filter(flights, month == 1, day == 1)
head(Jan1)
# filter all flights that departed in November or December
nov_dec <- filter(flights, month == 11 | month == 12)
head(nov_dec)
# alternatively
nov_dec <- filter(flights, month %in% c(11, 12))
head(nov_dec)
  • 27. 4. Mutate
Add new variables with mutate(). Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. mutate() always adds new columns at the end of your dataset.
flights_sml <- select(flights,
  year:day,
  ends_with("delay"),
  distance,
  air_time
)
head(flights_sml)
head(mutate(flights_sml,
  gain = dep_delay - arr_delay,
  speed = distance / air_time * 60
))
  • 28. Note that you can refer to columns that you’ve just created:
head(mutate(flights_sml,
  gain = dep_delay - arr_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
))
If you only want to keep the new variables, use transmute():
head(transmute(flights,
  gain = dep_delay - arr_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
))
  • 29. Modular arithmetic is a handy tool because it allows you to break integers up into pieces. For example, in the flights dataset, you can compute hour and minute from dep_time (stored in HHMM format) with:
head(transmute(flights,
  dep_time,
  hour = dep_time %/% 100,
  minute = dep_time %% 100
))
  • 30. 5. Summarise
# summarise() collapses a data frame to a single row
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
Grouped summaries with summarise(): grouping changes the unit of analysis from the complete dataset to individual groups. When you use the dplyr verbs on a grouped data frame, they are automatically applied “by group”.
by_day <- group_by(flights, year, month, day)
head(summarise(by_day, delay = mean(dep_delay, na.rm = TRUE)))
  • 31. Imagine that we want to explore the relationship between the distance and average delay for each location. Using what you know about dplyr, you might write code like this:
by_dest <- group_by(flights, dest)
delay <- summarise(by_dest,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE)
)
head(delay)
Combining multiple operations with the pipe:
delays <- flights %>%
  group_by(dest) %>%
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(count > 20, dest != "HNL")
head(delays)
  • 32. Data Visualization: Grammar of Graphics
The ggplot2 package in R, also termed the Grammar of Graphics, is a free, open-source, and easy-to-use visualization package widely used in R. Written by Hadley Wickham, it is one of the most powerful visualization packages, and it is built from several layers.
Building blocks of the grammar of graphics:
1. Data: the data set itself
2. Aesthetics: the mapping of data onto aesthetic attributes such as x-axis, y-axis, color, fill, size, labels, alpha, shape, line width, line type
3. Geometries: how the data is displayed, e.g. point, line, histogram, bar, boxplot
4. Facets: display subsets of the data using columns and rows of subplots
5. Statistics: binning, smoothing, descriptive and intermediate statistics
6. Coordinates: the space between data and display, e.g. Cartesian, fixed, polar, limits
7. Themes: all non-data ink
  • 35. The Layered Grammar of Graphics
ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(
    mapping = aes(<MAPPINGS>),
    stat = <STAT>,
    position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>
  • 37. Data Visualization
The mpg data frame is found in ggplot2 (aka ggplot2::mpg).
Note:
1. A data frame is a rectangular collection of variables (in the columns) and observations (in the rows).
2. mpg contains observations collected by the US Environmental Protection Agency on 38 models of car.
# to learn about the mpg data (fuel economy data from 1999 to 2008 for 38 popular models of cars)
?mpg
This dataset contains a subset of the fuel economy data that the EPA makes available at https://fueleconomy.gov/. It contains only models which had a new release every year between 1999 and 2008; this was used as a proxy for the popularity of the car.
A data frame with 234 rows and 11 variables:
1. manufacturer - manufacturer name
2. model - model name
3. displ - engine displacement, in litres
4. year - year of manufacture
5. cyl - number of cylinders
6. trans - type of transmission
7. drv - the type of drive train, where f = front-wheel drive, r = rear-wheel drive, 4 = 4wd
8. cty - city miles per gallon
9. hwy - highway miles per gallon
10. fl - fuel type
11. class - "type" of car
  • 38. 1. Column Chart
A column chart is used to show a comparison among different items, or a comparison of items over time. You could use this format to see the revenue per landing page or customers by close date.
Design Best Practices for Column Charts:
1. Use consistent colors throughout the chart, selecting accent colors to highlight meaningful data points or changes over time.
2. Use horizontal labels to improve readability.
3. Start the y-axis at 0 to appropriately reflect the values in your graph.
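A minimal column-chart sketch using the mpg data introduced above (geom_bar() counts the rows per class; geom_col() could be used instead if the heights were precomputed):

```r
library(ggplot2)

# Column chart: number of car models per class
ggplot(mpg, aes(x = class)) +
  geom_bar(fill = "steelblue")
```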
  • 39. A histogram represents the frequencies of values of a variable bucketed into ranges. A histogram is similar to a bar chart, but it groups the values into continuous ranges. The height of each bar in a histogram represents the number of values present in that range.
  • 40. 2. Bar Graph
A bar graph, basically a horizontal column chart, should be used to avoid clutter when one data label is long or if you have more than 10 items to compare. This type of visualization can also be used to display negative numbers.
Design Best Practices for Bar Graphs:
1. Use consistent colors throughout the chart, selecting accent colors to highlight meaningful data points or changes over time.
2. Use horizontal labels to improve readability.
3. Start the y-axis at 0 to appropriately reflect the values in your graph.
  • 41. 3. Line Graph
A line graph reveals trends or progress over time and can be used to show many different categories of data. You should use it when you chart a continuous data set.
Design Best Practices for Line Graphs:
1. Use solid lines only.
2. Don't plot more than four lines to avoid visual distractions.
3. Use the right height so the lines take up roughly 2/3 of the y-axis' height.
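A small line-graph sketch with the mpg data (note mpg contains only two model years, 1999 and 2008, so each line here is just two points):

```r
library(ggplot2)
library(dplyr)

# Line graph: average highway mileage by model year, one line per drive train
mpg %>%
  group_by(year, drv) %>%
  summarise(avg_hwy = mean(hwy), .groups = "drop") %>%
  ggplot(aes(x = year, y = avg_hwy, colour = drv)) +
  geom_line() +
  geom_point()
```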
  • 42. 4. Dual Axis Chart
A dual axis chart allows you to plot data using two y-axes and a shared x-axis. It's used with three data sets, one of which is based on a continuous set of data and another which is better suited to being grouped by category. Use it to visualize a correlation, or the lack thereof, between these data sets.
Design Best Practices for Dual Axis Charts:
1. Use the y-axis on the left side for the primary variable because brains are naturally inclined to look left first.
2. Use different graphing styles to illustrate the two data sets, as illustrated above.
3. Choose contrasting colors for the two data sets.
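ggplot2 has no free-floating second y-axis; sec_axis() defines the right-hand axis as a transformation of the left one. A sketch with mpg (the scale factor 5 is an arbitrary choice to make the two series comparable on this data):

```r
library(ggplot2)
library(dplyr)

by_class <- mpg %>%
  group_by(class) %>%
  summarise(cty = mean(cty), displ = mean(displ))

ggplot(by_class, aes(x = class)) +
  geom_col(aes(y = cty), fill = "grey70") +                   # primary series (left axis)
  geom_line(aes(y = displ * 5, group = 1), colour = "red") +  # secondary series, rescaled
  scale_y_continuous(
    name = "Average city mpg",
    sec.axis = sec_axis(~ . / 5, name = "Average displacement (litres)")
  )
```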
  • 43. 5. Area Chart
An area chart is basically a line chart, but the space between the x-axis and the line is filled with a color or pattern. It is useful for showing part-to-whole relations, such as showing individual sales reps' contributions to total sales for a year. It helps you analyze both overall and individual trend information.
Design Best Practices for Area Charts:
1. Use transparent colors so information isn't obscured in the background.
2. Don't display more than four categories to avoid clutter.
3. Organize highly variable data at the top of the chart to make it easy to read.
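A sketch of a stacked area chart built from binned counts of the mpg data (the transparency follows best practice 1 above):

```r
library(ggplot2)

# Area chart: highway mileage distribution per drive train, stacked areas
ggplot(mpg, aes(x = hwy, fill = drv)) +
  geom_area(stat = "bin", bins = 15, alpha = 0.6)
```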
  • 44. 6. Stacked Bar Chart
This should be used to compare many different items and show the composition of each item being compared.
Design Best Practices for Stacked Bar Graphs:
1. Best used to illustrate part-to-whole relationships.
2. Use contrasting colors for greater clarity.
3. Make the chart scale large enough to view group sizes in relation to one another.
  • 45. 7. Pie Chart
A pie chart shows a static number and how categories represent part of a whole -- the composition of something. A pie chart represents numbers in percentages, and the total sum of all segments needs to equal 100%.
Design Best Practices for Pie Charts:
1. Don't illustrate too many categories, to ensure differentiation between slices.
2. Ensure that the slice values add up to 100%.
3. Order slices according to their size.
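ggplot2 has no dedicated pie geom; the usual idiom is a single stacked bar transformed to polar coordinates:

```r
library(ggplot2)

# Pie chart: share of car models per class
ggplot(mpg, aes(x = "", fill = class)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y") +
  theme_void()  # remove axes, which carry no meaning here
```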
  • 46. 8. Scatter Plot Chart
A scatter plot or scattergram chart shows the relationship between two different variables, or it can reveal distribution trends. It should be used when there are many different data points and you want to highlight similarities in the data set. This is useful when looking for outliers or for understanding the distribution of your data.
Design Best Practices for Scatter Plots:
1. Include more variables, such as different sizes, to incorporate more data.
2. Start the y-axis at 0 to represent data accurately.
3. If you use trend lines, only use a maximum of two to make your plot easy to understand.
  • 47. 9. Bubble Chart
A bubble chart is similar to a scatter plot in that it can show distribution or relationship. There is a third data set, which is indicated by the size of the bubble or circle.
Design Best Practices for Bubble Charts:
1. Scale bubbles according to area, not diameter.
2. Make sure labels are clear and visible.
3. Use circular shapes only.
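A bubble-chart sketch with mpg, mapping the third variable to the size aesthetic:

```r
library(ggplot2)

# Bubble chart: displacement vs highway mpg, bubble size = number of cylinders
ggplot(mpg, aes(x = displ, y = hwy, size = cyl)) +
  geom_point(alpha = 0.4, colour = "darkblue") +
  scale_size_area()  # scale bubbles by area, not diameter (best practice 1)
```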
  • 48. 10. Heat Map
A heat map shows the relationship between two items and provides rating information, such as high to low or poor to excellent. The rating information is displayed using varying colors or saturation.
Design Best Practices for Heat Maps:
1. Use a basic and clear map outline to avoid distracting from the data.
2. Use a single color in varying shades to show changes in data.
3. Avoid using multiple patterns.
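A heat-map sketch using geom_tile() on mpg, with a single hue in varying shades per best practice 2:

```r
library(ggplot2)
library(dplyr)

# Heat map: average city mpg for each drive-train / class combination
mpg %>%
  group_by(drv, class) %>%
  summarise(cty = mean(cty), .groups = "drop") %>%
  ggplot(aes(x = class, y = drv, fill = cty)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "darkgreen")
```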
  • 49. Types of Charts and their Suitability
  • 52. Creating a ggplot (Scatter Plot) To plot mpg, run this code to put displ on the x-axis and hwy on the y-axis:
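The code this slide refers to (as given in R for Data Science):

```r
library(ggplot2)

# Scatter plot: engine displacement on the x-axis, highway mileage on the y-axis
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))
```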
  • 53. Aesthetic To map an aesthetic to a variable, associate the name of the aesthetic to the name of the variable inside aes(). ggplot2 will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable, a process known as scaling. ggplot2 will also add a legend that explains which levels correspond to which values.
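For example, mapping the colour aesthetic to class (from R for Data Science):

```r
library(ggplot2)

# Each class gets a unique colour, and ggplot2 adds a legend automatically
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, colour = class))
```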
  • 56. Facets
One way to add additional variables is with aesthetics. Another way, particularly useful for categorical variables, is to split your plot into facets: subplots that each display one subset of the data.
To facet your plot by a single variable, use facet_wrap(). The first argument of facet_wrap() should be a formula, which you create with ~ followed by a variable name (here “formula” is the name of a data structure in R, not a synonym for “equation”). The variable that you pass to facet_wrap() should be discrete.
  • 57. To facet your plot on the combination of two variables, add facet_grid() to your plot call. The first argument of facet_grid() is also a formula. This time the formula should contain two variable names separated by a ~.
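The two faceting approaches described above, applied to mpg (as in R for Data Science):

```r
library(ggplot2)

# facet_wrap(): one subplot per car class
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_wrap(~ class, nrow = 2)

# facet_grid(): rows = drive train, columns = number of cylinders
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl)
```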
  • 58. Statistical transformations Bar charts seem simple, but they are interesting because they reveal something subtle about plots. Consider a basic bar chart, as drawn with geom_bar(). The following chart displays the total number of diamonds in the diamonds dataset, grouped by cut. The diamonds dataset comes in ggplot2 and contains information about ~54,000 diamonds, including the price, carat, color, clarity, and cut of each diamond. The chart shows that more diamonds are available with high quality cuts than with low quality cuts.
  • 59. Position adjustments There’s one more piece of magic associated with bar charts. You can colour a bar chart using either the colour aesthetic, or, more usefully, fill:
  • 60. The bars are automatically stacked. Each colored rectangle represents a combination of cut and clarity.
  • 61. The identity position adjustment is more useful for 2d geoms, like points, where it is the default. position = "fill" works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups.
  • 62. position = "dodge" places overlapping objects directly beside one another. This makes it easier to compare individual values.
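The three position adjustments side by side, on the diamonds bar chart of cut filled by clarity:

```r
library(ggplot2)

ggplot(diamonds, aes(x = cut, fill = clarity)) + geom_bar()                    # stacked (default)
ggplot(diamonds, aes(x = cut, fill = clarity)) + geom_bar(position = "fill")   # equal-height proportions
ggplot(diamonds, aes(x = cut, fill = clarity)) + geom_bar(position = "dodge")  # side-by-side bars
```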
  • 63. Histogram
A histogram is a graphical representation that organizes a group of data points into specified ranges: it displays rectangular bars whose heights are proportional to the frequency of a variable within successive numerical intervals. It is similar to a vertical bar graph, but with no gaps between the bars.
library(ggplot2)
# Change colors
ggplot(mpg, aes(x = displ)) +
  geom_histogram(color = "black", fill = "white")
  • 64. Coordinate Systems
Coordinate systems are probably the most complicated part of ggplot2. The default coordinate system is the Cartesian coordinate system, where the x and y positions act independently to determine the location of each point. There are a number of other coordinate systems that are occasionally helpful.
coord_flip() switches the x and y axes. This is useful, for example, if you want horizontal boxplots. It's also useful for long labels: it's hard to get them to fit without overlapping on the x-axis.
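A coord_flip() sketch: boxplots of highway mileage by class, flipped so the long class labels stay readable:

```r
library(ggplot2)

# Horizontal boxplots: class labels now sit on the (readable) vertical axis
ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot() +
  coord_flip()
```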
  • 66. coord_polar() uses polar coordinates. Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart.
  • 67. Theme Layer This layer controls the finer points of display like the font size and background color properties.
  • 68. Dr. Rambabu Palaka Professor School of Engineering Malla Reddy University, Hyderabad Mobile: +91-9652665840 Email: drrambabu@mallareddyuniversity.ac.in Reference: R for Data Science (https://r4ds.had.co.nz)