1. DATA MANIPULATION WITH R
(DATA VISUALIZATION)
Dr. P. Rambabu, M. Tech., Ph.D., F.I.E.
24-Feb-2022
2. Topics
1. Introduction to Data Science
2. Prerequisites (tidyverse)
3. Import Data (readr)
4. Data Tidying (tidyr)
a) pivot_longer(), pivot_wider()
b) separate(), unite()
5. Data Transformation (dplyr - Grammar of Manipulation)
a) arrange()
b) filter()
c) select()
d) mutate()
e) summarise()
6. Data Visualization (ggplot - Grammar of Graphics)
a) Column Chart, Stacked Column Graph, Bar Graph
b) Line Graph, Dual Axis Chart, Area Chart
c) Pie Chart, Heat Map
d) Scatter Chart, Bubble Chart
3. “The simple graph has brought more information to the
data analyst’s mind than any other device.” — John
Tukey
4. Introduction to Data Science
Data science is an exciting discipline that allows you to turn raw data into understanding,
insight, and knowledge.
Typical Data Science Project will follow the below process:
5. Procedure:
1. Import:
First you must import your data into R. This typically means that you take data stored in a
file, database, or web application programming interface (API), and load it into a data frame
in R. If you can’t get your data into R, you can’t do data science on it!
2. Tidy:
Once you’ve imported your data, it is a good idea to tidy it. Tidying your data means storing
it in a consistent form that matches the semantics of the dataset with the way it is stored. In
brief, when your data is tidy, each column is a variable, and each row is an observation. Tidy
data is important because the consistent structure lets you focus your struggle on questions
about the data, not fighting to get the data into the right form for different functions.
3. Transform:
Once you have tidy data, a common first step is to transform it. Transformation includes
narrowing in on observations of interest (like all people in one city, or all data from the last
year), creating new variables that are functions of existing variables (like computing speed
from distance and time), and calculating a set of summary statistics (like counts or means).
Together, tidying and transforming are called wrangling.
6. Once you have tidy data with the variables you need, there are two main engines of
knowledge generation:
4. Visualization
It is a fundamentally human activity. A good visualization will show you things that you did not
expect, or raise new questions about the data. A good visualization might also hint that you’re
asking the wrong question, or you need to collect different data. Visualizations can surprise
you, but don’t scale particularly well because they require a human to interpret them.
5. Modelling
Models are complementary tools to visualization. Once you have made your questions
sufficiently precise, you can use a model to answer them. Models are a fundamentally
mathematical or computational tool, so they generally scale well. Even when they don’t, it’s
usually cheaper to buy more computers than it is to buy more brains! But every model makes
assumptions, and by its very nature a model cannot question its own assumptions. That
means a model cannot fundamentally surprise you.
These have complementary strengths and weaknesses so any real analysis will iterate
between them many times.
7. 6. Communication
The last step of data science is communication, an absolutely critical part of any
data analysis project. It doesn’t matter how well your models and visualisation have
led you to understand the data unless you can also communicate your results to
others.
Surrounding all these tools is programming. Programming is a cross-cutting tool
that you use in every part of the project. You don’t need to be an expert programmer
to be a data scientist, but learning more about programming pays off because
becoming a better programmer allows you to automate common tasks, and solve
new problems with greater ease.
8. Prerequisites
There are four things you need to run the code.
1. R: To download R, go to CRAN, the comprehensive R archive network. CRAN is composed of a set
of mirror servers distributed around the world and is used to distribute R and R packages.
2. Rstudio: It is an integrated development environment, or IDE, for R programming. Download and
install it from http://www.rstudio.com/download
3. Tidyverse: The tidyverse is a collection of R packages designed for data science. An R package is a collection of functions, data, and documentation that extends the
capabilities of base R.
4. Other Packages
*Packages are the fundamental units of reproducible R code. They include reusable functions, the
documentation that describes how to use them, and sample data.
# Install the complete tidyverse with
install.packages("tidyverse")
10. ggplot2
ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell
ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
dplyr
dplyr provides a grammar of data manipulation, providing a consistent set of verbs that solve the most common data
manipulation challenges.
tidyr
tidyr provides a set of functions that help you get to tidy data. Tidy data is data with a consistent form: in brief, every
variable goes in a column, and every column is a variable.
readr
readr provides a fast and friendly way to read rectangular data (like csv, tsv, and fwf). It is designed to flexibly parse
many types of data found in the wild, while still cleanly failing when data unexpectedly changes.
11. Import Data
Before you can manipulate data with R, you need to import the data into R’s memory, or build a connection to the data
that R can use to access the data remotely. For example, you can build a connection to data that lives in a database.
How you import your data will depend on the format of the data. The most common way to store small data sets is as
a plain text file. Data may also be stored in a proprietary format associated with a specific piece of software, such as
SAS, SPSS, or Microsoft Excel. Data used on the internet is often stored as a JSON or XML file. Large data sets may be
stored in a database or a distributed storage system.
The readr package contains the most common functions in the tidyverse for importing data. The readr package is
loaded when you run library(tidyverse). The tidyverse also includes the following packages for importing specific types
of data. These are not loaded with library(tidyverse). You must load them individually when you need them.
• DBI - connect to databases
• haven - read SPSS, Stata, or SAS data
• httr - access data over web APIs
• jsonlite - read JSON
• readxl - read Excel spreadsheets
• rvest - scrape data from the web
• xml2 - read XML
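As a small sketch of importing a plain text file with readr (the file is written to a temporary path here so the example is self-contained; in practice you would pass the path to your own data file):

```r
library(readr)

# Write a small CSV to a temporary file so the example is self-contained
path <- tempfile(fileext = ".csv")
writeLines(c("year,month,day", "2013,1,1", "2013,1,2"), path)

# read_csv() reads a comma-separated file into a tibble,
# guessing column types and reporting its choices
df <- read_csv(path)
print(df)
```

read_tsv() and read_fwf() work analogously for tab-separated and fixed-width files.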
13. R Built-in Data Sets
R comes with several built-in data sets, which are generally used as demo data for playing with R functions.
Some of the most used R demo data sets:
• mtcars,
• iris,
• ToothGrowth,
• PlantGrowth and
• USArrests.
To see the list of pre-loaded data sets, type the function data():
data()
14. Loading a built-in R data
Load and print mtcars data as follow:
# Loading
data(mtcars)
# Print the first 6 rows
head(mtcars, 6)
If you want to learn more about the mtcars data set, type this:
?mtcars
mtcars: Motor Trend Car Road Tests
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of
automobile design and performance for 32 automobiles (1973–74 models).
15. Data Manipulation
Data Scientists spend most of their time cleaning and manipulating data rather than mining or modeling
them for insights. As such, it becomes important to have tools like dplyr which makes data
manipulation faster and easier.
dplyr is also called grammar of data manipulation.
16. Tidyr for Data Transformation
The goal of tidyr is to help you create tidy data. Tidy data is data where:
• Every column is a variable.
• Every row is an observation.
• Every cell is a single value.
17. Key functions in the tidyr package:
1. pivot_longer() lengthens data, increasing the number of rows and decreasing the number of columns (i.e., turning
columns into rows)
2. pivot_wider() widens data, increasing the number of columns and decreasing the number of rows (i.e., turning
rows into columns)
3. separate() separates a character column into multiple columns with a regular expression or numeric locations
4. unite() unites multiple columns into one by pasting strings together
Tidy data describes a standard way of storing data that is used wherever possible throughout the tidyverse. If you ensure
that your data is tidy, you’ll spend less time fighting with the tools and more time working on your analysis.
Installation:
# The easiest way to get tidyr is to install the whole tidyverse:
install.packages("tidyverse")
# Alternatively, install just tidyr:
install.packages("tidyr")
library(tidyr)
18. 1. pivot_longer()
This function "lengthens" data, increasing the number of rows and decreasing the number of columns.
The inverse transformation is pivot_wider().
19. 2. pivot_wider()
This function "widens" data, increasing the number of columns and decreasing the number of rows. The inverse
transformation is pivot_longer().
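A minimal sketch of the two pivots, using a small made-up data frame (the country/year values are hypothetical):

```r
library(tidyr)

# Wide data: one row per country, one column per year
wide <- data.frame(country = c("A", "B"),
                   `2021` = c(10, 20),
                   `2022` = c(15, 25),
                   check.names = FALSE)

# pivot_longer(): turn the year columns into rows
long <- pivot_longer(wide, cols = c("2021", "2022"),
                     names_to = "year", values_to = "value")

# pivot_wider(): the inverse transformation, back to one column per year
wide2 <- pivot_wider(long, names_from = year, values_from = value)
print(long)
```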
20. #create data frame
player <- c('A', 'A', 'B', 'B', 'C', 'C')
df <- data.frame(player,
year=c(1, 2, 1, 2, 1, 2),
stats=c('22-2', '29-3', '18-6', '11-8', '12-5', '19-2'),
date = c("27/01/2015","23/02/2015", "31/03/2015","20/01/2015", "23/02/2015", "31/01/2015"))
#view data frame
df
21. 3. separate()
The separate() function turns a single character column into multiple columns, splitting with a regular expression or at numeric locations.
#separate date column into date, month and year columns
a <- separate(df, col=date, into=c('date', 'month','year'), sep='/')
print(a)
4. unite()
It merges multiple columns into one. The unite() function is a convenience function to paste together multiple
variable values into one. In essence, it combines several variables of a single observation into one variable.
b <-unite(a,Date, c(date, month, year), sep = ".")
print(b)
22. Data Manipulation using dplyr:
dplyr provides a set of verbs that help us solve the most
common data manipulation challenges while working with
tabular data (data frames, tibbles). Load it with
library(tidyverse) or library(dplyr).
1. select() – used to select columns of interest from a data set
2. arrange() – used to arrange data set values in ascending or
descending order
3. filter() – used to select rows by filtering the data based
on a condition
4. mutate() – used to create new variables or columns whose
values are based on existing columns
5. summarise() – used to perform analysis with commonly used
operations such as min, max, mean, count, etc.
# load flights dataset from nycflights13, which is a tibble
flights <- nycflights13::flights
head(flights)
Tibbles are the core data structure of the
tidyverse and are used to facilitate the display and
analysis of information in a tidy format.
A tibble is a modern form of data frame; data
frames are the most common data structures
used to store data sets in R.
Different ways to create tibbles
1. as_tibble(): creates a tibble from an
existing data frame.
2. tibble(): creates a tibble from
scratch.
3. Importing: tidyverse import functions such as
read_csv() return tibbles when reading external
data sources such as CSV files or databases.
4. library(): loads the namespace of a
package (e.g., library(tibble)) so these
functions are available.
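The first two approaches can be sketched as follows (the column values are made up for illustration):

```r
library(tibble)

# tibble(): build a tibble from scratch
t1 <- tibble(x = 1:3, y = c("a", "b", "c"))

# as_tibble(): convert an existing data frame
t2 <- as_tibble(data.frame(x = 1:3, y = c("a", "b", "c")))
print(t1)
```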
23. 1. Select
Select columns with select()
It’s not uncommon to get datasets with hundreds or even thousands of variables. In this case, the first challenge is often
narrowing in on the variables you’re actually interested in. select() allows you to rapidly zoom in on a useful subset using
operations based on the names of the variables.
# Select columns by name
head(select(flights, year, month, day))
# Select all columns except those from year to day (inclusive)
head(select(flights, -(year:day)))
There are a number of helper functions you can use within select():
• starts_with("abc"): matches names that begin with “abc”.
• ends_with("xyz"): matches names that end with “xyz”.
• contains("ijk"): matches names that contain “ijk”.
• matches("(.)\\1"): selects variables that match a regular expression. This one matches any variables that contain
repeated characters.
• num_range("x", 1:3): matches x1, x2 and x3.
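A couple of these helpers in action on the flights data (a sketch, assuming the nycflights13 package is installed):

```r
library(dplyr)

flights <- nycflights13::flights

# Columns whose names start with "dep": dep_time and dep_delay
dep_cols <- select(flights, starts_with("dep"))
head(dep_cols)

# Columns whose names contain "time"
head(select(flights, contains("time")))
```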
24. Rename: rename() is useful for renaming variables.
#Use rename() to rename ‘tailnum’ as ‘tail_num’
head(rename(flights, tail_num = tailnum))
everything(): it is useful if you have a handful of
variables you’d like to move to the start of the data
frame.
#move ‘time_hour’ and ‘air_time’ to the beginning
head(select(flights, time_hour, air_time,
everything()))
25. 2. Arrange
Reorder the rows (arrange()).
arrange() works similarly to filter() except that instead of selecting rows, it changes their order. It takes a data frame and
a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each
additional column will be used to break ties in the values of preceding columns:
# Sort data by year, month, day
head(arrange(flights, year, month, day))
#Use desc() to re-order by a column in descending order:
head(arrange(flights, desc(dep_delay)))
#Missing values are always sorted at the end:
df <- tibble(x = c(5, NA, 2))
arrange(df, x)
# A tibble: 3 × 1
      x
  <dbl>
1     2
2     5
3    NA
26. 3. Filtering
Filtering provides a way to help reduce the number of rows in your tibble. When performing filtering, we can
specify conditions or specific criteria that are used to reduce the number of rows in the dataset.
# create flights dataset
flights <- nycflights13::flights
head(flights)
# filter flights data by month
jan_data <- filter(flights, month == 1)
tail(jan_data)
# filter flights data by Date and Month.
Jan1 <- filter(flights, month == 1, day == 1)
head(Jan1)
# Filter all flights that departed in November or December:
nov_dec <- filter(flights, month == 11 | month == 12)
head(nov_dec)
#alternatively
nov_dec <- filter(flights, month %in% c(11, 12))
head(nov_dec)
27. 4. Mutate
Add new variables with mutate()
Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns.
mutate() always adds new columns at the end of your dataset.
flights_sml <- select(flights,
year:day,
ends_with("delay"),
distance,
air_time
)
head(flights_sml)
head(mutate(flights_sml, gain = dep_delay - arr_delay, speed = distance / air_time * 60))
28. Note that you can refer to columns that you’ve just created:
head(mutate(flights_sml,
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
))
If you only want to keep the new variables, use transmute():
head(transmute(flights,
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
))
29. Modular arithmetic is a handy tool because it allows you to break integers up into pieces. For example, in the flights
dataset, dep_time is stored as an integer such as 517 (meaning 5:17), so you can compute hour and minute from dep_time with
integer division (%/%) and remainder (%%):
head(transmute(flights,
dep_time,
hour = dep_time %/% 100,
minute = dep_time %% 100
))
30. 5. Summarise
# summarize - it collapses a data frame to a single row
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
by_day <- group_by(flights, year, month, day)
head(summarise(by_day, delay = mean(dep_delay, na.rm = TRUE)))
Grouped summaries with summarise()
This changes the unit of analysis from the complete dataset to individual groups. Then,
when you use the dplyr verbs on a grouped data frame they’ll be automatically applied
“by group”
31. Imagine that we want to explore the relationship between the distance and average delay for each location. Using what
you know about dplyr, you might write code like this:
by_dest <- group_by(flights, dest)
head(delay <- summarise(by_dest,
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
))
Combining multiple operations with the pipe:
delays <- flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(count > 20, dest != "HNL")
head(delays)
32. Data Visualization
Grammar of Graphics
The ggplot2 package in R, an implementation of the Grammar of Graphics, is a free, open-source, and
easy-to-use visualization package widely used in R. It is a powerful visualization package written by
Hadley Wickham.
A ggplot2 plot is built up from several layers. The layers are as follows:
Building blocks of layers with the grammar of graphics
1. Data: the data set itself
2. Aesthetics: the data is mapped onto aesthetic attributes such as x-axis, y-axis, color, fill, size, labels,
alpha, shape, line width, line type
3. Geometries: how the data is displayed, using points, lines, histograms, bars, boxplots
4. Facets: display subsets of the data using columns and rows
5. Statistics: binning, smoothing, descriptive and intermediate summaries
6. Coordinates: the space between data and display, using Cartesian, fixed or polar coordinates, and limits
7. Themes: all non-data ink
35. ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>),
stat = <STAT>,
position = <POSITION>
) +
<COORDINATE_FUNCTION> +
<FACET_FUNCTION>
The Layered Grammar of Graphics
37. Data Visualization
mpg data frame found in ggplot2 (aka ggplot2::mpg).
Note:
1. A data frame is a rectangular collection of variables (in the columns) and observations (in the rows).
2. mpg contains observations collected by the US Environmental Protection Agency on 38 models of car.
# to get know about mpg data (Fuel economy data from 1999 to 2008 for 38 popular models of cars)
?mpg
This dataset contains a subset of the fuel economy data that the EPA makes available on <URL: https://fueleconomy.gov/>. It contains only models which
had a new release every year between 1999 and 2008 - this was used as a proxy for the popularity of the car.
A data frame with 234 rows and 11 variables:
1. manufacturer - manufacturer name
2. model - model name
3. displ - engine displacement, in litres
4. year - year of manufacture
5. cyl - number of cylinders
6. trans - type of transmission
7. drv - the type of drive train, where f = front-wheel drive, r = rear wheel drive, 4 = 4wd
8. cty - city miles per gallon
9. hwy - highway miles per gallon
10. fl - fuel type
11. class - "type" of car
38. 1. Column Chart
A column chart is used to show a comparison among different items, or it can show a comparison of items
over time. You could use this format to see the revenue per landing page or customers by close date.
Design Best Practices for Column
Charts:
1. Use consistent colors throughout the
chart, selecting accent colors to
highlight meaningful data points or
changes over time.
2. Use horizontal labels to improve
readability.
3. Start the y-axis at 0 to appropriately
reflect the values in your graph.
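As an illustration, a simple column chart with ggplot2's geom_col(); the revenue-per-landing-page numbers below are made up:

```r
library(ggplot2)

# Hypothetical revenue per landing page
revenue <- data.frame(page = c("Home", "Pricing", "Blog"),
                      value = c(120, 80, 45))

# geom_col() draws one column per row; the y-axis starts at 0 by default
p <- ggplot(revenue, aes(x = page, y = value)) +
  geom_col(fill = "steelblue")
print(p)
```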
39. A histogram represents the frequencies of values of a variable bucketed into ranges.
A histogram is similar to a bar chart, but the difference is that it groups the values into continuous ranges. Each bar in a histogram
represents the number of values present in that range.
40. 2. Bar Graph
A bar graph, basically a horizontal column chart, should be used to avoid clutter when one data label is long or
if you have more than 10 items to compare. This type of visualization can also be used to display negative
numbers
Design Best Practices for Bar
Graphs:
1. Use consistent colors throughout
the chart, selecting accent colors to
highlight meaningful data points or
changes over time.
2. Use horizontal labels to improve
readability.
3. Start the y-axis at 0 to appropriately
reflect the values in your graph.
41. 3. Line Graph
A line graph reveals trends or progress over time and can be used to show many different categories of data.
You should use it when you chart a continuous data set.
Design Best Practices for Line
Graphs:
1. Use solid lines only.
2. Don't plot more than four lines to
avoid visual distractions.
3. Use the right height so the lines
take up roughly 2/3 of the y-axis'
height.
42. 4. Dual Axis Chart
A dual axis chart allows you to plot data using two y-axes and a shared x-axis. It's used with three data sets,
one of which is based on a continuous set of data and another which is better suited to being grouped by
category. This should be used to visualize a correlation or the lack thereof between these three data sets.
Design Best Practices for Dual Axis
Charts:
1. Use the y-axis on the left side for the
primary variable because brains are
naturally inclined to look left first.
2. Use different graphing styles to
illustrate the two data sets, as
illustrated above.
3. Choose contrasting colors for the two
data sets
43. 5. Area Chart
An area chart is basically a line chart, but the space between the x-axis and the line is filled with a color or
pattern. It is useful for showing part-to-whole relations, such as showing individual sales reps' contribution to
total sales for a year. It helps you analyze both overall and individual trend information.
Design Best Practices for Area Charts:
1. Use transparent colors so information
isn't obscured in the background.
2. Don't display more than four
categories to avoid clutter.
3. Organize highly variable data at the
top of the chart to make it easy to
read.
44. 6. Stacked Bar Chart
This should be used to compare many different items and show the composition of each item being compared.
Design Best Practices for
Stacked Bar Graphs:
1. Best used to illustrate part-to-
whole relationships.
2. Use contrasting colors for
greater clarity.
3. Make chart scale large
enough to view group sizes in
relation to one another.
45. 7. Pie Chart
A pie chart shows a static number and how categories represent part of a whole -- the composition of
something. A pie chart represents numbers in percentages, and the total sum of all segments needs to equal
100%.
Design Best Practices for Pie Charts:
1. Don't illustrate too many categories to
ensure differentiation between slices.
2. Ensure that the slice values add up to
100%.
3. Order slices according to their size.
46. 8. Scatter Plot Chart
A scatter plot or scattergram chart will show the relationship between two different variables or it can reveal the
distribution trends. It should be used when there are many different data points, and you want to highlight
similarities in the data set. This is useful when looking for outliers or for understanding the distribution of your
data.
Design Best Practices for Scatter Plots:
1. Include more variables, such as different
sizes, to incorporate more data.
2. Start y-axis at 0 to represent data
accurately.
3. If you use trend lines, only use a
maximum of two to make your plot easy
to understand.
47. 9. Bubble Chart
A bubble chart is similar to a scatter plot in that it can show distribution or relationship. There is a third data set,
which is indicated by the size of the bubble or circle.
Design Best Practices for Bubble
Charts:
1. Scale bubbles according to area,
not diameter.
2. Make sure labels are clear and
visible.
3. Use circular shapes only.
48. 10. Heat Map
A heat map shows the relationship between two items and provides rating information, such as high to low or
poor to excellent. The rating information is displayed using varying colors or saturation.
Design Best Practices for Heat Map:
1. Use a basic and clear map outline to
avoid distracting from the data.
2. Use a single color in varying shades to
show changes in data.
3. Avoid using multiple patterns.
52. Creating a ggplot (Scatter Plot)
To plot mpg, run this code to put displ on the x-axis and hwy
on the y-axis:
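The code referred to (as in R for Data Science) is:

```r
library(ggplot2)

# Scatter plot of the mpg data: displ on the x-axis, hwy on the y-axis
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))
print(p)
```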
53. Aesthetic
To map an aesthetic to a variable, associate the name of
the aesthetic to the name of the variable inside aes().
ggplot2 will automatically assign a unique level of the
aesthetic (here a unique color) to each unique value of
the variable, a process known as scaling.
ggplot2 will also add a legend that explains which levels
correspond to which values.
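For example, mapping the class variable of mpg to the color aesthetic:

```r
library(ggplot2)

# ggplot2 assigns a unique color to each value of class (scaling)
# and adds a legend explaining the mapping
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))
print(p)
```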
56. Facets
One way to add additional variables is with aesthetics.
Another way, particularly useful for categorical
variables, is to split your plot into facets, subplots that
each display one subset of the data.
To facet your plot by a single variable,
use facet_wrap().
The first argument of facet_wrap() should be a
formula, which you create with ~ followed by a variable
name (here “formula” is the name of a data structure
in R, not a synonym for “equation”). The variable that
you pass to facet_wrap() should be discrete.
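For example, faceting the mpg scatter plot by the discrete variable class:

```r
library(ggplot2)

# One subplot per value of class, laid out in two rows
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)
print(p)
```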
57. To facet your plot on the combination of two variables,
add facet_grid() to your plot call. The first argument of
facet_grid() is also a formula. This time the formula should
contain two variable names separated by a ~.
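For example, faceting on the combination of drv and cyl:

```r
library(ggplot2)

# Rows of subplots defined by drv, columns by cyl
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)
print(p)
```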
58. Statistical transformations
Bar charts seem simple, but they are interesting because they
reveal something subtle about plots. Consider a basic bar chart,
as drawn with geom_bar().
The following chart displays the total number of diamonds in the
diamonds dataset, grouped by cut.
The diamonds dataset comes in ggplot2 and contains
information about ~54,000 diamonds, including the price, carat,
color, clarity, and cut of each diamond. The chart shows that more
diamonds are available with high quality cuts than with low quality
cuts.
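The chart described above can be produced with:

```r
library(ggplot2)

# geom_bar() counts the rows in each cut group for us
# (its default stat is "count"), so no y aesthetic is needed
p <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
print(p)
```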
59. Position adjustments
There’s one more piece of magic associated with bar charts. You can colour a bar chart using either the colour aesthetic, or, more
usefully, fill:
60. The bars are automatically stacked. Each colored rectangle represents a combination of cut and clarity.
61. The identity position adjustment is more useful for 2d geoms, like points, where it is the default.
position = "fill" works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions
across groups.
62. position = "dodge" places overlapping objects directly beside one another. This makes it easier to compare individual values.
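The three position adjustments side by side, on the diamonds bar chart:

```r
library(ggplot2)

# Default: stacked bars; each colored rectangle is a cut/clarity combination
p_stack <- ggplot(diamonds) +
  geom_bar(aes(x = cut, fill = clarity))

# position = "fill": stacked bars of equal height, for comparing proportions
p_fill <- ggplot(diamonds) +
  geom_bar(aes(x = cut, fill = clarity), position = "fill")

# position = "dodge": bars placed beside one another, for comparing values
p_dodge <- ggplot(diamonds) +
  geom_bar(aes(x = cut, fill = clarity), position = "dodge")
print(p_dodge)
```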
63. Histogram
A histogram is a graphical representation that groups data points into successive numerical intervals (bins) and draws,
for each bin, a rectangle whose height is proportional to the frequency of the variable in that range. A special feature
is that it shows no gaps between the bars; otherwise it is similar to a vertical bar graph.
library(ggplot2)
# Change colors
ggplot(mpg, aes(x=displ)) +
geom_histogram(color="black", fill="white")
64. Coordinate systems
Coordinate systems are probably the most
complicated part of ggplot2.
The default coordinate system is the Cartesian
coordinate system where the x and y positions act
independently to determine the location of each
point. There are a number of other coordinate
systems that are occasionally helpful.
coord_flip() switches the x and y axes. This is useful
(for example), if you want horizontal boxplots. It’s
also useful for long labels: it’s hard to get them to
fit without overlapping on the x-axis.
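For example, horizontal boxplots of highway mileage by car class:

```r
library(ggplot2)

# coord_flip() switches the x and y axes, making the boxplots
# horizontal and keeping the long class labels readable
p <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot() +
  coord_flip()
print(p)
```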
66. coord_polar() uses polar coordinates. Polar coordinates reveal an interesting connection between a bar chart and a
Coxcomb chart.
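A sketch of that connection, starting from a bar chart of diamond cuts:

```r
library(ggplot2)

# A bar chart of cut counts...
bar <- ggplot(diamonds) +
  geom_bar(aes(x = cut, fill = cut), width = 1) +
  theme(aspect.ratio = 1)

# ...becomes a Coxcomb chart in polar coordinates
p <- bar + coord_polar()
print(p)
```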
67. Theme Layer
This layer controls the finer points of display like the font size and background color properties.
68. Dr. Rambabu Palaka
Professor
School of Engineering
Malla Reddy University, Hyderabad
Mobile: +91-9652665840
Email: drrambabu@mallareddyuniversity.ac.in
Reference:
R for Data Science (https://r4ds.had.co.nz)