Data wrangling with dplyr

Data Wrangling using dplyr
C. Tobin Magle, PhD
Based on http://www.datacarpentry.org/R-ecology-lesson/03-dplyr.html

The research cycle
• [Figure: cycle diagram with the labels Hypothesis, Experimental design, Raw data, Processing/Cleaning, Tidy Data, Analysis, Results, Article, Open Data, and Code]

Outline
• 6 verbs for data manipulation
• (select, filter, mutate, group_by, summarize, tally)
• Combining verbs with pipes %>%
• Cleaning and exporting data (is.na, write.csv)

Set up a working directory
• Start RStudio
• File > New project > New directory > Empty project
• Enter a name for this new folder and choose a convenient
location for it (working directory)
• Click on “Create project”
• Create a data folder in your working directory
• Create a new R script (File > New File > R script) and save it
in your working directory

(Down)loading data
• Download the file with download.file() (the data folder must already exist)
download.file("https://ndownloader.figshare.com/files/2292169",
              "data/portal_data_joined.csv")
• Read the data with read.csv()
surveys <- read.csv("data/portal_data_joined.csv")

Installing and loading packages
install.packages("dplyr")
• Installs the package
• One time only (on each computer)
library("dplyr")
• Loads the package
• Every time you start up R*
• *Unless you’re using a project.
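For scripts that move between computers, one common pattern is to install the package only when it is missing and then load it. This is a minimal sketch rather than part of the original slides; it assumes a CRAN mirror is already configured for the session.

```r
# Install dplyr only if it is not already available on this computer
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}

# Load the package for the current R session
library(dplyr)
```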

What is dplyr?
• A package that provides easy tools for data manipulation
• Built for data frames
• Key operations are written in C++ (so they run fast)
• Can work directly with external databases, which removes the requirement that all data be loaded into working memory

select()
• Selects columns from a data frame
• Arguments
• Data frame
• The columns you’d like to keep
• Example: select(surveys, plot_id, species_id, weight)
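To try select() without the full download, the sketch below uses a small made-up data frame. The name toy and all of its values are invented for illustration; only the column names are borrowed from surveys.

```r
library(dplyr)

# Made-up stand-in for surveys; the values are illustrative only
toy <- data.frame(
  plot_id    = c(2, 2, 3, 3),
  species_id = c("NL", "DM", "DM", "PF"),
  sex        = c("M", "F", "M", "F"),
  weight     = c(218, 40, 48, 7)
)

# Keep three columns; the sex column is dropped
kept <- select(toy, plot_id, species_id, weight)
print(names(kept))

# A minus sign drops a column and keeps everything else
no_sex <- select(toy, -sex)
print(names(no_sex))
```

Both calls leave toy itself untouched; select() returns a new data frame.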

filter()
• Choose rows based on a specific criterion
• Arguments:
• Data frame
• Logical condition, usually built with a relational operator (returns TRUE/FALSE)
• >, <, >=, <=, ==, !=
• Example: filter(surveys, year == 1995)
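The sketch below shows filter() on a small made-up data frame (toy and its values are hypothetical), including how several conditions combine.

```r
library(dplyr)

# Made-up stand-in for surveys; the values are illustrative only
toy <- data.frame(
  species_id = c("NL", "DM", "DM", "PF"),
  year       = c(1994, 1995, 1995, 1996),
  weight     = c(218, 40, 48, 7)
)

# Keep only the rows where the condition is TRUE
in_1995 <- filter(toy, year == 1995)
print(in_1995)

# Conditions separated by commas must all be TRUE (logical AND)
heavy_1995 <- filter(toy, year == 1995, weight > 45)
print(heavy_1995)
```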

Pipes %>%
• Pipes let you chain multiple “verb” operations
• Syntax: put %>% at the end of a line
• The output of one line becomes the input of the next line, and so on
• The final output is printed to the screen or assigned to a variable
• Example:
surveys %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight)
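To see what the pipe buys you, the sketch below writes the same filter-then-select step twice, once as nested calls and once with %>%. The toy data frame and its values are made up for illustration.

```r
library(dplyr)

# Made-up stand-in for surveys; the values are illustrative only
toy <- data.frame(
  species_id = c("NL", "DM", "PF", "PF"),
  sex        = c("M", "F", "F", "M"),
  weight     = c(218, 40, 4, 3)
)

# Nested calls: read from the inside out
nested <- select(filter(toy, weight < 5), species_id, sex, weight)

# Piped version: read from top to bottom
piped <- toy %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight)

print(piped)
```

Both versions return the same rows; the pipe only changes how the code reads.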

Exercise #1
• Using pipes, subset the survey data to include individuals
collected before 1995 and retain only the columns year, sex,
and weight.

mutate()
• Creates a new column, assigns a value
• Arguments:
• Data frame
• Name of new column = value
• Example: mutate(surveys, weight_kg = weight/1000)
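A minimal sketch of mutate() on a made-up data frame (toy and its values are hypothetical). It also shows that a column created in a mutate() call can be used later in the same call.

```r
library(dplyr)

# Made-up stand-in for surveys; weights are treated as grams,
# matching the weight / 1000 conversion used in the slide
toy <- data.frame(
  species_id = c("NL", "DM", "PF"),
  weight     = c(218, 40, 7)
)

converted <- toy %>%
  mutate(weight_kg         = weight / 1000,
         weight_kg_doubled = weight_kg * 2)  # uses the column created just above

print(converted)
```

mutate() returns a new data frame, so toy still has only its original two columns unless the result is assigned back to it.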

Exercise #2
• Create a new data frame from the survey data that
meets the following criteria:
1. contains only the species_id column and a new column
called hindfoot_half
2. hindfoot_half contains values that are half
the hindfoot_length values.
3. In this hindfoot_half column, there are no NAs and all
values are less than 30.
• Hint: think about how the commands should be ordered to
produce this data frame!

group_by()
• Groups data in the table by an attribute
• Arguments
• Data frame
• Factor variable to group by
• Example: group_by(surveys, sex)
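On its own, group_by() does not change any values; it only records which rows belong together so that later verbs work per group. A small sketch on made-up data (toy is hypothetical) makes this visible:

```r
library(dplyr)

# Made-up stand-in for surveys; the values are illustrative only
toy <- data.frame(
  sex    = c("M", "F", "M", "F"),
  weight = c(218, 40, 48, 7)
)

grouped <- group_by(toy, sex)

# The rows are unchanged; only grouping information is attached
print(nrow(grouped))        # same number of rows as toy
print(group_vars(grouped))  # name of the grouping column
print(n_groups(grouped))    # number of distinct groups
```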

summarize()
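• Applies a summary function to a variable, returning one row per group (or a single row for ungrouped data)
• Arguments
• Data frame
• Definition of a summary statistic
• Example: summarize(surveys, mean_weight = mean(weight, na.rm = TRUE))
• Works on a plain data frame or a tibble; tbl_df() is deprecated, so use as_tibble(surveys) if you want a tibble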
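The weight column in surveys contains missing values (the cleaning step later in these slides filters them out), so a plain mean() can come back as NA. The sketch below shows the difference on a made-up data frame; toy and its values are hypothetical, and it assumes a current dplyr release, where summarize() accepts a plain data frame.

```r
library(dplyr)

# Made-up stand-in for surveys; one weight is missing on purpose
toy <- data.frame(
  sex    = c("M", "F", "M", "F"),
  weight = c(218, 40, 48, NA)
)

# Without na.rm = TRUE, a single NA makes the mean NA
with_na <- summarize(toy, mean_weight = mean(weight))

# Dropping NAs first gives the mean of the three measured weights
without_na <- summarize(toy, mean_weight = mean(weight, na.rm = TRUE))

print(with_na)
print(without_na)
```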

Split-apply-combine w/summarize
• Calculate summary statistics based on a factor variable
• Arguments:
• Data frame
• Factor variable
• Definition of a summary statistic
• Output: a table with the summary statistic for each level of the factor
• Example:
grouped_surveys <- surveys %>%
  group_by(sex) %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE))
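The split-apply-combine pattern can be checked by hand on a data frame small enough to read at a glance. In this sketch, toy and its values are made up; n_measured is an extra illustrative column, not something the slides define.

```r
library(dplyr)

# Made-up stand-in for surveys; one weight is missing on purpose
toy <- data.frame(
  sex    = c("M", "F", "M", "F"),
  weight = c(218, 40, 48, NA)
)

grouped_toy <- toy %>%
  group_by(sex) %>%                                        # split by sex
  summarize(mean_weight = mean(weight, na.rm = TRUE),      # apply per group
            n_measured  = sum(!is.na(weight)))             # non-missing weights

print(grouped_toy)  # combine: one row per sex
```

The result has one row for F and one for M, with the missing weight ignored in the F group.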

tally()
• Counts the number of observations for each level of a factor
• Arguments
• Data frame
• Factor variable
• Example:
surveys %>%
  group_by(sex) %>%
  tally()
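A small sketch of the counting step on made-up data (toy is hypothetical). It also uses count(), a dplyr shortcut not covered in these slides that groups and tallies in one call.

```r
library(dplyr)

# Made-up stand-in for surveys; the values are illustrative only
toy <- data.frame(sex = c("M", "F", "M", "F", "M"))

# Group first, then count the rows in each group
by_tally <- toy %>%
  group_by(sex) %>%
  tally()

# count() does the grouping and the tally in a single step
by_count <- toy %>%
  count(sex)

print(by_tally)
print(by_count)
```

In both results the counts are stored in a column named n.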

Exercise #3
• How many individuals were caught in each plot_type surveyed?
• Use group_by() and summarize() to find the mean, min, and max
hindfoot length for each species (using species_id).
• What was the heaviest animal measured in each year? Return the
columns year, genus, species_id, and weight.
• You saw above how to count the number of individuals of
each sex using a combination of group_by() and tally(). How could
you get the same result using group_by() and summarize()?
• Hint: see ?n.

Data cleaning: remove NA
surveys_complete <- surveys %>%
  filter(species_id != "",         # remove missing species_id
         !is.na(weight),           # remove missing weight
         !is.na(hindfoot_length),  # remove missing hindfoot_length
         sex != "")                # remove missing sex
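The filter above mixes two kinds of missing value: empty strings in text columns and NA in numeric columns. The sketch below applies the same conditions to a made-up data frame (toy and its values are hypothetical) in which each row but the first fails a different test.

```r
library(dplyr)

# Made-up stand-in for surveys; rows 2 to 4 each have one problem
toy <- data.frame(
  species_id      = c("DM", "",  "PF", "NL"),
  sex             = c("M",  "F", "",   "F"),
  weight          = c(40,   35,  7,    NA),
  hindfoot_length = c(36,   35,  15,   32)
)

toy_complete <- toy %>%
  filter(species_id != "",         # drops row 2 (empty species_id)
         !is.na(weight),           # drops row 4 (missing weight)
         !is.na(hindfoot_length),  # no row fails this test here
         sex != "")                # drops row 3 (empty sex)

print(toy_complete)
```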

Data cleaning: eliminate rare species
## Extract the most common species_id
species_counts <- surveys_complete %>%
  group_by(species_id) %>%
  tally() %>%
  filter(n >= 50)
## Only keep the most common species
surveys_complete <- surveys_complete %>%
  filter(species_id %in% species_counts$species_id)
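The two-step pattern above (count, then keep rows whose species passes the count) can be traced on a tiny example. Here toy, its values, and the threshold min_count are all made up; the slides use a threshold of 50 on the real data.

```r
library(dplyr)

# Made-up stand-in for surveys_complete; the values are illustrative only
toy <- data.frame(
  species_id = c("DM", "DM", "DM", "PF", "PF", "NL")
)

min_count <- 2  # hypothetical threshold chosen to suit six rows

# Step 1: count each species and keep the ones seen often enough
species_counts <- toy %>%
  group_by(species_id) %>%
  tally() %>%
  filter(n >= min_count)

# Step 2: keep only the rows that belong to those species
toy_common <- toy %>%
  filter(species_id %in% species_counts$species_id)

print(species_counts)
print(toy_common)
```

NL appears only once, so it is dropped and five of the six rows remain.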

write.csv()
• Writes a data table to a file
• Arguments:
• Data frame
• Output file
• Whether to include row names (optional)
• Example:
write.csv(surveys_complete,
          file = "surveys_complete.csv",
          row.names = FALSE)
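To see what the row.names argument changes, the sketch below writes a made-up data frame to a temporary file and reads it back. The toy data and the temporary path are illustrative and not part of the lesson.

```r
# Made-up data; the values are illustrative only
toy <- data.frame(
  species_id = c("DM", "PF"),
  weight     = c(40, 7)
)

out_file <- tempfile(fileext = ".csv")

# row.names = FALSE writes only the data columns
write.csv(toy, file = out_file, row.names = FALSE)
round_trip <- read.csv(out_file)
print(names(round_trip))

# With the default (row.names = TRUE), the row names are written as an
# extra first column, which read.csv() then names "X"
write.csv(toy, file = out_file)
with_row_names <- read.csv(out_file)
print(names(with_row_names))
```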

Need help?
• Email: tobin.magle@colostate.edu
• Data Management Services website:
http://lib.colostate.edu/services/data-management
• Data Carpentry: http://www.datacarpentry.org/
• R Ecology Lesson:
http://www.datacarpentry.org/R-ecology-lesson/03-dplyr.html
• Data wrangling cheat sheet: http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
