Learn how to manipulate data frames using the dplyr package by Hadley Wickham. This session covers select, filter, summarize, tally, group_by, and mutate. Based on the Data Carpentry ecology lessons.
3. Outline
• 6 verbs for data manipulation
• (select, filter, mutate, group_by, summarize, tally)
• Combining verbs with pipes %>%
• Cleaning and exporting data (is.na, write.csv)
4. Set up a working directory
• Start RStudio
• File > New project > New directory > Empty project
• Enter a name for this new folder and choose a convenient
location for it (working directory)
• Click on “Create project”
• Create a data folder in your working directory
• Create a new R script (File > New File > R script) and save it
in your working directory
5. (Down)loading data
• Can download using download.file
• download.file("https://ndownloader.figshare.com/files/2292169",
"data/portal_data_joined.csv")
• Read data using read.csv function
• surveys <- read.csv('data/portal_data_joined.csv')
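download.file() fails if the destination folder does not exist, and it downloads the file again on every run. A minimal sketch of one way around both issues follows; the helper name fetch_if_missing is hypothetical and not part of the lesson.

```r
# Hypothetical helper (not from the lesson): download a file only if it
# is not already on disk, creating the destination folder when needed.
fetch_if_missing <- function(url, destfile) {
  dir.create(dirname(destfile), showWarnings = FALSE, recursive = TRUE)
  if (!file.exists(destfile)) {
    download.file(url, destfile, mode = "wb")
  }
  invisible(destfile)
}

# Usage with the URL from the slide (needs an internet connection):
# fetch_if_missing("https://ndownloader.figshare.com/files/2292169",
#                  "data/portal_data_joined.csv")
# surveys <- read.csv("data/portal_data_joined.csv")
```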
6. Installing and loading packages
install.packages("dplyr")
• Installs the package
• One time only (on each
computer)
library("dplyr")
• Loads the package
• Every time you start a new R session
• This includes reopening an RStudio project: a restored workspace brings back objects, not attached packages
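One common pattern, shown here only as a sketch, is to install the package when it is missing and then load it, so the same script works on a new computer and on later runs.

```r
# Install dplyr only if it is not already available on this computer
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}

# Attach the package for the current R session
library("dplyr")
```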
7. What is dplyr?
• A package that provides easy tools for data manipulation
• Built for data frames
• Key parts are written in C++ (so it’s fast)
• Can work directly with external databases (through the dbplyr package), which removes the
limitation that all data must be loaded into working memory
8. select()
• Selects columns from a data frame
• Arguments
• Data frame
• The columns you’d like to keep
• Example: select(surveys, plot_id, species_id, weight)
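As a small illustration, here is select() on a toy data frame. The toy table is hypothetical and only mimics a few columns of the surveys data.

```r
library(dplyr)

# Hypothetical toy data standing in for the surveys table
toy <- data.frame(
  plot_id = c(2, 2, 3, 7),
  species_id = c("NL", "DM", "DM", "PF"),
  sex = c("M", "F", "M", "F"),
  weight = c(NA, 40, 48, 4)
)

# Keep only the named columns, in the order given
kept <- select(toy, plot_id, species_id, weight)
names(kept)

# A minus sign drops a column instead of keeping it
dropped <- select(toy, -sex)
names(dropped)
```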
9. filter()
• Choose rows based on a specific criterion
• Arguments:
• Data frame
• Logical expression (returns TRUE/FALSE for each row)
• >, <, >=, <=, ==, !=
• Example: filter(surveys, year == 1995)
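A brief sketch of filter() on hypothetical toy data. It also shows that rows where the condition evaluates to NA are dropped, which matters for columns with missing values such as weight.

```r
library(dplyr)

# Hypothetical toy data standing in for the surveys table
toy <- data.frame(
  species_id = c("NL", "DM", "DM", "PF"),
  year = c(1977, 1995, 1995, 2001),
  weight = c(NA, 40, 48, 4)
)

# Rows from 1995 only
from_1995 <- filter(toy, year == 1995)

# The NA weight yields an NA comparison, so that row is not returned
light <- filter(toy, weight < 5)
```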
10. Pipes %>%
• Allows you to combine multiple “verb” operations
• Syntax: %>% at the end of the line
• Output of the first line becomes the input of the next line, etc.
• Final output to the screen or a variable
• Example: surveys %>%
    filter(weight < 5) %>%
    select(species_id, sex, weight)
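To make the pipe concrete, the sketch below runs the same two steps as a pipeline and as nested function calls on hypothetical toy data, then checks that both give the same result.

```r
library(dplyr)

# Hypothetical toy data standing in for the surveys table
toy <- data.frame(
  species_id = c("NL", "DM", "DM", "PF"),
  sex = c("M", "F", "M", "F"),
  year = c(1977, 1995, 1995, 2001),
  weight = c(NA, 40, 48, 4)
)

# Piped version: read top to bottom
piped <- toy %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight)

# Nested version: read inside out
nested <- select(filter(toy, weight < 5), species_id, sex, weight)

identical(piped, nested)
```

The pipe passes the left-hand result as the first argument of the next function, which is why the data frame argument is omitted inside the pipeline.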
11. Exercise #1
• Using pipes, subset the survey data to include individuals
collected before 1995 and retain only the columns year, sex,
and weight.
12. mutate()
• Creates a new column, assigns a value
• Arguments:
• Data frame
• Name of new column = value
• Example: mutate(surveys, weight_kg = weight/1000)
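A short sketch of mutate() on hypothetical toy data. It also shows that a column created in a mutate() call can be used later in the same call; the column name weight_kg_2 is illustrative only.

```r
library(dplyr)

# Hypothetical toy data standing in for the surveys table
toy <- data.frame(
  species_id = c("NL", "DM", "PF"),
  weight = c(NA, 40, 4)
)

# weight is in grams; add a kilogram column and a second column built from it
with_kg <- mutate(toy,
                  weight_kg = weight / 1000,
                  weight_kg_2 = weight_kg * 2)

# Missing weights stay missing in the new columns
with_kg
```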
13. Exercise #2
• Create a new data frame from the survey data that
meets the following criteria:
1. contains only the species_id column and a new column
called hindfoot_half
2. hindfoot_half contains values that are half
the hindfoot_length values.
3. In this hindfoot_half column, there are no NAs and all
values are less than 30.
• Hint: think about how the commands should be ordered to
produce this data frame!
14. group_by()
• Groups data in the table by an attribute
• Arguments
• Data frame
• Factor variable to group by
• Example: group_by(surveys, sex)
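On its own, group_by() does not change the rows or columns; it only records the grouping so that later verbs work per group. A minimal sketch on hypothetical toy data:

```r
library(dplyr)

# Hypothetical toy data standing in for the surveys table
toy <- data.frame(
  sex = c("M", "F", "M", "F"),
  weight = c(NA, 40, 50, 6)
)

by_sex <- group_by(toy, sex)

nrow(by_sex) == nrow(toy)  # the data itself is unchanged
group_vars(by_sex)         # name of the grouping column
n_groups(by_sex)           # number of distinct groups

# ungroup() removes the grouping again
plain <- ungroup(by_sex)
```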
15. summarize()
• Applies a function to a variable
• Arguments
• Data frame
• Definition of a summary statistic
• Example: summarize(surveys, mean_weight = mean(weight, na.rm = TRUE))
• Works on a plain data frame; converting to a tibble with as_tibble(surveys) is optional (tbl_df() is deprecated)
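A minimal sketch of summarize() on a whole table, using a hypothetical toy data frame. It shows why na.rm = TRUE matters when a column contains missing values.

```r
library(dplyr)

# Hypothetical toy data standing in for the surveys table
toy <- data.frame(
  sex = c("M", "F", "M", "F"),
  weight = c(NA, 40, 50, 6)
)

# Without na.rm, a single NA makes the mean NA
no_rm <- summarize(toy, mean_weight = mean(weight))

# With na.rm = TRUE, the mean uses the three non-missing weights
with_rm <- summarize(toy, mean_weight = mean(weight, na.rm = TRUE))
```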
16. Split-apply-combine w/summarize
• Calculate summary statistics based on a factor variable
• Arguments:
• Data frame
• Factor variable
• Definition of a summary statistic
• Output: a table with one row per group, holding the summary statistic for that group
• Example: grouped_surveys <- surveys %>%
    group_by(sex) %>%
    summarize(mean_weight = mean(weight, na.rm = TRUE))
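The split-apply-combine pattern can be sketched on hypothetical toy data as follows. Several summary statistics can be computed in one summarize() call; the column name max_weight is illustrative.

```r
library(dplyr)

# Hypothetical toy data standing in for the surveys table
toy <- data.frame(
  sex = c("M", "F", "M", "F"),
  weight = c(NA, 40, 50, 6)
)

# Split by sex, apply mean() and max() to each group, combine into one table
by_sex_summary <- toy %>%
  group_by(sex) %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE),
            max_weight = max(weight, na.rm = TRUE))

by_sex_summary
```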
17. tally()
• Counts the number of observations in each group
• Arguments
• Data frame
• Factor variable
• Example: surveys %>%
    group_by(sex) %>%
    tally()
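A small sketch of tally() on hypothetical toy data. dplyr also provides count(), which does the grouping and tallying in one step; it is shown here only for comparison and is not part of the slides.

```r
library(dplyr)

# Hypothetical toy data standing in for the surveys table
toy <- data.frame(
  sex = c("M", "F", "M", "M"),
  weight = c(NA, 40, 50, 6)
)

# Group first, then count the rows in each group (result column is n)
tallied <- toy %>%
  group_by(sex) %>%
  tally()

# count() groups and tallies in a single call
counted <- toy %>%
  count(sex)
```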
18. Exercise #3
• How many individuals were caught in each plot_type surveyed?
• Use group_by() and summarize() to find the mean, min, and max
hindfoot length for each species (using species_id).
• What was the heaviest animal measured in each year? Return the
columns year, genus, species_id, and weight.
• You saw above how to count the number of individuals of
each sex using a combination of group_by() and tally(). How could
you get the same result using group_by() and summarize()?
• Hint: see ?n.
19. Data cleaning: remove NA
surveys_complete <- surveys %>%
  filter(species_id != "",          # remove missing species_id
         !is.na(weight),            # remove missing weight
         !is.na(hindfoot_length),   # remove missing hindfoot_length
         sex != "")                 # remove missing sex
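The same cleaning step can be tried on a hypothetical toy data frame in which each row but one has a different kind of missing value. Conditions separated by commas inside filter() are combined with AND.

```r
library(dplyr)

# Hypothetical toy data: rows 2 to 4 each have one missing field
toy <- data.frame(
  species_id = c("DM", "", "PF", "DM"),
  sex = c("M", "F", "", "F"),
  hindfoot_length = c(35, 20, 15, NA),
  weight = c(40, 10, 7, 44),
  stringsAsFactors = FALSE
)

toy_complete <- toy %>%
  filter(species_id != "",         # drop blank species_id
         !is.na(weight),           # drop missing weight
         !is.na(hindfoot_length),  # drop missing hindfoot_length
         sex != "")                # drop blank sex

# Only the fully recorded row remains
toy_complete
```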
20. Data cleaning: eliminate rare species
## Extract the most common species_id
species_counts <- surveys_complete %>%
  group_by(species_id) %>%
  tally() %>%
  filter(n >= 50)
## Only keep the most common species
surveys_complete <- surveys_complete %>%
  filter(species_id %in% species_counts$species_id)
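The two-step approach can be sketched on hypothetical toy data. The cutoff of 2 is an assumption chosen to suit the tiny table; the slide uses 50 for the real data.

```r
library(dplyr)

# Hypothetical toy data: DM appears 3 times, NL twice, PF once
toy <- data.frame(
  species_id = c("DM", "DM", "DM", "PF", "NL", "NL"),
  weight = c(40, 42, 44, 7, 150, 160)
)

# Step 1: count records per species and keep species at or above the cutoff
common <- toy %>%
  group_by(species_id) %>%
  tally() %>%
  filter(n >= 2)

# Step 2: keep only rows whose species_id is in that list
toy_common <- toy %>%
  filter(species_id %in% common$species_id)
```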
21. write.csv()
• Writes a data table to a file
• Arguments:
• Data frame
• Output file
• Whether to include row names (optional)
• Example: write.csv(surveys_complete,
    file = "surveys_complete.csv",
    row.names = FALSE)
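To see what row.names = FALSE changes, the sketch below writes a hypothetical toy data frame to temporary files and reads it back. The file names and the temporary folder are assumptions made for the example.

```r
# Hypothetical toy data standing in for surveys_complete
toy <- data.frame(
  species_id = c("DM", "PF"),
  weight = c(40, 7)
)

# Write without row names, then read the file back
out <- file.path(tempdir(), "toy_complete.csv")
write.csv(toy, file = out, row.names = FALSE)
back <- read.csv(out)
names(back)

# With the default (row.names = TRUE), the row numbers are saved as an
# extra unnamed column, which read.csv() names "X"
out_rn <- file.path(tempdir(), "toy_with_rownames.csv")
write.csv(toy, file = out_rn)
back_rn <- read.csv(out_rn)
names(back_rn)
```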
22. Need help?
• Email: tobin.magle@colostate.edu
• Data Management Services website:
http://lib.colostate.edu/services/data-management
• Data Carpentry: http://www.datacarpentry.org/
• R Ecology Lesson:
http://www.datacarpentry.org/R-ecology-lesson/03-dplyr.html
• Data wrangling cheat sheet:
http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf