SlideShare une entreprise Scribd logo
1  sur  30
Télécharger pour lire hors ligne
Stat405
     Advanced data manipulation 2


                              Hadley Wickham
Thursday, 30 September 2010
1. Colloquium Monday
               2. String basics
               3. Group-wise transformations
               4. Practice challenges




Thursday, 30 September 2010
Colloquium on
           Summer research experiences


                              When: Monday, 4:00 - 5:00
                              Where: DH 1070

                              Coffee and Cookies will be
                                served ahead of time


Thursday, 30 September 2010
Speakers

               1. James Rigby -summer institute in bioinformatics
               2. Gabi Quart - internship at Deutche Bank
               3. Liz Jackson - Survey methodology summer program
               4. Ollie McDonald - internship at Novartis in
                  Switzerland
               5. Christine Peterson - research in the Med center
               6. Joseph Egbulefu - research at Rice



Thursday, 30 September 2010
String
                              basics
Thursday, 30 September 2010
install.packages("stringr")
     library(stringr)

     str_length("Hadley")
     str_c(letters, LETTERS)
     str_c(letters, LETTERS, sep = " ")
     str_c("H", "a", "d", "l", "e", "y", collapse = "")

     tolower("Hadley")
     toupper("Hadley")

     str_sub("Hadley", 1, 3)
     str_sub("Hadley", -1)


Thursday, 30 September 2010
Your turn

                   Add new columns that give:
                   The length of each name
                   The first letter of the name
                   The last letter of the name




Thursday, 30 September 2010
library(stringr)

     bnames <- read.csv("baby-names2.csv.bz2",
       stringsAsFactors = FALSE)

     bnames <- transform(bnames,
       length = str_length(name),
       first = str_sub(name, 1, 1),
       last = str_sub(name, -1, -1))




Thursday, 30 September 2010
Your turn


                   Explore how the average length of names
                   has changed over time (for each sex)




Thursday, 30 September 2010
library(plyr)
     sy <- ddply(bnames, c("sex", "year"), summarise,
       avg_length = weighted.mean(length, prop))

     library(ggplot2)
     qplot(year, avg_length, data = sy, colour = sex,
       geom = "line")




Thursday, 30 September 2010
# Another approach
     syl <- ddply(bnames, c("sex", "length", "year"),
       summarise, prop = sum(prop))
     qplot(year, prop, data = syl, colour = sex,
       geom = "line") + facet_wrap(~ length)




Thursday, 30 September 2010
Transformations


Thursday, 30 September 2010
Transformations

                   What about group-wise
                   transformations? e.g. what if we want to
                   compute the rank of a name within a sex
                   and year?
                   This task is easy if we have a single year
                   & sex, but hard otherwise.



Thursday, 30 September 2010
Transformations

                   What about group-wise
                   transformations? e.g. what if we want to
                   compute the rank of a name within a sex
                   and year?
                   This task is easy if we have a single year
                   & sex, but hard otherwise.


                         How would you do it for a single group?
Thursday, 30 September 2010
one <- subset(bnames, sex == "boy" & year == 2008)
     one$rank <- rank(-one$prop,
       ties.method = "first")

     # or
     one <- transform(one,
       rank = rank(-prop, ties.method = "first"))
     head(one)




                              What if we want to transform
                              every sex and year?
Thursday, 30 September 2010
Workflow

                   1. Extract a single group
                   2. Figure out how to solve it for just that
                      group
                   3. Use ddply to solve it for all groups




Thursday, 30 September 2010
Workflow

                   1. Extract a single group
                   2. Figure out how to solve it for just that
                      group
                   3. Use ddply to solve it for all groups



             How would you use ddply to calculate all ranks?
Thursday, 30 September 2010
bnames <- ddply(bnames, c("sex", "year"), transform,
       rank = rank(-prop, ties.method = "first"))




Thursday, 30 September 2010
ddply + transform =
                              group-wise transformation

                                ddply + summarise =
                                per-group summaries

                                 ddply + subset =
                                 per-group subsets


Thursday, 30 September 2010
Tools

                   You know have all the tools to solve 95%
                   of data manipulation problems in R. It’s
                   just a matter of figuring out which tools to
                   use, and how to combine them.
                   The following challenges will give you
                   some practice.



Thursday, 30 September 2010
Challenges

Thursday, 30 September 2010
Warmups

                   Which names were most popular in 1999?
                   Work out the average proportion for each
                   name.
                   List the 10 names with the highest
                   average proportions.



Thursday, 30 September 2010
# Which names were most popular in 1999?
     subset(bnames, year == 1999 & rank < 10)
     subset(bnames, year == 1999 & prop == max(prop))

     # Average usage
     overall <- ddply(bnames, "name", summarise,
       prop = mean(prop))

     # Top 10 names
     head(arrange(overall, desc(prop)), 10)




Thursday, 30 September 2010
Challenge 1

                   How has the total proportion of babies
                   with names in the top 1000 changed over
                   time?
                   How has the popularity of different initials
                   changed over time?




Thursday, 30 September 2010
sy <- ddply(bnames, c("year","sex"), summarise,
       prop = sum(prop),
       npop = sum(prop > 1/1000))

     qplot(year, prop, data = sy, colour = sex,
       geom = "line")
     qplot(year, npop, data = sy, colour = sex,
       geom = "line")




Thursday, 30 September 2010
Challenge 2

                   For each name, find the year in which it
                   was most popular, and the rank in that
                   year. (Hint: you might find which.max
                   useful).
                   Print all names that have been the most
                   popular name at least once.



Thursday, 30 September 2010
most_pop <- ddply(bnames, "name", summarise,
       year = year[which.max(prop)],
       rank = min(rank))
     most_pop <- ddply(bnames, "name", subset,
       prop == max(prop))

     subset(most_pop, rank == 1)

     # Double challenge: Why is this last one wrong?




Thursday, 30 September 2010
Challenge 3

                   What name has been in the top 10 most
                   often?
                   (Hint: you'll have to do this in three steps.
                   Think about what they are before starting)




Thursday, 30 September 2010
top10 <- subset(bnames, rank <= 10)
     counts <- count(top10, c("sex", "name"))

     ddply(counts, "sex", subset, freq == max(freq))
     head(arrange(counts, desc(freq)), 10)




Thursday, 30 September 2010
Homework

                   No homework this week.
                   Use what you’ve learned to make your
                   projects even better!




Thursday, 30 September 2010

Contenu connexe

Similaire à 12 adv-manip

Presentationskillsforteachersversion3 0-100830062322-phpapp01
Presentationskillsforteachersversion3 0-100830062322-phpapp01Presentationskillsforteachersversion3 0-100830062322-phpapp01
Presentationskillsforteachersversion3 0-100830062322-phpapp01Suzanne Ott
 
Presentation Skills for Teachers version 3.0
Presentation Skills for Teachers  version 3.0Presentation Skills for Teachers  version 3.0
Presentation Skills for Teachers version 3.0Simon Jones
 
Creating a Successful School Blog
Creating a Successful School BlogCreating a Successful School Blog
Creating a Successful School BlogPeter Baron
 
MongoDB & Mongomapper 4 real
MongoDB & Mongomapper 4 realMongoDB & Mongomapper 4 real
MongoDB & Mongomapper 4 realjan_mindmatters
 

Similaire à 12 adv-manip (8)

13 case-study
13 case-study13 case-study
13 case-study
 
12 Ddply
12 Ddply12 Ddply
12 Ddply
 
Presentationskillsforteachersversion3 0-100830062322-phpapp01
Presentationskillsforteachersversion3 0-100830062322-phpapp01Presentationskillsforteachersversion3 0-100830062322-phpapp01
Presentationskillsforteachersversion3 0-100830062322-phpapp01
 
Presentation Skills for Teachers version 3.0
Presentation Skills for Teachers  version 3.0Presentation Skills for Teachers  version 3.0
Presentation Skills for Teachers version 3.0
 
Creating a Successful School Blog
Creating a Successful School BlogCreating a Successful School Blog
Creating a Successful School Blog
 
MongoDB & Mongomapper 4 real
MongoDB & Mongomapper 4 realMongoDB & Mongomapper 4 real
MongoDB & Mongomapper 4 real
 
11 adv-manip
11 adv-manip11 adv-manip
11 adv-manip
 
11 adv-manip
11 adv-manip11 adv-manip
11 adv-manip
 

Plus de Hadley Wickham (20)

27 development
27 development27 development
27 development
 
27 development
27 development27 development
27 development
 
24 modelling
24 modelling24 modelling
24 modelling
 
23 data-structures
23 data-structures23 data-structures
23 data-structures
 
Graphical inference
Graphical inferenceGraphical inference
Graphical inference
 
R packages
R packagesR packages
R packages
 
22 spam
22 spam22 spam
22 spam
 
21 spam
21 spam21 spam
21 spam
 
19 tables
19 tables19 tables
19 tables
 
18 cleaning
18 cleaning18 cleaning
18 cleaning
 
17 polishing
17 polishing17 polishing
17 polishing
 
15 time-space
15 time-space15 time-space
15 time-space
 
14 case-study
14 case-study14 case-study
14 case-study
 
10 simulation
10 simulation10 simulation
10 simulation
 
10 simulation
10 simulation10 simulation
10 simulation
 
09 bootstrapping
09 bootstrapping09 bootstrapping
09 bootstrapping
 
08 functions
08 functions08 functions
08 functions
 
07 problem-solving
07 problem-solving07 problem-solving
07 problem-solving
 
06 data
06 data06 data
06 data
 
05 subsetting
05 subsetting05 subsetting
05 subsetting
 

12 adv-manip

  • 1. Stat405 Advanced data manipulation 2 Hadley Wickham Thursday, 30 September 2010
  • 2. 1. Colloquium Monday 2. String basics 3. Group-wise transformations 4. Practice challenges Thursday, 30 September 2010
  • 3. Colloquium on Summer research experiences When: Monday, 4:00 - 5:00 Where: DH 1070 Coffee and Cookies will be served ahead of time Thursday, 30 September 2010
  • 4. Speakers 1. James Rigby -summer institute in bioinformatics 2. Gabi Quart - internship at Deutche Bank 3. Liz Jackson - Survey methodology summer program 4. Ollie McDonald - internship at Novartis in Switzerland 5. Christine Peterson - research in the Med center 6. Joseph Egbulefu - research at Rice Thursday, 30 September 2010
  • 5. String basics Thursday, 30 September 2010
  • 6. install.packages("stringr") library(stringr) str_length("Hadley") str_c(letters, LETTERS) str_c(letters, LETTERS, sep = " ") str_c("H", "a", "d", "l", "e", "y", collapse = "") tolower("Hadley") toupper("Hadley") str_sub("Hadley", 1, 3) str_sub("Hadley", -1) Thursday, 30 September 2010
  • 7. Your turn Add new columns that give: The length of each name The first letter of the name The last letter of the name Thursday, 30 September 2010
  • 8. library(stringr) bnames <- read.csv("baby-names2.csv.bz2", stringsAsFactors = FALSE) bnames <- transform(bnames, length = str_length(name), first = str_sub(name, 1, 1), last = str_sub(name, -1, -1)) Thursday, 30 September 2010
  • 9. Your turn Explore how the average length of names has changed over time (for each sex) Thursday, 30 September 2010
  • 10. library(plyr) sy <- ddply(bnames, c("sex", "year"), summarise, avg_length = weighted.mean(length, prop)) library(ggplot2) qplot(year, avg_length, data = sy, colour = sex, geom = "line") Thursday, 30 September 2010
  • 11. # Another approach syl <- ddply(bnames, c("sex", "length", "year"), summarise, prop = sum(prop)) qplot(year, prop, data = syl, colour = sex, geom = "line") + facet_wrap(~ length) Thursday, 30 September 2010
  • 13. Transformations What about group-wise transformations? e.g. what if we want to compute the rank of a name within a sex and year? This task is easy if we have a single year & sex, but hard otherwise. Thursday, 30 September 2010
  • 14. Transformations What about group-wise transformations? e.g. what if we want to compute the rank of a name within a sex and year? This task is easy if we have a single year & sex, but hard otherwise. How would you do it for a single group? Thursday, 30 September 2010
  • 15. one <- subset(bnames, sex == "boy" & year == 2008) one$rank <- rank(-one$prop, ties.method = "first") # or one <- transform(one, rank = rank(-prop, ties.method = "first")) head(one) What if we want to transform every sex and year? Thursday, 30 September 2010
  • 16. Workflow 1. Extract a single group 2. Figure out how to solve it for just that group 3. Use ddply to solve it for all groups Thursday, 30 September 2010
  • 17. Workflow 1. Extract a single group 2. Figure out how to solve it for just that group 3. Use ddply to solve it for all groups How would you use ddply to calculate all ranks? Thursday, 30 September 2010
  • 18. bnames <- ddply(bnames, c("sex", "year"), transform, rank = rank(-prop, ties.method = "first")) Thursday, 30 September 2010
  • 19. ddply + transform = group-wise transformation ddply + summarise = per-group summaries ddply + subset = per-group subsets Thursday, 30 September 2010
  • 20. Tools You know have all the tools to solve 95% of data manipulation problems in R. It’s just a matter of figuring out which tools to use, and how to combine them. The following challenges will give you some practice. Thursday, 30 September 2010
  • 22. Warmups Which names were most popular in 1999? Work out the average proportion for each name. List the 10 names with the highest average proportions. Thursday, 30 September 2010
  • 23. # Which names were most popular in 1999? subset(bnames, year == 1999 & rank < 10) subset(bnames, year == 1999 & prop == max(prop)) # Average usage overall <- ddply(bnames, "name", summarise, prop = mean(prop)) # Top 10 names head(arrange(overall, desc(prop)), 10) Thursday, 30 September 2010
  • 24. Challenge 1 How has the total proportion of babies with names in the top 1000 changed over time? How has the popularity of different initials changed over time? Thursday, 30 September 2010
  • 25. sy <- ddply(bnames, c("year","sex"), summarise, prop = sum(prop), npop = sum(prop > 1/1000)) qplot(year, prop, data = sy, colour = sex, geom = "line") qplot(year, npop, data = sy, colour = sex, geom = "line") Thursday, 30 September 2010
  • 26. Challenge 2 For each name, find the year in which it was most popular, and the rank in that year. (Hint: you might find which.max useful). Print all names that have been the most popular name at least once. Thursday, 30 September 2010
  • 27. most_pop <- ddply(bnames, "name", summarise, year = year[which.max(prop)], rank = min(rank)) most_pop <- ddply(bnames, "name", subset, prop == max(prop)) subset(most_pop, rank == 1) # Double challenge: Why is this last one wrong? Thursday, 30 September 2010
  • 28. Challenge 3 What name has been in the top 10 most often? (Hint: you'll have to do this in three steps. Think about what they are before starting) Thursday, 30 September 2010
  • 29. top10 <- subset(bnames, rank <= 10) counts <- count(top10, c("sex", "name")) ddply(counts, "sex", subset, freq == max(freq)) head(arrange(counts, desc(freq)), 10) Thursday, 30 September 2010
  • 30. Homework No homework this week. Use what you’ve learned to make your projects even better! Thursday, 30 September 2010