SlideShare a Scribd company logo
1 of 20
Download to read offline
plyr
                       US baby names


                        Hadley Wickham
Tuesday, 7 July 2009
1. Introduction to the data
                2. Transformations and summaries
                3. Group-wise transformation and
                   summary
                4. Variable selection syntax
                5. Challenge


Tuesday, 7 July 2009
Baby names
                       Top 1000 male and female baby
                       names in the US, from 1880 to
                       2008.
                       258,000 records (1000 * 2 * 129)
                       But only four variables: year,
                       name, sex and percent.

                                           CC BY http://www.flickr.com/photos/the_light_show/2586781132
Tuesday, 7 July 2009
> head(bnames, 15)               > tail(bnames, 15)
           year    name percent    sex          year     name    percent    sex
        1 1880     John 0.081541   boy   257986 2008   Neveah   0.000130   girl
        2 1880 William 0.080511    boy   257987 2008   Amaris   0.000129   girl
        3 1880    James 0.050057   boy   257988 2008 Hadassah   0.000129   girl
        4 1880 Charles 0.045167    boy   257989 2008    Dania   0.000129   girl
        5 1880 George 0.043292     boy   257990 2008   Hailie   0.000129   girl
        6 1880    Frank 0.027380   boy   257991 2008   Jamiya   0.000129   girl
        7 1880 Joseph 0.022229     boy   257992 2008    Kathy   0.000129   girl
        8 1880 Thomas 0.021401     boy   257993 2008   Laylah   0.000129   girl
        9 1880    Henry 0.020641   boy   257994 2008     Riya   0.000129   girl
        10 1880 Robert 0.020404    boy   257995 2008     Diya   0.000128   girl
        11 1880 Edward 0.019965    boy   257996 2008 Carleigh   0.000128   girl
        12 1880   Harry 0.018175   boy   257997 2008    Iyana   0.000128   girl
        13 1880 Walter 0.014822    boy   257998 2008   Kenley   0.000127   girl
        14 1880 Arthur 0.013504    boy   257999 2008   Sloane   0.000127   girl
        15 1880    Fred 0.013251   boy   258000 2008 Elianna    0.000127   girl



Tuesday, 7 July 2009
Brainstorm

                       What variables and summaries might you
                       want to generate from this data? What
                       questions would you like to be able to
                       answer about the data?
                       With your partner, you have 2 minutes to
                       come up with as many as possible.



Tuesday, 7 July 2009
Some of my ideas
                       • First/last letter   • Rank
                       • Length              • Ecdf (how many
                                               babies have a
                       • Number/percent
                                               name in the top
                         of vowels
                                               2, 3, 5, 100 etc)
                       • Biblical names?




Tuesday, 7 July 2009
Transform &
                               summarise
                       transform(df, var1 = expr1, ...)
                       summarise(df, var1 = expr1, ...)


                       Transform modifies an existing data
                       frame. Summarise creates a new data
                       frame.


Tuesday, 7 July 2009
letter <- function(x, n = 1) {     Many interesting
        if (n < 0) {
          nc <- nchar(x)                 transformations and
        }
          n <- nc + n + 1
                                         summaries can be
        tolower(substr(x, n, n))         calculated for the
      }
      vowels <- function(x) {            whole dataset
        nchar(gsub("[^aeiou]", "", x))
      }

      bnames <- transform(bnames,
        first = letter(name, 1),
        last = letter(name, -1),
        length = nchar(name),
        vowels = vowels(name)
      )

      summarise(bnames,
        max_perc = max(percent),
        min_perc = min(percent))

Tuesday, 7 July 2009
Group-wise

                       What about group-wise transformations
                       or summaries? e.g. what if we want to
                       compute the rank of a name within a sex
                       and year?
                       This task is easy if we have a single year
                       & sex, but hard otherwise.



Tuesday, 7 July 2009
one <- subset(bnames, sex == "boy" & year == 2008)
      one$rank <- rank(-one$percent,
        ties.method = "first")

      # or
      one <- transform(one,
        rank = rank(-percent, ties.method = "first"))
      head(one)




                             What if we want to transform
                             every sex and year?
Tuesday, 7 July 2009
# Split
      pieces <- split(bnames,
        list(bnames$sex, bnames$year))

      # Apply
      results <- vector("list", length(pieces))
      for(i in seq_along(pieces)) {
        piece <- pieces[[i]]
        piece <- transform(piece,
          rank = rank(-percent, ties.method = "first"))
        results[[i]] <- piece
      }

      # Combine
      result <- do.call("rbind", results)
Tuesday, 7 July 2009
# Or equivalently

      bnames <- ddply(bnames, c("sex", "year"), transform,
        rank = rank(-percent, ties.method = "first"))




Tuesday, 7 July 2009
Way to split   Function to apply to
                               Input data
                                             up input          each piece
      # Or equivalently

      bnames <- ddply(bnames, c("sex", "year"), transform,
        rank = rank(-percent, ties.method = "first"))

               2nd argument
              to transform()




Tuesday, 7 July 2009
ddply

                       • .data: data frame to process
                       • .variables: combination of variables to
                         split by
                       • .fun: function to call on each piece
                       • ...: extra arguments passed to .fun



Tuesday, 7 July 2009
Variable specification
                              syntax
                       • Character: c("sex", "year")
                       • Numeric: 1:3
                       • Formula: ~ sex + year
                       • Special:
                        • .(sex, year)
                        • .(first = letter(name, 1))


Tuesday, 7 July 2009
Match function with use
                                               randomisation/permutation
                          scale(x)
                                                         tests
                                               scale to [0, 1] within each
                          rank(x)
                                                          group
                                                 scale to mean 0, sd 1
                 x - min(x) / diff(range(x))
                                                   within each group
                                                  compute per-group
                          x / x[1]
                                                      rankings

                        sample(x)                 index a time series
Tuesday, 7 July 2009
Summaries
                       In a similar way, we can use ddply() for
                       group-wise summaries.
                       There are many base R functions for
                       special cases. Where available, these are
                       often much faster; but you have to know
                       they exist, and have to remember how to
                       use them.


Tuesday, 7 July 2009
ddply(bnames, c("name"), summarise,
        tot = sum(percent))
      ddply(bnames, c("length"), summarise,
        tot = sum(percent))
      ddply(bnames, c("year", "sex"), summarise,
        tot = sum(percent))

      fl <- ddply(bnames, c("year", "sex", "first"),
        summarise, tot = sum(percent))
      library(ggplot2)
      qplot(year, tot, data = fl, geom = "line",
        colour = sex, facets = ~ first)



Tuesday, 7 July 2009
Challenge
                       Create a plot that shows (by year) the
                       proportion of US children who have a
                       name in the top 100.
                       Extra challenge: break it down by sex.
                       What does this suggest about baby
                       naming trends in the US?



Tuesday, 7 July 2009
1.0




      0.8




      0.6

                                                                 sex
                                                                       boy
tot




                                                                       girl
      0.4




      0.2




      0.0

              1880     1900   1920   1940   1960   1980   2000
                                     year
Tuesday, 7 July 2009

More Related Content

More from Hadley Wickham (20)

27 development
27 development27 development
27 development
 
27 development
27 development27 development
27 development
 
24 modelling
24 modelling24 modelling
24 modelling
 
23 data-structures
23 data-structures23 data-structures
23 data-structures
 
Graphical inference
Graphical inferenceGraphical inference
Graphical inference
 
R packages
R packagesR packages
R packages
 
22 spam
22 spam22 spam
22 spam
 
21 spam
21 spam21 spam
21 spam
 
20 date-times
20 date-times20 date-times
20 date-times
 
19 tables
19 tables19 tables
19 tables
 
18 cleaning
18 cleaning18 cleaning
18 cleaning
 
17 polishing
17 polishing17 polishing
17 polishing
 
16 critique
16 critique16 critique
16 critique
 
15 time-space
15 time-space15 time-space
15 time-space
 
14 case-study
14 case-study14 case-study
14 case-study
 
13 case-study
13 case-study13 case-study
13 case-study
 
11 adv-manip
11 adv-manip11 adv-manip
11 adv-manip
 
11 adv-manip
11 adv-manip11 adv-manip
11 adv-manip
 
10 simulation
10 simulation10 simulation
10 simulation
 
10 simulation
10 simulation10 simulation
10 simulation
 

Recently uploaded

南新罕布什尔大学毕业证学位证成绩单-学历认证
南新罕布什尔大学毕业证学位证成绩单-学历认证南新罕布什尔大学毕业证学位证成绩单-学历认证
南新罕布什尔大学毕业证学位证成绩单-学历认证kbdhl05e
 
E J Waggoner against Kellogg's Pantheism 8.pptx
E J Waggoner against Kellogg's Pantheism 8.pptxE J Waggoner against Kellogg's Pantheism 8.pptx
E J Waggoner against Kellogg's Pantheism 8.pptxJackieSparrow3
 
Authentic No 1 Amil Baba In Pakistan Amil Baba In Faisalabad Amil Baba In Kar...
Authentic No 1 Amil Baba In Pakistan Amil Baba In Faisalabad Amil Baba In Kar...Authentic No 1 Amil Baba In Pakistan Amil Baba In Faisalabad Amil Baba In Kar...
Authentic No 1 Amil Baba In Pakistan Amil Baba In Faisalabad Amil Baba In Kar...Authentic No 1 Amil Baba In Pakistan
 
(南达科他州立大学毕业证学位证成绩单-永久存档)
(南达科他州立大学毕业证学位证成绩单-永久存档)(南达科他州立大学毕业证学位证成绩单-永久存档)
(南达科他州立大学毕业证学位证成绩单-永久存档)oannq
 
Inspiring Through Words Power of Inspiration.pptx
Inspiring Through Words Power of Inspiration.pptxInspiring Through Words Power of Inspiration.pptx
Inspiring Through Words Power of Inspiration.pptxShubham Rawat
 
Module-2-Lesson-2-COMMUNICATION-AIDS-AND-STRATEGIES-USING-TOOLS-OF-TECHNOLOGY...
Module-2-Lesson-2-COMMUNICATION-AIDS-AND-STRATEGIES-USING-TOOLS-OF-TECHNOLOGY...Module-2-Lesson-2-COMMUNICATION-AIDS-AND-STRATEGIES-USING-TOOLS-OF-TECHNOLOGY...
Module-2-Lesson-2-COMMUNICATION-AIDS-AND-STRATEGIES-USING-TOOLS-OF-TECHNOLOGY...JeylaisaManabat1
 

Recently uploaded (6)

南新罕布什尔大学毕业证学位证成绩单-学历认证
南新罕布什尔大学毕业证学位证成绩单-学历认证南新罕布什尔大学毕业证学位证成绩单-学历认证
南新罕布什尔大学毕业证学位证成绩单-学历认证
 
E J Waggoner against Kellogg's Pantheism 8.pptx
E J Waggoner against Kellogg's Pantheism 8.pptxE J Waggoner against Kellogg's Pantheism 8.pptx
E J Waggoner against Kellogg's Pantheism 8.pptx
 
Authentic No 1 Amil Baba In Pakistan Amil Baba In Faisalabad Amil Baba In Kar...
Authentic No 1 Amil Baba In Pakistan Amil Baba In Faisalabad Amil Baba In Kar...Authentic No 1 Amil Baba In Pakistan Amil Baba In Faisalabad Amil Baba In Kar...
Authentic No 1 Amil Baba In Pakistan Amil Baba In Faisalabad Amil Baba In Kar...
 
(南达科他州立大学毕业证学位证成绩单-永久存档)
(南达科他州立大学毕业证学位证成绩单-永久存档)(南达科他州立大学毕业证学位证成绩单-永久存档)
(南达科他州立大学毕业证学位证成绩单-永久存档)
 
Inspiring Through Words Power of Inspiration.pptx
Inspiring Through Words Power of Inspiration.pptxInspiring Through Words Power of Inspiration.pptx
Inspiring Through Words Power of Inspiration.pptx
 
Module-2-Lesson-2-COMMUNICATION-AIDS-AND-STRATEGIES-USING-TOOLS-OF-TECHNOLOGY...
Module-2-Lesson-2-COMMUNICATION-AIDS-AND-STRATEGIES-USING-TOOLS-OF-TECHNOLOGY...Module-2-Lesson-2-COMMUNICATION-AIDS-AND-STRATEGIES-USING-TOOLS-OF-TECHNOLOGY...
Module-2-Lesson-2-COMMUNICATION-AIDS-AND-STRATEGIES-USING-TOOLS-OF-TECHNOLOGY...
 

02 Ddply

  • 1. plyr US baby names Hadley Wickham Tuesday, 7 July 2009
  • 2. 1. Introduction to the data 2. Transformations and summaries 3. Group-wise transformation and summary 4. Variable selection syntax 5. Challenge Tuesday, 7 July 2009
  • 3. Baby names Top 1000 male and female baby names in the US, from 1880 to 2008. 258,000 records (1000 * 2 * 129) But only four variables: year, name, sex and percent. CC BY http://www.flickr.com/photos/the_light_show/2586781132 Tuesday, 7 July 2009
  • 4. > head(bnames, 15) > tail(bnames, 15) year name percent sex year name percent sex 1 1880 John 0.081541 boy 257986 2008 Neveah 0.000130 girl 2 1880 William 0.080511 boy 257987 2008 Amaris 0.000129 girl 3 1880 James 0.050057 boy 257988 2008 Hadassah 0.000129 girl 4 1880 Charles 0.045167 boy 257989 2008 Dania 0.000129 girl 5 1880 George 0.043292 boy 257990 2008 Hailie 0.000129 girl 6 1880 Frank 0.027380 boy 257991 2008 Jamiya 0.000129 girl 7 1880 Joseph 0.022229 boy 257992 2008 Kathy 0.000129 girl 8 1880 Thomas 0.021401 boy 257993 2008 Laylah 0.000129 girl 9 1880 Henry 0.020641 boy 257994 2008 Riya 0.000129 girl 10 1880 Robert 0.020404 boy 257995 2008 Diya 0.000128 girl 11 1880 Edward 0.019965 boy 257996 2008 Carleigh 0.000128 girl 12 1880 Harry 0.018175 boy 257997 2008 Iyana 0.000128 girl 13 1880 Walter 0.014822 boy 257998 2008 Kenley 0.000127 girl 14 1880 Arthur 0.013504 boy 257999 2008 Sloane 0.000127 girl 15 1880 Fred 0.013251 boy 258000 2008 Elianna 0.000127 girl Tuesday, 7 July 2009
  • 5. Brainstorm What variables and summaries might you want to generate from this data? What questions would you like to be able to answer about the data? With your partner, you have 2 minutes to come up with as many as possible. Tuesday, 7 July 2009
  • 6. Some of my ideas • First/last letter • Rank • Length • Ecdf (how many babies have a • Number/percent name in the top of vowels 2, 3, 5, 100 etc) • Biblical names? Tuesday, 7 July 2009
  • 7. Transform & summarise transform(df, var1 = expr1, ...) summarise(df, var1 = expr1, ...) Transform modifies an existing data frame. Summarise creates a new data frame. Tuesday, 7 July 2009
  • 8. letter <- function(x, n = 1) { Many interesting if (n < 0) { nc <- nchar(x) transformations and } n <- nc + n + 1 summaries can be tolower(substr(x, n, n)) calculated for the } vowels <- function(x) { whole dataset nchar(gsub("[^aeiou]", "", x)) } bnames <- transform(bnames, first = letter(name, 1), last = letter(name, -1), length = nchar(name), vowels = vowels(name) ) summarise(bnames, max_perc = max(percent), min_perc = min(percent)) Tuesday, 7 July 2009
  • 9. Group-wise What about group-wise transformations or summaries? e.g. what if we want to compute the rank of a name within a sex and year? This task is easy if we have a single year & sex, but hard otherwise. Tuesday, 7 July 2009
  • 10. one <- subset(bnames, sex == "boy" & year == 2008) one$rank <- rank(-one$percent, ties.method = "first") # or one <- transform(one, rank = rank(-percent, ties.method = "first")) head(one) What if we want to transform every sex and year? Tuesday, 7 July 2009
  • 11. # Split pieces <- split(bnames, list(bnames$sex, bnames$year)) # Apply results <- vector("list", length(pieces)) for(i in seq_along(pieces)) { piece <- pieces[[i]] piece <- transform(piece, rank = rank(-percent, ties.method = "first")) results[[i]] <- piece } # Combine result <- do.call("rbind", results) Tuesday, 7 July 2009
  • 12. # Or equivalently bnames <- ddply(bnames, c("sex", "year"), transform, rank = rank(-percent, ties.method = "first")) Tuesday, 7 July 2009
  • 13. Way to split Function to apply to Input data up input each piece # Or equivalently bnames <- ddply(bnames, c("sex", "year"), transform, rank = rank(-percent, ties.method = "first")) 2nd argument to transform() Tuesday, 7 July 2009
  • 14. ddply • .data: data frame to process • .variables: combination of variables to split by • .fun: function to call on each piece • ...: extra arguments passed to .fun Tuesday, 7 July 2009
  • 15. Variable specification syntax • Character: c("sex", "year") • Numeric: 1:3 • Formula: ~ sex + year • Special: • .(sex, year) • .(first = letter(name, 1)) Tuesday, 7 July 2009
  • 16. Match function with use randomisation/permutation scale(x) tests scale to [0, 1] within each rank(x) group scale to mean 0, sd 1 x - min(x) / diff(range(x)) within each group compute per-group x / x[1] rankings sample(x) index a time series Tuesday, 7 July 2009
  • 17. Summaries In a similar way, we can use ddply() for group-wise summaries. There are many base R functions for special cases. Where available, these are often much faster; but you have to know they exist, and have to remember how to use them. Tuesday, 7 July 2009
  • 18. ddply(bnames, c("name"), summarise, tot = sum(percent)) ddply(bnames, c("length"), summarise, tot = sum(percent)) ddply(bnames, c("year", "sex"), summarise, tot = sum(percent)) fl <- ddply(bnames, c("year", "sex", "first"), summarise, tot = sum(percent)) library(ggplot2) qplot(year, tot, data = fl, geom = "line", colour = sex, facets = ~ first) Tuesday, 7 July 2009
  • 19. Challenge Create a plot that shows (by year) the proportion of US children who have a name in the top 100. Extra challenge: break it down by sex. What does this suggest about baby naming trends in the US? Tuesday, 7 July 2009
  • 20. 1.0 0.8 0.6 sex boy tot girl 0.4 0.2 0.0 1880 1900 1920 1940 1960 1980 2000 year Tuesday, 7 July 2009