SlideShare une entreprise Scribd logo
1  sur  44
Télécharger pour lire hors ligne
Stat405Visualising time & space


                            Hadley Wickham
Thursday, 14 October 2010
1. New data: baby names by state
                2. Visualise time (done!)
                3. Visualise time conditional on space
                4. Visualise space
                5. Visualise space conditional on time
                6. Aside: geographic data


Thursday, 14 October 2010
Project
                    Project 2 due November 4.
                    Basically same as project 2, but will be
                    using the full play-by-play data from the
                    08/09 NBA season.
                    I expect to see lots of ddply usage, and
                    more advanced graphics (next week).



Thursday, 14 October 2010
Baby names by state
                    Top 100 male and female baby
                    names for each state, 1960–2008.
                    480,000 records
                    (100 * 50 * 2 * 48)
                    Slightly different variables: state,
                    year, name, sex and number.

                                          CC BY http://www.flickr.com/photos/the_light_show/2586781132
Thursday, 14 October 2010
Subset

                    Easier to compare states if we have
                    proportions. To calculate proportions,
                    need births. Could only find data from
                    1981.
                    Selected 30 names that occurred fairly
                    frequently, and had interesting patterns.



Thursday, 14 October 2010
Aaron Alex Allison Alyssa Angela Ashley
                    Carlos Chelsea Christian Eric Evan
                    Gabriel Jacob Jared Jennifer Jonathan
                    Juan Katherine Kelsey Kevin Matthew
                    Michelle Natalie Nicholas Noah Rebecca
                    Sara Sarah Taylor Thomas




Thursday, 14 October 2010
Getting started

                library(ggplot2)
                library(plyr)

                bnames <- read.csv("interesting-names.csv",
                  stringsAsFactors = F)

                matthew <- subset(bnames, name == "Matthew")




Thursday, 14 October 2010
Time |
                            Space
Thursday, 14 October 2010
0.04




        0.03
 prop




        0.02




        0.01




                            1985   1990          1995   2000   2005
                                          year
Thursday, 14 October 2010
0.04




        0.03
 prop




        0.02




        0.01




                            1985   1990   1995   2000      2005
qplot(year, prop, data = matthew,year
                                  geom = "line", group = state)
Thursday, 14 October 2010
AK       AL         AR         AZ          CA        CO         CT         DC
        0.04
        0.03
        0.02
        0.01
                     DE       FL         GA         HI          IA        ID         IL         IN
        0.04
        0.03
        0.02
        0.01
                     KS       KY         LA         MA          MD        ME         MI         MN
        0.04
        0.03
        0.02
        0.01
                    MO       MS          MT         NC          NE        NH         NJ         NM
        0.04
 prop




        0.03
        0.02
        0.01
                     NV       NY         OH         OK          OR        PA         RI         SC
        0.04
        0.03
        0.02
        0.01
                     SD       TN         TX         UT          VA        VT         WA         WI
        0.04
        0.03
        0.02
        0.01
                    WV       WY
        0.04
        0.03
        0.02
        0.01
               198519952005 19902000 198519952005 19902000 198519952005 19902000 198519952005 19902000
                 19902000 198519952005 19902000 198519952005 19902000 198519952005 19902000 198519952005
                                                         year
Thursday, 14 October 2010
AK       AL         AR         AZ         CA         CO         CT         DC
        0.04
        0.03
        0.02
        0.01
                     DE       FL         GA         HI         IA         ID         IL         IN
        0.04
        0.03
        0.02
        0.01
                     KS       KY         LA         MA         MD         ME         MI         MN
        0.04
        0.03
        0.02
        0.01
                    MO       MS          MT         NC         NE         NH         NJ         NM
        0.04
 prop




        0.03
        0.02
        0.01
                     NV       NY         OH         OK         OR         PA         RI         SC
        0.04
        0.03
        0.02
        0.01
                     SD       TN         TX         UT         VA         VT         WA         WI
        0.04
        0.03
        0.02
        0.01
                    WV       WY
        0.04
        0.03
        0.02
        0.01
               198519952005 19902000 198519952005 19902000 198519952005 19902000 198519952005 19902000
                 19902000 198519952005 19902000 198519952005 19902000 198519952005 19902000 198519952005

last_plot() + facet_wrap(~ state)year
Thursday, 14 October 2010
Your turn

                    Ensure that you can re-create these plots
                    for other names. What do you see?
                    Can you write a function that plots the
                    trend for a given name?




Thursday, 14 October 2010
show_name <- function(name) {
       name <- bnames[bnames$name == name, ]
       qplot(year, prop, data = name, geom = "line",
         group = state)
     }

     show_name("Jessica")
     show_name("Aaron")
     show_name("Juan") + facet_wrap(~ state)




Thursday, 14 October 2010
0.04




        0.03
 prop




        0.02




        0.01




                            1985   1990          1995   2000   2005
                                          year
Thursday, 14 October 2010
0.04




        0.03
 prop




        0.02




        0.01




qplot(year, prop, data = matthew, geom1995 "line", 2000
                1985       1990         =          group = state) +
                                                              2005
  geom_smooth(aes(group = 1), se year size = 3)
                                 = F,
Thursday, 14 October 2010
0.04




        0.03
 prop




        0.02




        0.01


                         So we only get one smooth
                             for the whole dataset
qplot(year, prop, data = matthew, geom1995 "line", 2000
                1985       1990          =         group = state) +
                                                              2005
  geom_smooth(aes(group = 1), se year size = 3)
                                   = F,
Thursday, 14 October 2010
Three useful tools
                    Smoothing: can be easier to perceive
                    overall trend by smoothing individual
                    functions
                    Centering: remove differences in center
                    by subtracting mean
                    Scaling: remove differences in range by
                    dividing by sd, or by scaling to [0, 1]


Thursday, 14 October 2010
library(mgcv)
     smooth <- function(y, x, amount = 0.1) {
       mod <- gam(y ~ s(x, bs = "cr"), sp = amount)
       as.numeric(predict(mod))
     }

     matthew <- ddply(matthew, "state", transform,
       prop_s1 = smooth(prop, year, amount = 0.01),
       prop_s2 = smooth(prop, year, amount = 0.1),
       prop_s3 = smooth(prop, year, amount = 1),
       prop_s4 = smooth(prop, year, amount = 10))

     qplot(year, prop_s1, data = matthew, geom = "line",
       group = state)

Thursday, 14 October 2010
center <- function(x) x - mean(x, na.rm = T)

     matthew <- ddply(matthew, "state", transform,
       prop_c = center(prop),
       prop_sc = center(prop_s1))

     qplot(year, prop_c, data = matthew, geom = "line",
       group = state)
     qplot(year, prop_sc, data = matthew, geom = "line",
       group = state)




Thursday, 14 October 2010
scale <- function(x) x / sd(x, na.rm = T)
     scale01 <- function(x) {
       rng <- range(x, na.rm = T)
       (x - rng[1]) / (rng[2] - rng[1])
     }

     matthew <- ddply(matthew, "state", transform,
       prop_ss = scale01(prop_s1))

     qplot(year, prop_ss, data = matthew, geom = "line",
       group = state)



Thursday, 14 October 2010
Your turn
                    Create a plot to show all names
                    simultaneously. Does smoothing every
                    name in every state make it easier to see
                    patterns?
                    Hint: run the following R code on the next
                    slide to eliminate names with less than 10
                    years of data


Thursday, 14 October 2010
longterm <- ddply(bnames, c("name", "state"),
     function(df) {
        if (nrow(df) > 10) {
          df
        }
     })




Thursday, 14 October 2010
qplot(year, prop, data = bnames, geom = "line",
       group = state, alpha = I(1 / 4)) +
       facet_wrap(~ name)

     longterm <- ddply(longterm, c("name", "state"),
       transform, prop_s = smooth(prop, year))

     qplot(year, prop_s, data = longterm, geom = "line",
       group = state, alpha = I(1 / 4)) +
       facet_wrap(~ name)
     last_plot() + facet_wrap(~ name, scales = "free_y")



Thursday, 14 October 2010
Space

Thursday, 14 October 2010
Spatial plots

                    Choropleth map:
                    map colour of areas to value.
                    Proportional symbol map:
                    map size of symbols to value




Thursday, 14 October 2010
juan2000 <- subset(bnames, name == "Juan" &
       year == 2000)

     # Turn map data into normal data frame
     library(maps)
     states <- map_data("state")
     states$state <- state.abb[match(states$region,
       tolower(state.name))]

     # Join datasets
     choropleth <- join(states, juan2000, by = "state")

     # Plot with polygons
     qplot(long, lat, data = choropleth, geom = "polygon",
       fill = prop, group = group)

Thursday, 14 October 2010
45




       40
                                                                  prop
                                                                         0.004
                                                                         0.006
 lat




                                                                         0.008
       35
                                                                          0.01




       30




                        −120   −110   −100      −90   −80   −70
                                         long
Thursday, 14 October 2010
What’s the problem
                                                    with this map?
       45                                       How could we fix it?


       40
                                                                      prop
                                                                             0.004
                                                                             0.006
 lat




                                                                             0.008
       35
                                                                              0.01




       30




                        −120   −110   −100      −90    −80    −70
                                         long
Thursday, 14 October 2010
ggplot(choropleth, aes(long, lat, group = group)) +
       geom_polygon(fill = "white", colour = "grey50") +
       geom_polygon(aes(fill = prop))




Thursday, 14 October 2010
45




       40
                                                                  prop
                                                                         0.004
                                                                         0.006
 lat




                                                                         0.008
       35
                                                                          0.01




       30




                        −120   −110   −100      −90   −80   −70
                                         long
Thursday, 14 October 2010
Problems?

                    What are the problems with this sort of
                    plot?
                    Take one minute to brainstorm some
                    possible issues.




Thursday, 14 October 2010
Problems
                    Big areas most striking. But in the US (as
                    with most countries) big areas tend to
                    least populated. Most populated areas
                    tend to be small and dense - e.g. the East
                    coast.
                    (Another computational problem: need to
                    push around a lot of data to create these
                    plots)


Thursday, 14 October 2010
●




       45
                            ●

                                                                                    ●




                                                                                            ●
                                                     ●




       40                                                        ●
                                                                                        ●



                                               ●                                                      prop
                                    ●                    ●
                                                                                                       ●     0.004
                                ●                                                                     ●      0.006
 lat




                                                         ●                     ●
                                                                                                      ●      0.008
       35
                                        ●      ●                                                      ●      0.010

                                                                          ●




                                                   ●
       30

                                                                      ●




                        −120            −110       −100         −90           −80               −70
                                                         long
Thursday, 14 October 2010
mid_range <- function(x) mean(range(x))
     centres <- ddply(states, c("state"), summarise,
       lat = mid_range(lat), long = mid_range(long))

     bubble <- join(juan2000, centres, by = "state")
     qplot(long, lat, data = bubble,
       size = prop, colour = prop)

     ggplot(bubble, aes(long, lat)) +
       geom_polygon(aes(group = group), data = states,
         fill = NA, colour = "grey50") +
       geom_point(aes(size = prop, colour = prop))


Thursday, 14 October 2010
Your turn


                    Replicate either a choropleth or a
                    proportional symbol map with the name
                    of your choice.




Thursday, 14 October 2010
Space |
                             Time
Thursday, 14 October 2010
Thursday, 14 October 2010
Your turn


                    Try and create this plot yourself. What is
                    the main difference between this plot and
                    the previous?




Thursday, 14 October 2010
juan <- subset(bnames, name == "Juan")
     bubble <- merge(juan, centres, by = "state")

     ggplot(bubble, aes(long, lat)) +
       geom_polygon(aes(group = group), data = states,
         fill = NA, colour = "grey80") +
       geom_point(aes(colour = prop)) +
       facet_wrap(~ year)




Thursday, 14 October 2010
Aside: geographic data

                    Boundaries for most countries available
                    from: http://gadm.org
                    To use with ggplot2, use the fortify
                    function to convert to usual data frame.
                    Will also need to install the sp package.



Thursday, 14 October 2010
# install.packages("sp")

     library(sp)
     load(url("http://gadm.org/data/rda/CHE_adm1.RData"))

     head(as.data.frame(gadm))
     ch <- fortify(gadm, region = "ID_1")
     str(ch)

     qplot(long, lat, group = group, data = ch,
       geom = "polygon", colour = I("white"))



Thursday, 14 October 2010
Thursday, 14 October 2010
This work is licensed under the Creative
       Commons Attribution-Noncommercial 3.0 United
       States License. To view a copy of this license,
       visit http://creativecommons.org/licenses/by-nc/
       3.0/us/ or send a letter to Creative Commons,
       171 Second Street, Suite 300, San Francisco,
       California, 94105, USA.




Thursday, 14 October 2010

Contenu connexe

Plus de Hadley Wickham (20)

23 data-structures
23 data-structures23 data-structures
23 data-structures
 
R packages
R packagesR packages
R packages
 
22 spam
22 spam22 spam
22 spam
 
21 spam
21 spam21 spam
21 spam
 
19 tables
19 tables19 tables
19 tables
 
17 polishing
17 polishing17 polishing
17 polishing
 
14 case-study
14 case-study14 case-study
14 case-study
 
13 case-study
13 case-study13 case-study
13 case-study
 
12 adv-manip
12 adv-manip12 adv-manip
12 adv-manip
 
11 adv-manip
11 adv-manip11 adv-manip
11 adv-manip
 
11 adv-manip
11 adv-manip11 adv-manip
11 adv-manip
 
10 simulation
10 simulation10 simulation
10 simulation
 
10 simulation
10 simulation10 simulation
10 simulation
 
09 bootstrapping
09 bootstrapping09 bootstrapping
09 bootstrapping
 
08 functions
08 functions08 functions
08 functions
 
07 problem-solving
07 problem-solving07 problem-solving
07 problem-solving
 
06 data
06 data06 data
06 data
 
05 subsetting
05 subsetting05 subsetting
05 subsetting
 
04 reports
04 reports04 reports
04 reports
 
03 extensions
03 extensions03 extensions
03 extensions
 

Dernier

Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 

Dernier (20)

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 

15 time-space

  • 1. Stat405Visualising time & space Hadley Wickham Thursday, 14 October 2010
  • 2. 1. New data: baby names by state 2. Visualise time (done!) 3. Visualise time conditional on space 4. Visualise space 5. Visualise space conditional on time 6. Aside: geographic data Thursday, 14 October 2010
  • 3. Project Project 2 due November 4. Basically same as project 2, but will be using the full play-by-play data from the 08/09 NBA season. I expect to see lots of ddply usage, and more advanced graphics (next week). Thursday, 14 October 2010
  • 4. Baby names by state Top 100 male and female baby names for each state, 1960–2008. 480,000 records (100 * 50 * 2 * 48) Slightly different variables: state, year, name, sex and number. CC BY http://www.flickr.com/photos/the_light_show/2586781132 Thursday, 14 October 2010
  • 5. Subset Easier to compare states if we have proportions. To calculate proportions, need births. Could only find data from 1981. Selected 30 names that occurred fairly frequently, and had interesting patterns. Thursday, 14 October 2010
  • 6. Aaron Alex Allison Alyssa Angela Ashley Carlos Chelsea Christian Eric Evan Gabriel Jacob Jared Jennifer Jonathan Juan Katherine Kelsey Kevin Matthew Michelle Natalie Nicholas Noah Rebecca Sara Sarah Taylor Thomas Thursday, 14 October 2010
  • 7. Getting started library(ggplot2) library(plyr) bnames <- read.csv("interesting-names.csv", stringsAsFactors = F) matthew <- subset(bnames, name == "Matthew") Thursday, 14 October 2010
  • 8. Time | Space Thursday, 14 October 2010
  • 9. 0.04 0.03 prop 0.02 0.01 1985 1990 1995 2000 2005 year Thursday, 14 October 2010
  • 10. 0.04 0.03 prop 0.02 0.01 1985 1990 1995 2000 2005 qplot(year, prop, data = matthew,year geom = "line", group = state) Thursday, 14 October 2010
  • 11. AK AL AR AZ CA CO CT DC 0.04 0.03 0.02 0.01 DE FL GA HI IA ID IL IN 0.04 0.03 0.02 0.01 KS KY LA MA MD ME MI MN 0.04 0.03 0.02 0.01 MO MS MT NC NE NH NJ NM 0.04 prop 0.03 0.02 0.01 NV NY OH OK OR PA RI SC 0.04 0.03 0.02 0.01 SD TN TX UT VA VT WA WI 0.04 0.03 0.02 0.01 WV WY 0.04 0.03 0.02 0.01 198519952005 19902000 198519952005 19902000 198519952005 19902000 198519952005 19902000 19902000 198519952005 19902000 198519952005 19902000 198519952005 19902000 198519952005 year Thursday, 14 October 2010
  • 12. AK AL AR AZ CA CO CT DC 0.04 0.03 0.02 0.01 DE FL GA HI IA ID IL IN 0.04 0.03 0.02 0.01 KS KY LA MA MD ME MI MN 0.04 0.03 0.02 0.01 MO MS MT NC NE NH NJ NM 0.04 prop 0.03 0.02 0.01 NV NY OH OK OR PA RI SC 0.04 0.03 0.02 0.01 SD TN TX UT VA VT WA WI 0.04 0.03 0.02 0.01 WV WY 0.04 0.03 0.02 0.01 198519952005 19902000 198519952005 19902000 198519952005 19902000 198519952005 19902000 19902000 198519952005 19902000 198519952005 19902000 198519952005 19902000 198519952005 last_plot() + facet_wrap(~ state)year Thursday, 14 October 2010
  • 13. Your turn Ensure that you can re-create these plots for other names. What do you see? Can you write a function that plots the trend for a given name? Thursday, 14 October 2010
  • 14. show_name <- function(name) { name <- bnames[bnames$name == name, ] qplot(year, prop, data = name, geom = "line", group = state) } show_name("Jessica") show_name("Aaron") show_name("Juan") + facet_wrap(~ state) Thursday, 14 October 2010
  • 15. 0.04 0.03 prop 0.02 0.01 1985 1990 1995 2000 2005 year Thursday, 14 October 2010
  • 16. 0.04 0.03 prop 0.02 0.01 qplot(year, prop, data = matthew, geom1995 "line", 2000 1985 1990 = group = state) + 2005 geom_smooth(aes(group = 1), se year size = 3) = F, Thursday, 14 October 2010
  • 17. 0.04 0.03 prop 0.02 0.01 So we only get one smooth for the whole dataset qplot(year, prop, data = matthew, geom1995 "line", 2000 1985 1990 = group = state) + 2005 geom_smooth(aes(group = 1), se year size = 3) = F, Thursday, 14 October 2010
  • 18. Three useful tools Smoothing: can be easier to perceive overall trend by smoothing individual functions Centering: remove differences in center by subtracting mean Scaling: remove differences in range by dividing by sd, or by scaling to [0, 1] Thursday, 14 October 2010
  • 19. library(mgcv) smooth <- function(y, x, amount = 0.1) { mod <- gam(y ~ s(x, bs = "cr"), sp = amount) as.numeric(predict(mod)) } matthew <- ddply(matthew, "state", transform, prop_s1 = smooth(prop, year, amount = 0.01), prop_s2 = smooth(prop, year, amount = 0.1), prop_s3 = smooth(prop, year, amount = 1), prop_s4 = smooth(prop, year, amount = 10)) qplot(year, prop_s1, data = matthew, geom = "line", group = state) Thursday, 14 October 2010
  • 20. center <- function(x) x - mean(x, na.rm = T) matthew <- ddply(matthew, "state", transform, prop_c = center(prop), prop_sc = center(prop_s1)) qplot(year, prop_c, data = matthew, geom = "line", group = state) qplot(year, prop_sc, data = matthew, geom = "line", group = state) Thursday, 14 October 2010
  • 21. scale <- function(x) x / sd(x, na.rm = T) scale01 <- function(x) { rng <- range(x, na.rm = T) (x - rng[1]) / (rng[2] - rng[1]) } matthew <- ddply(matthew, "state", transform, prop_ss = scale01(prop_s1)) qplot(year, prop_ss, data = matthew, geom = "line", group = state) Thursday, 14 October 2010
  • 22. Your turn Create a plot to show all names simultaneously. Does smoothing every name in every state make it easier to see patterns? Hint: run the following R code on the next slide to eliminate names with less than 10 years of data Thursday, 14 October 2010
  • 23. longterm <- ddply(bnames, c("name", "state"), function(df) { if (nrow(df) > 10) { df } }) Thursday, 14 October 2010
  • 24. qplot(year, prop, data = bnames, geom = "line", group = state, alpha = I(1 / 4)) + facet_wrap(~ name) longterm <- ddply(longterm, c("name", "state"), transform, prop_s = smooth(prop, year)) qplot(year, prop_s, data = longterm, geom = "line", group = state, alpha = I(1 / 4)) + facet_wrap(~ name) last_plot() + facet_wrap(~ name, scales = "free_y") Thursday, 14 October 2010
  • 26. Spatial plots Choropleth map: map colour of areas to value. Proportional symbol map: map size of symbols to value Thursday, 14 October 2010
  • 27. juan2000 <- subset(bnames, name == "Juan" & year == 2000) # Turn map data into normal data frame library(maps) states <- map_data("state") states$state <- state.abb[match(states$region, tolower(state.name))] # Join datasets choropleth <- join(states, juan2000, by = "state") # Plot with polygons qplot(long, lat, data = choropleth, geom = "polygon", fill = prop, group = group) Thursday, 14 October 2010
  • 28. 45 40 prop 0.004 0.006 lat 0.008 35 0.01 30 −120 −110 −100 −90 −80 −70 long Thursday, 14 October 2010
  • 29. What’s the problem with this map? 45 How could we fix it? 40 prop 0.004 0.006 lat 0.008 35 0.01 30 −120 −110 −100 −90 −80 −70 long Thursday, 14 October 2010
  • 30. ggplot(choropleth, aes(long, lat, group = group)) + geom_polygon(fill = "white", colour = "grey50") + geom_polygon(aes(fill = prop)) Thursday, 14 October 2010
  • 31. 45 40 prop 0.004 0.006 lat 0.008 35 0.01 30 −120 −110 −100 −90 −80 −70 long Thursday, 14 October 2010
  • 32. Problems? What are the problems with this sort of plot? Take one minute to brainstorm some possible issues. Thursday, 14 October 2010
  • 33. Problems Big areas most striking. But in the US (as with most countries) big areas tend to least populated. Most populated areas tend to be small and dense - e.g. the East coast. (Another computational problem: need to push around a lot of data to create these plots) Thursday, 14 October 2010
  • 34. 45 ● ● ● ● 40 ● ● ● prop ● ● ● 0.004 ● ● 0.006 lat ● ● ● 0.008 35 ● ● ● 0.010 ● ● 30 ● −120 −110 −100 −90 −80 −70 long Thursday, 14 October 2010
  • 35. mid_range <- function(x) mean(range(x)) centres <- ddply(states, c("state"), summarise, lat = mid_range(lat), long = mid_range(long)) bubble <- join(juan2000, centres, by = "state") qplot(long, lat, data = bubble, size = prop, colour = prop) ggplot(bubble, aes(long, lat)) + geom_polygon(aes(group = group), data = states, fill = NA, colour = "grey50") + geom_point(aes(size = prop, colour = prop)) Thursday, 14 October 2010
  • 36. Your turn Replicate either a choropleth or a proportional symbol map with the name of your choice. Thursday, 14 October 2010
  • 37. Space | Time Thursday, 14 October 2010
  • 39. Your turn Try and create this plot yourself. What is the main difference between this plot and the previous? Thursday, 14 October 2010
  • 40. juan <- subset(bnames, name == "Juan") bubble <- merge(juan, centres, by = "state") ggplot(bubble, aes(long, lat)) + geom_polygon(aes(group = group), data = states, fill = NA, colour = "grey80") + geom_point(aes(colour = prop)) + facet_wrap(~ year) Thursday, 14 October 2010
  • 41. Aside: geographic data Boundaries for most countries available from: http://gadm.org To use with ggplot2, use the fortify function to convert to usual data frame. Will also need to install the sp package. Thursday, 14 October 2010
  • 42. # install.packages("sp") library(sp) load(url("http://gadm.org/data/rda/CHE_adm1.RData")) head(as.data.frame(gadm)) ch <- fortify(gadm, region = "ID_1") str(ch) qplot(long, lat, group = group, data = ch, geom = "polygon", colour = I("white")) Thursday, 14 October 2010
  • 44. This work is licensed under the Creative Commons Attribution-Noncommercial 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/ 3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA. Thursday, 14 October 2010