plyr
                       Modelling large data


                          Hadley Wickham
Tuesday, 7 July 2009
1. Strategy for analysing large data.
                2. Introduction to the Texas housing
                   data.
                3. What’s happening in Houston?
                4. Using models as tools
                5. Using models in their own right


Large data strategy
                       Start with a single unit, and identify
                       interesting patterns.
                       Summarise patterns with a model.
                       Apply model to all units.
                       Look for units that don’t fit the pattern.
                       Summarise with a single model.


Texas housing data
                       For each metropolitan area (45) in Texas,
                       for each month from 2000 to 2009 (112):
                       Number of houses listed and sold
                       Total value of houses, and average sale
                       price
                       Average time on market


CC BY http://www.flickr.com/photos/imagesbywestfall/3510831277/
Strategy

                       Start with a single city (Houston).
                       Explore patterns & fit models.
                       Apply models to all cities.




[Plot: avgprice against date (2000 to 2008)]
[Plot: sales against date (2000 to 2008)]
[Plot: onmarket against date (2000 to 2008)]

Seasonal trends

                       Make it much harder to see the long-term
                       trend. How can we remove the seasonal pattern?
                       (Many sophisticated techniques from time
                       series, but what’s the simplest thing that
                       might work?)




[Plot: avgprice against month (1 to 12)]

Challenge
                       What does the following function do?
                       deseas <- function(var, month) {
                         resid(lm(var ~ factor(month))) +
                           mean(var, na.rm = TRUE)
                       }
                       How could you use it in conjunction with
                       transform to deseasonalise the data? What if
                       you wanted to deseasonalise every city?
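
One way to read deseas: with complete data, the residuals of lm(var ~ factor(month)) are each value minus the mean for its month, so the function subtracts the monthly mean and adds back the overall mean. The sketch below checks that reading on made-up data. The series, deseas_by_hand and deseas_na are illustrative names, not part of the talk, and the na.exclude variant is only an assumption about how missing values might be handled (with the default NA handling the residual vector comes back shorter than the input).

```r
# deseas as shown on the slide
deseas <- function(var, month) {
  resid(lm(var ~ factor(month))) +
    mean(var, na.rm = TRUE)
}

# Made-up monthly series: 3 years, a seasonal bump plus a slow upward trend
set.seed(1)
month <- rep(1:12, times = 3)
trend <- seq(100, 135, length.out = 36)
sales <- trend + 20 * sin(2 * pi * month / 12) + rnorm(36)

# Hypothetical by-hand equivalent: subtract each month's mean,
# then add the overall mean back
deseas_by_hand <- function(var, month) {
  var - ave(var, month) + mean(var)
}

ds <- deseas(sales, month)
all.equal(unname(ds), deseas_by_hand(sales, month))

# Assumed NA-tolerant variant: na.exclude pads the residuals with NA,
# so the result keeps the same length as the input
deseas_na <- function(var, month) {
  resid(lm(var ~ factor(month), na.action = na.exclude)) +
    mean(var, na.rm = TRUE)
}

sales_na <- sales
sales_na[5] <- NA
length(deseas_na(sales_na, month)) == length(sales_na)
```

Keeping the output the same length as the input matters when the result is assigned back as a column with transform.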


houston <- transform(houston,
  avgprice_ds = deseas(avgprice, month),
  listings_ds = deseas(listings, month),
  sales_ds = deseas(sales, month),
  onmarket_ds = deseas(onmarket, month)
)

# avg is not defined in this section
qplot(month, sales_ds, data = houston,
  geom = "line", group = year) + avg




[Plot: avgprice_ds against month (1 to 12)]

Models as tools
                       Here we’re using the linear model as a
                       tool - we don’t care about the coefficients
                       or the standard errors, just using it to get
                       rid of a striking pattern.
                       Tukey described this pattern as residuals
                       and reiteration: by removing a striking
                       pattern we can see more subtle patterns.


[Plot: avgprice_ds against date (2000 to 2008)]
[Plot: sales_ds against date (2000 to 2008)]
[Plot: onmarket_ds against date (2000 to 2008)]

Summary

                       Most variables seem to be a combination of
                       a strong seasonal pattern plus a weaker
                       long-term trend.
                       How do these patterns hold up for the
                       rest of Texas? We’ll focus on sales.




[Plot: sales against date (2000 to 2008)]

library(plyr)     # ddply
library(ggplot2)  # qplot

tx <- read.csv("tx-house-sales.csv")
qplot(date, sales, data = tx, geom = "line",
  group = city)

# Deseasonalise sales separately within each city
tx <- ddply(tx, "city", transform,
  sales_ds = deseas(sales, month))
qplot(date, sales_ds, data = tx, geom = "line",
  group = city)




[Plot: sales_ds against date (2000 to 2008)]

It works, but...
                       It doesn’t give us any insight into the
                       similarity of the patterns across multiple
                       cities. Are the trends the same or
                       different?
                       So instead of throwing the models away
                       and just using the residuals, let’s keep the
                       models and explore them in more depth.


Two new tools
                       dlply: takes a data frame, splits up in the
                       same way as ddply, applies function to
                       each piece and combines the results into a
                       list
                       ldply: takes a list, splits up into elements,
                       applies function to each piece and then
                       combines the results into a data frame
                       dlply + ldply = ddply
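
The identity dlply + ldply = ddply can be tried out on a tiny made-up data frame. In this sketch df, mean_sales, one_step and two_step are illustrative names only; the two results should contain the same rows.

```r
library(plyr)

# Made-up data: two groups, three values each
df <- data.frame(
  city  = rep(c("A", "B"), each = 3),
  sales = c(1, 2, 3, 10, 20, 30)
)

# Summary applied to each piece (returns a one-row data frame)
mean_sales <- function(piece) data.frame(mean = mean(piece$sales))

# One step: data frame in, data frame out
one_step <- ddply(df, "city", mean_sales)

# Two steps: data frame -> list of pieces, then list -> data frame
pieces   <- dlply(df, "city", identity)
two_step <- ldply(pieces, mean_sales)

one_step
two_step
```

Splitting the work in two is useful when the intermediate list (here the pieces, later the fitted models) is worth keeping around.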


models <- dlply(tx, "city", function(df)
  lm(sales ~ factor(month), data = df))

models[[1]]
coef(models[[1]])

ldply(models, coef)




Labelling

                       Notice we didn’t have to do anything to
                       have the coefficients labelled correctly.
                       Behind the scenes plyr records the labels
                       used for the split step, and ensures they
                       are preserved across multiple plyr calls.
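
A small check of this labelling behaviour on made-up data. The toy data frame is hypothetical and only stands in for tx; the point is that the split variable reappears as the first column of the combined result without being passed along by hand.

```r
library(plyr)

# Made-up data standing in for tx: two cities, 24 months each
set.seed(2)
toy <- data.frame(
  city  = rep(c("Abilene", "Houston"), each = 24),
  month = rep(1:12, times = 4),
  sales = c(rpois(24, 150), rpois(24, 5000))
)

models <- dlply(toy, "city", function(df)
  lm(sales ~ factor(month), data = df))

# The list elements are named after the split variable
names(models)

# ldply carries the label through as a leading "city" column,
# followed by one column per coefficient
coefs <- ldply(models, coef)
names(coefs)[1]
dim(coefs)
```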




Back to the model

                       What are some problems with this model?
                       How could you fix them?
                       Is the format of the coefficients optimal?
                       Turn to the person next to you and
                       discuss for 2 minutes.



qplot(date, log10(sales), data = tx, geom = "line",
  group = city)

models2 <- dlply(tx, "city", function(df)
  lm(log10(sales) ~ factor(month), data = df))

coef2 <- ldply(models2, function(mod) {
  data.frame(
    month = 1:12,
    effect = c(0, coef(mod)[-1]),
    intercept = coef(mod)[1])
})


qplot(date, log10(sales), data = tx, geom = "line",
  group = city)

# Log transform sales to make coefficients comparable (ratios)
models2 <- dlply(tx, "city", function(df)
  lm(log10(sales) ~ factor(month), data = df))

# Puts coefficients in rows, so they can be plotted more easily
coef2 <- ldply(models2, function(mod) {
  data.frame(
    month = 1:12,
    effect = c(0, coef(mod)[-1]),
    intercept = coef(mod)[1])
})
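
Why the log transform gives ratios: a month effect in this model is a difference of logs, log10(sales in month m) minus log10(sales in the baseline month), so 10 ^ effect is sales in month m divided by sales in the baseline month (month 1, the first factor level). A minimal sketch on made-up, noise-free data where month 7 sales are exactly twice month 1 sales; toy, effect and ratio are illustrative names.

```r
# Made-up data: sales double in month 7 relative to month 1, no noise
toy <- data.frame(
  month = rep(c(1, 7), times = 5),
  sales = rep(c(100, 200), times = 5)
)

mod <- lm(log10(sales) ~ factor(month), data = toy)

effect <- coef(mod)[-1]   # log10 difference from the baseline month
ratio  <- 10 ^ effect     # multiplicative effect relative to month 1
ratio
```

Because ratios do not depend on the size of a city, they can be compared across cities in a way that raw differences in sales cannot.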

[Plot: effect against month (1 to 12), one line per city]
qplot(month, effect, data = coef2, group = city, geom = "line")

[Plot: 10^effect against month (1 to 12), one line per city]
qplot(month, 10 ^ effect, data = coef2, group = city, geom = "line")

[Plot: 10^effect against month (1 to 12), one panel per city: Abilene, Amarillo, Arlington, Austin, Bay Area, Beaumont, Brazoria County, Brownsville, Bryan-College Station, Collin County, Corpus Christi, Dallas, Denton County, El Paso, Fort Bend, Fort Worth, Galveston, Garland, Harlingen, Houston, Irving, Killeen-Fort Hood, Laredo, Longview-Marshall, Lubbock, Lufkin, McAllen, Midland, Montgomery County, Nacogdoches, NE Tarrant County, Odessa, Palestine, Paris, Port Arthur, San Angelo, San Antonio, San Marcos, Sherman-Denison, Temple-Belton, Texarkana, Tyler, Victoria, Waco, Wichita Falls]
qplot(month, 10 ^ effect, data = coef2, geom = "line") + facet_wrap(~ city)

What should
                               we do next?

                       What do you think?
                       You have 30 seconds to come up with (at
                       least) one idea.




My ideas

                       Fit a single model, log(sales) ~ city *
                       factor(month), and look at residuals
                       Fit individual models, log(sales) ~
                       factor(month) + ns(date, 3), look for cities
                       that don’t fit



# One approach - fit a single model

mod <- lm(log10(sales) ~ city + factor(month),
  data = tx)

tx$sales2 <- 10 ^ resid(mod)
qplot(date, sales2, data = tx, geom = "line",
  group = city)
last_plot() + facet_wrap(~ city)
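
Since resid(mod) is on the log10 scale, 10 ^ resid(mod) can be read as observed sales divided by the back-transformed fitted value, so values near 1 mean the city and month terms describe that observation well. A small check of that identity on made-up data; toy and its columns are illustrative stand-ins for tx, not the real data.

```r
# Made-up data: two cities of very different size, 12 months each
set.seed(3)
toy <- expand.grid(month = 1:12, city = c("Small", "Large"))
toy$sales <- ifelse(toy$city == "Small", 100, 5000) *
  (1 + 0.3 * sin(2 * pi * toy$month / 12)) *
  exp(rnorm(24, sd = 0.05))

mod <- lm(log10(sales) ~ city + factor(month), data = toy)

# 10 ^ residual equals observed sales over the back-transformed fit
ratio <- 10 ^ resid(mod)
all.equal(unname(ratio), toy$sales / 10 ^ unname(fitted(mod)))
range(ratio)
```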





[Plot: sales2 against date (2000 to 2008), one line per city]
qplot(date, sales2, data = tx, geom = "line", group = city)

[Plot: sales2 against date (2000 to 2008), one panel per city, Abilene to Wichita Falls]
last_plot() + facet_wrap(~ city)

# Another approach: Essence of most cities is seasonal
# term plus long term smooth trend. We could fit this
# model to each city, and then look for models which don't
# fit well.

library(splines)
models3 <- dlply(tx, "city", function(df) {
  lm(log10(sales) ~ factor(month) + ns(date, 3), data = df)
})

# Extract rsquared from each model
rsq <- function(mod) c(rsq = summary(mod)$r.squared)
quality <- ldply(models3, rsq)





[Plot: dot plot of rsq (axis from about 0.5 to 0.9) for each city, cities in alphabetical order]
qplot(rsq, city, data = quality)

[Plot: dot plot of rsq for each city, cities ordered by rsq, from San Antonio, Montgomery County, Houston and Dallas at the top down to Paris, Palestine, El Paso and Port Arthur at the bottom]
How are the good fits different from the bad fits?
qplot(rsq, reorder(city, rsq), data = quality)

quality$poor <- quality$rsq < 0.7
tx2 <- merge(tx, quality, by = "city")

mfit <- ldply(models3, function(mod) {
  data.frame(
    resid = resid(mod),
    pred = predict(mod))
})
tx2 <- cbind(tx2, mfit[, -1])

Can you think of any potential problems with this line?
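
Assuming the question refers to the cbind() call, one possible concern is that cbind() pairs rows purely by position: merge() may reorder the rows of tx, and lm() drops rows with missing values, so mfit could come back in a different order or with fewer rows than tx2. This is one reading of the question, not the answer given in the talk. A hedged alternative is to attach the residuals and predictions inside each city's piece, so every value stays on the row it was computed from. The data below is made up, and toy and toy2 are illustrative names.

```r
library(plyr)
library(splines)

# Made-up data standing in for tx: two cities, 36 months, one missing value
set.seed(4)
toy <- expand.grid(month = 1:12, year = 2000:2002, city = c("A", "B"))
toy$date  <- toy$year + (toy$month - 1) / 12
toy$sales <- round(1000 * (1 + 0.3 * sin(2 * pi * toy$month / 12)) *
  exp(rnorm(nrow(toy), sd = 0.05)))
toy$sales[5] <- NA

# Fit within each city and add the columns to that city's own rows
toy2 <- ddply(toy, "city", function(df) {
  mod <- lm(log10(sales) ~ factor(month) + ns(date, 3), data = df,
    na.action = na.exclude)
  df$resid <- resid(mod)    # NA-padded because of na.exclude
  df$pred  <- fitted(mod)   # likewise NA where sales is missing
  df
})

nrow(toy2) == nrow(toy)
sum(is.na(toy2$resid))
```

The trade-off is that the models are refitted here rather than reused from models3; the sketch favours row alignment over avoiding the second fit.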


Tuesday, 7 July 2009
Abilene          Amarillo        Arlington        Austin       Bay Area      Beaumont      Brazoria County
                3.5
                3.0
                2.5
                2.0
                1.5
                1.0
                0.5
                         BrownsvilleBryan−College Station
                                                        Collin County   Corpus Christi    Dallas     Denton County      El Paso
                3.5
                3.0
                2.5
                2.0
                1.5
                1.0
                0.5
                         Fort Bend        Fort Worth       Galveston       Garland       Harlingen     Houston           Irving
                3.5
                3.0
                2.5
                2.0
                1.5
                1.0
                0.5
 log10(sales)




                      Killeen−Fort Hood    Laredo      Longview−Marshall   Lubbock        Lufkin        McAllen         Midland
                3.5
                3.0
                2.5
                2.0
                1.5
                1.0
                0.5
                  Montgomery CountyNacogdoches NE Tarrant County Odessa      Palestine     Paris   Port Arthur
                3.5
                3.0
                2.5
                2.0
                1.5
                1.0
                0.5
                      San Angelo    San Antonio  San Marcos Sherman−DenisonTemple−Belton Texarkana    Tyler
                3.5
                3.0
                2.5
                2.0
                1.5
                1.0
                0.5
                        Victoria      Waco       Wichita Falls
                3.5
                3.0
                2.5
                2.0
                1.5
                1.0
                0.5
                   2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008
                     2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006   Raw data
                                                                           date
Tuesday, 7 July 2009
Abilene          Amarillo        Arlington        Austin       Bay Area      Beaumont      Brazoria County
        3.5
        3.0
        2.5
        2.0
        1.5
        1.0
                 BrownsvilleBryan−College Station
                                                Collin County   Corpus Christi    Dallas     Denton County      El Paso
        3.5
        3.0
        2.5
        2.0
        1.5
        1.0
                 Fort Bend        Fort Worth       Galveston       Garland       Harlingen     Houston           Irving
        3.5
        3.0
        2.5
        2.0
        1.5
        1.0
              Killeen−Fort Hood    Laredo      Longview−Marshall   Lubbock        Lufkin        McAllen         Midland
        3.5
        3.0
 pred




        2.5
        2.0
        1.5
        1.0
          Montgomery CountyNacogdoches NE Tarrant County Odessa      Palestine     Paris   Port Arthur
        3.5
        3.0
        2.5
        2.0
        1.5
        1.0
              San Angelo    San Antonio  San Marcos Sherman−DenisonTemple−Belton Texarkana    Tyler
        3.5
        3.0
        2.5
        2.0
        1.5
        1.0
                Victoria      Waco       Wichita Falls
        3.5
        3.0
        2.5
        2.0
        1.5
        1.0
                                                                                                  Predictions
           2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008
             2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006 2000 2004 2008 2002 2006
                                                                   date
Tuesday, 7 July 2009
[Figure: Residuals. Model residuals (resid) plotted against date, one panel per city.]
Conclusions
                       Simple (and relatively small) example, but
                       shows how collections of models can be
                       useful for gaining understanding.
                       Each attempt illustrated something new
                       about the data.
                       Plyr made it easy to create and summarise
                       collections of models, so we didn’t have to
                       worry about the mechanics.
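
As a recap of the pattern the talk relies on, here is a minimal sketch of creating and summarising a collection of models with plyr. The `toy` data frame, its three city names and the simulated sales values are made up for illustration and are not the Texas housing data; only `dlply`, `ldply` and the `rsq` helper mirror what the slides use.

```r
library(plyr)

# Hypothetical toy data (not the Texas data): 3 cities, 24 months each,
# with a seasonal pattern plus a little noise on the log10 scale.
set.seed(1)
toy <- expand.grid(month = 1:12, year = 2000:2001,
                   city = c("A", "B", "C"), stringsAsFactors = FALSE)
toy$sales <- 10 ^ (2 + 0.1 * sin(2 * pi * toy$month / 12) +
                   rnorm(nrow(toy), sd = 0.02))

# Data frame -> list: split by city and fit one seasonal model per piece.
models <- dlply(toy, "city", function(df) {
  lm(log10(sales) ~ factor(month), data = df)
})

# List -> data frame: reduce each model to a single summary number.
rsq <- function(mod) c(rsq = summary(mod)$r.squared)
quality <- ldply(models, rsq)

# One row per city, labelled by plyr from the split step.
print(quality)
```

The city labels in `quality` come from the split step, so no manual bookkeeping is needed between the two calls.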

