SlideShare une entreprise Scribd logo
1  sur  39
Télécharger pour lire hors ligne
Revolution Confidential




T he R is e in Multiple
  B irths in the U.S .:
  A n A nalys is of a
   Hundred-Million
B irth R ec ords with R
   S us an I. R anney, P h.D.

           J S M 2012
T he Tools                                        Revolution Confidential




 Open Source R
   Flexible, powerful, great graphics
   Great for prototyping, not very scalable
 RevoScaleR (with Revolution R Enterprise)
   Efficient file format (.xdf)
   Functions of accessing/importing external data sets
    (fixed format & delimited text, SPSS, SAS, ODBC)
   Very fast, parallelized, distributed analysis functions
    (summary stats, crosstabs/cubes, linear models,
    kmeans clustering, logistic regression, glm)

                                                                    2
T he Data                                        Revolution Confidential




 Public-use data sets containing information
  on all births in the United States for each
  year from 1985 to 2009 are available to
  download:
 http://www.cdc.gov/nchs/data_access/Vitalstatsonline.htm
 “These natality files are gigantic; they’re
  approximately 3.1 GB uncompressed.
  That’s a little larger than R can easily
  process” – Joseph Adler, R in a Nutshell

                                                                   3
T he U.S . B irth Data (c ontinued)                Revolution Confidential




 Data for each year are contained in a compressed,
  fixed-format, text files
 Typically 3 to 4 million records per file
 Variables and structure of the files sometimes change
  from year to year, with birth certificates revised in1989
  and 2003. Warnings:
  NOTE: THE RECORD LAYOUT OF THIS FILE HAS
      CHANGED SUBSTANTIALLY. USERS
      SHOULD READ THIS DOCUMENT
      CAREFULLY.
 Reporting can differ state-to-state for any given year

                                                                     4
T he P roc es s                               Revolution Confidential




      Basic “life cycle of analysis”
         Import and combine data
           25 years
           Over 100 million obs.
         Check and clean data
         Basic variable summaries
         Big data logistic/glm regressions
      All on my laptop
      Option to distribute computations
       to cluster
                                                                5
T he Ques tion            Revolution Confidential



CDC Report in Jan. 2012




                                            6
T he Ques tion                                  Revolution Confidential




   What accounts for the increase in multiple
   births in the United States?
     Can we separate out effects of mother’s and
     father’s ages, race/ethnicity?
     Examine time trends by sub-group, assumed
     to be associated with fertility treatment
     CDC finding: The older age of women at
     childbirth accounts for only about 1/3 of the rise
     in twinning over 30 years (but this mixes in the
     increased rate of “twinning” for older women)


                                                                  7
Revolution Confidential




Importing the U.S . B irth Data for
            Us e in R




                                                  8
E xample of Differenc es for Different Years  Revolution Confidential




To create common variables across years, use
  common names and new factor levels for ‘colInfo’ in
  rxImport function. For example:
 For 1985:
  SEX = list(type="factor", start=35, width=1,
     levels=c("1", "2"),
     newLevels = c("Male", "Female"),
     description = "Sex of Infant“)
 For 2003:
  SEX = list(type="factor", start=436, width=1,
      levels=c("M", "F"),
      newLevels = c("Male", "Female"),
      description = "Sex of Infant”)


                                                                9
C reating Trans formed Variables on Import  Revolution Confidential




Use standard R syntax to create transformed
  variables.
 For example, create a factor for Mom’s Age
  using the imported MAGER integer variable:
   MomAgeR7 = cut(MAGER,
     breaks =c(0, 19, 24, 29, 34, 39, 98, 99),
     labels = c("Under 20", "20-24", "25-29",
     "30-34", "35-39", "Over 39", "Missing"))
 Create binary variable for “IsMultiple”
   IsMultiple = DPLURAL_REC != 'Single'


                                                             10
S teps for C omplete Import                Revolution Confidential




 Lists for column information and transforms are
  created for 3 base years: 1985, 1989, 2003
  when there were very large changes in the
  structure of the input files
 Changes to these lists are made where
  appropriate for in-between years
 A test script is run, importing only 1000
  observations per year for a subset of years
 Full script is run, importing each year, sorting
  according to key demographic characteristics,
  and appending to a master .xdf file

                                                            11
Revolution Confidential




E xamining and C leaning the B ig
           Data F ile




                                               12
E xamining B as ic Information            Revolution Confidential




 Basic file information
>rxGetInfo(birthAll)
File name:
  C:RevolutionDataCDCBirthUS85to09S.xdf
  Number of observations: 100672041
  Number of variables: 50
  Number of blocks: 215
Use rxSummary to compute summary statistics
 for continuous data and category counts for
 each of the factors (about 4 minutes on my
 laptop)
rxSummary(~., data=birthAll, blocksPerRead = 10)


                                                           13
E xample of S ummary S tatis tic s     Revolution Confidential




MomAgeR7   Counts     DadAgeR8   Counts
Under 20   11918891   Under 20    3226602
20-24      25975642   20-24      15304803
25-29      28701398   25-29      23805056
30-34      22341530   30-34      23179418
35-39       9788753   35-39      13289015
Over 39     1945827   40-44       4984146
Missing           0   Over 44     2140207
                      Missing    14742794



                                                        14
His tograms by Year                     Revolution Confidential




Easily check for basic errors in data import
  (e.g. wrong position in file) by creating
  histograms by year – very fast (just seconds
  on my laptop)
 Example: Distribution of mother’s age by
  year. Use F() to have the integer year treated
  as a factor.
rxHistogram(~MAGER| F(DOB_YY),
  data=birthAll, blocksPerRead = 10,
  layout=c(5,5))

                                                         15
A ge of Mother Over Time   Revolution Confidential




                                            16
Drill Down and E xtrac t S ubs amples             Revolution Confidential




 Take a quick look at “older” fathers:   Dad’s
rxSummary(~F(UFAGECOMB),                  Age     Counts
  data=birthAll,                          80       141
  blocksPerRead = 10)                     81       108
                                          82        81
 What’s going on with 89-year old        83        74
  Dads? Extract a data frame:             84        56
dad89 <- rxDataStep(                      85        43
  inData = birthAll,                      86        43
  rowSelection = UFAGECOMB == 89,         87        26
  varsToKeep = c("DOB_YY", "MAGER",       88        27
  "MAR", "STATENAT", "FRACEREC"),         89      3327
  blocksPerRead = 10)

                                                                   17
Year and S tate for 89-Year-Old F athers
                                    Revolution Confidential




 rxCube(~F(DOB_YY):STATENAT,
   data=dad89, removeZeroCounts=TRUE)
       F_DOB_YY STATENAT   Counts
       1990 California     1
       1999 California     1
       2000 California     1
       1996 Hawaii         1
       1997 Louisiana      1
       1986 New Jersey     1
       1995 New Jersey     1
       1996 Ohio           1
       1989 Texas       3316
       1990 Texas          1
       2001 Texas          1
       1985 Washington     1
                                                     18
Revolution Confidential



Demographic s of Multiple B irths
         1985-2009




                                                19
Unit of Obs ervation: Delivery          Revolution Confidential




 Appropriate unit of observation is the
  delivery (resulting in 1 or more live births)
  rather than the individual birth.
 Use 1/Plurality as probability weight
 Alternative: Look at nearby records to
  compute the “Reported Delivery Birth Order”
  (RDPO), then select on only the 1st born in
  the delivery for the analysis


                                                         20
Revolution Confidential




                 21
Revolution Confidential




                 22
Revolution Confidential




                 23
Revolution Confidential




 Multiple B irths L ogis tic
R egres s ion & P redic tions




                                                 24
L ogis tic R egres s ion                                  Revolution Confidential



Logistic Regression Results for: IsMultiple ~ DadAgeR8 + MomAgeR7 +
   FRaceEthn + MRaceEthn +
   DadAgeR8:FRaceEthn:MNTHS_SINCE_JAN85 +
   MomAgeR7:MRaceEthn:MNTHS_SINCE_JAN85

  File name: C:RevolutionDataCDCBirthUS85to09R.xdf
  Probability weights: PluralWeight
  Dependent variable(s): IsMultiple
  Total independent variables: 118 (Including number dropped: 12)
  Number of valid observations: 100672041
  Number of missing observations: 0 -2*LogLikelihood: 14555447.8514
    (Residual deviance on 100671935 degrees of freedom)


About 6 minutes on my laptop



                                                                           25
C ounts of Deliveries by all Demographic
C ombinations by Year                               Revolution Confidential




rxCube(~DadAgeR8:MomAgeR7:FRaceEthn:MRaceEthn:F(DOB_YY),
   data = birthAllR, pweights = "PluralWeight",
   blocksPerRead = 10)

 Under 10 seconds to compute on my laptop
 Resulting data frame has 50,400 rows
  representing all the demographic
  combinations for each of the 25 years
 44,661 have counts <= 1000
 Provides input data for predictions and
  weights for aggregating predictions
                                                                     26
P redic t and A ggregate by Year             Revolution Confidential




predOut <- rxPredict(logitObj,
  data = catAllDF)
 Create predictions for each detailed
  demographic group
 Aggregate for each year using population
  percentages for each detailed group for each
  year
 Compare with actuals
 Perform “What if?” scenarios
   Scenario 1: Only change in demographic; no
    “fertility-treatment” time trand

                                                              27
Revolution Confidential




                 28
Drill Down: L ook at P redic tions for S pec ific
G roups                                   Revolution Confidential




 Select out predicted values from prediction
  data frame for detailed groups:
   Age of mother and father
   Race/ethnicity of mother and father




                                                           29
Revolution Confidential




                 30
Revolution Confidential




                 31
Revolution Confidential




C onc lus ions from L ogis tic R egres s ion     Revolution Confidential




 Dads matter: Use of fertility treatment
  associated with older dads as well as older
  Mom’s
 Hispanics show relatively small increase in
  multiples for both younger and older couples
 Asian pattern similar to whites, but lower for all
  age groups
 Black show similar increase in multiples for both
  younger and older couples
   Other Factors Related to Multiples? (Diet, high BMI,
    other genetic factors)
                                                                  32
Revolution Confidential




B ut How Many B abies ?                   Revolution Confidential




                                                           33
Revolution Confidential




G L M Tweedie Model: How Many B abies ?      Revolution Confidential




 When power parameter is between 1 and 2,
  Tweedie is a compound Poisson distribution
 Appropriate for data with positive data that
  also includes a ‘clump’ of exact zeros
 Dependent variable: Number of additional
  babies (Plurality – 1)
 Same independent variables



                                                              34
Revolution Confidential




                          Revolution Confidential




                                           35
Revolution Confidential




F urther R es earc h                               Revolution Confidential




 Data management
   Further cleaning of data
   Import more variables
   Import more years (additional use of weights)
 Multiple Births Analysis
   More variables (e.g., proxies for fertility treatment
    trends)
   Investigation of sub-groups (e.g., young blacks)
   Improved computation of number of births per
    delivery
 Other analyses with birth data
                                                                    36
Revolution Confidential




S ummary of A pproac h
                                               Revolution Confidential




 “Unpredictability” of multiple births requires
  large data set to have the power to capture
  effects
 Significant challenges in importing and cleaning
  the data – using R and .xdf files makes it
  possible
 Even with a huge data, “cells” of tables looking
  at multiple factors can be small
 Using combined.xdf file, we can use individual-
  level analysis to examine conditional effects of
  a variety of factors

                                                                37
Revolution Confidential




R eferenc es
                                                Revolution Confidential




 Martin JA, Hamilton BE, Osterman MJK. Three
  decades of twin births in the United States, 1980-
  2009. NCHS data brief, no. 80. Hyattsville, MD:
  National Center for Health Statistics. 2012.
 Blondel B, Kaminiski, M. Trends in the Occurrence,
  Determinants, and Consequences of Multiple
  Births. Seminars in Perinatology. 26(4):239-49,
  2002.
 Vahratian, Anjel. Utilization of Fertility-Related
  Services in the United States. Fertil Steril. 2008
  October; 90(4):1317-1319.
                                                                 38
Revolution Confidential




T hank you!
                                            Revolution Confidential




 R-Core Team
 R Package Developers
 R Community
 Revolution R Enterprise Customers and Beta
  Testers
 Colleagues at Revolution Analytics
Contact:
sue@revolutionanalytics.com
                                                             39

Contenu connexe

Similaire à The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

FormalWriteupTornado_1
FormalWriteupTornado_1FormalWriteupTornado_1
FormalWriteupTornado_1Katie Harvey
 
Open Analytics Environment
Open Analytics EnvironmentOpen Analytics Environment
Open Analytics EnvironmentIan Foster
 
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...Revolution Analytics
 
Data mining
Data miningData mining
Data miningScilab
 
Consistency, Availability, Partition: Make Your Choice
Consistency, Availability, Partition: Make Your ChoiceConsistency, Availability, Partition: Make Your Choice
Consistency, Availability, Partition: Make Your ChoiceAndrea Giuliano
 
Internship_Presentation
Internship_PresentationInternship_Presentation
Internship_PresentationSourabh Gujar
 
Data management Final Project
Data management Final ProjectData management Final Project
Data management Final ProjectRutuja Gangane
 
Stata time-series-fall-2011
Stata time-series-fall-2011Stata time-series-fall-2011
Stata time-series-fall-2011Feryanto W K
 
Final Report
Final ReportFinal Report
Final ReportBrian Wu
 
Accelerating Genomics SNPs Processing and Interpretation with Apache Spark
Accelerating Genomics SNPs Processing and Interpretation with Apache SparkAccelerating Genomics SNPs Processing and Interpretation with Apache Spark
Accelerating Genomics SNPs Processing and Interpretation with Apache SparkDatabricks
 
A Comparison Of Fitness Scallng Methods In Evolutionary Algorithms
A Comparison Of Fitness Scallng Methods In Evolutionary AlgorithmsA Comparison Of Fitness Scallng Methods In Evolutionary Algorithms
A Comparison Of Fitness Scallng Methods In Evolutionary AlgorithmsTracy Hill
 
A Genome Sequence Analysis System Built with Hypertable
A Genome Sequence Analysis System Built with HypertableA Genome Sequence Analysis System Built with Hypertable
A Genome Sequence Analysis System Built with HypertableDATAVERSITY
 
Logistic Regression in Case-Control Study
Logistic Regression in Case-Control StudyLogistic Regression in Case-Control Study
Logistic Regression in Case-Control StudySatish Gupta
 
Project 1FINA 415-15BGroup of 5.Due by 18092015..docx
Project 1FINA 415-15BGroup of 5.Due by 18092015..docxProject 1FINA 415-15BGroup of 5.Due by 18092015..docx
Project 1FINA 415-15BGroup of 5.Due by 18092015..docxwkyra78
 

Similaire à The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R (20)

FormalWriteupTornado_1
FormalWriteupTornado_1FormalWriteupTornado_1
FormalWriteupTornado_1
 
Bioinformatica 27-10-2011-t4-alignments
Bioinformatica 27-10-2011-t4-alignmentsBioinformatica 27-10-2011-t4-alignments
Bioinformatica 27-10-2011-t4-alignments
 
Bill howe 3_mapreduce
Bill howe 3_mapreduceBill howe 3_mapreduce
Bill howe 3_mapreduce
 
Open Analytics Environment
Open Analytics EnvironmentOpen Analytics Environment
Open Analytics Environment
 
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...
 
Sofi Presentation
Sofi PresentationSofi Presentation
Sofi Presentation
 
Power DataViz
Power DataVizPower DataViz
Power DataViz
 
Data mining
Data miningData mining
Data mining
 
Consistency, Availability, Partition: Make Your Choice
Consistency, Availability, Partition: Make Your ChoiceConsistency, Availability, Partition: Make Your Choice
Consistency, Availability, Partition: Make Your Choice
 
Internship_Presentation
Internship_PresentationInternship_Presentation
Internship_Presentation
 
Data management Final Project
Data management Final ProjectData management Final Project
Data management Final Project
 
Stata time-series-fall-2011
Stata time-series-fall-2011Stata time-series-fall-2011
Stata time-series-fall-2011
 
Final Report
Final ReportFinal Report
Final Report
 
Major project.pptx
Major project.pptxMajor project.pptx
Major project.pptx
 
Accelerating Genomics SNPs Processing and Interpretation with Apache Spark
Accelerating Genomics SNPs Processing and Interpretation with Apache SparkAccelerating Genomics SNPs Processing and Interpretation with Apache Spark
Accelerating Genomics SNPs Processing and Interpretation with Apache Spark
 
A Comparison Of Fitness Scallng Methods In Evolutionary Algorithms
A Comparison Of Fitness Scallng Methods In Evolutionary AlgorithmsA Comparison Of Fitness Scallng Methods In Evolutionary Algorithms
A Comparison Of Fitness Scallng Methods In Evolutionary Algorithms
 
A Genome Sequence Analysis System Built with Hypertable
A Genome Sequence Analysis System Built with HypertableA Genome Sequence Analysis System Built with Hypertable
A Genome Sequence Analysis System Built with Hypertable
 
Logistic Regression in Case-Control Study
Logistic Regression in Case-Control StudyLogistic Regression in Case-Control Study
Logistic Regression in Case-Control Study
 
Project 1FINA 415-15BGroup of 5.Due by 18092015..docx
Project 1FINA 415-15BGroup of 5.Due by 18092015..docxProject 1FINA 415-15BGroup of 5.Due by 18092015..docx
Project 1FINA 415-15BGroup of 5.Due by 18092015..docx
 
XL MINER: Associations
XL MINER: AssociationsXL MINER: Associations
XL MINER: Associations
 

Plus de Revolution Analytics

Speeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the CloudSpeeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the CloudRevolution Analytics
 
Migrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to AzureMigrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to AzureRevolution Analytics
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudRevolution Analytics
 
Predicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per SecondPredicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per SecondRevolution Analytics
 
The Value of Open Source Communities
The Value of Open Source CommunitiesThe Value of Open Source Communities
The Value of Open Source CommunitiesRevolution Analytics
 
Building a scalable data science platform with R
Building a scalable data science platform with RBuilding a scalable data science platform with R
Building a scalable data science platform with RRevolution Analytics
 
The Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data ScienceThe Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data ScienceRevolution Analytics
 
Taking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudTaking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudRevolution Analytics
 
The Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductorThe Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductorRevolution Analytics
 
The network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalThe network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalRevolution Analytics
 
Simple Reproducibility with the checkpoint package
Simple Reproducibilitywith the checkpoint packageSimple Reproducibilitywith the checkpoint package
Simple Reproducibility with the checkpoint packageRevolution Analytics
 

Plus de Revolution Analytics (20)

Speeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the CloudSpeeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the Cloud
 
Migrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to AzureMigrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to Azure
 
R in Minecraft
R in Minecraft R in Minecraft
R in Minecraft
 
The case for R for AI developers
The case for R for AI developersThe case for R for AI developers
The case for R for AI developers
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the Cloud
 
The R Ecosystem
The R EcosystemThe R Ecosystem
The R Ecosystem
 
R Then and Now
R Then and NowR Then and Now
R Then and Now
 
Predicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per SecondPredicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per Second
 
Reproducible Data Science with R
Reproducible Data Science with RReproducible Data Science with R
Reproducible Data Science with R
 
The Value of Open Source Communities
The Value of Open Source CommunitiesThe Value of Open Source Communities
The Value of Open Source Communities
 
The R Ecosystem
The R EcosystemThe R Ecosystem
The R Ecosystem
 
R at Microsoft (useR! 2016)
R at Microsoft (useR! 2016)R at Microsoft (useR! 2016)
R at Microsoft (useR! 2016)
 
Building a scalable data science platform with R
Building a scalable data science platform with RBuilding a scalable data science platform with R
Building a scalable data science platform with R
 
R at Microsoft
R at MicrosoftR at Microsoft
R at Microsoft
 
The Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data ScienceThe Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data Science
 
Taking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudTaking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the Cloud
 
The Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductorThe Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductor
 
The network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalThe network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 final
 
Simple Reproducibility with the checkpoint package
Simple Reproducibilitywith the checkpoint packageSimple Reproducibilitywith the checkpoint package
Simple Reproducibility with the checkpoint package
 
R at Microsoft
R at MicrosoftR at Microsoft
R at Microsoft
 

Dernier

🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 

Dernier (20)

🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 

The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

  • 1. Revolution Confidential T he R is e in Multiple B irths in the U.S .: A n A nalys is of a Hundred-Million B irth R ec ords with R S us an I. R anney, P h.D. J S M 2012
  • 2. T he Tools Revolution Confidential  Open Source R  Flexible, powerful, great graphics  Great for prototyping, not very scalable  RevoScaleR (with Revolution R Enterprise)  Efficient file format (.xdf)  Functions of accessing/importing external data sets (fixed format & delimited text, SPSS, SAS, ODBC)  Very fast, parallelized, distributed analysis functions (summary stats, crosstabs/cubes, linear models, kmeans clustering, logistic regression, glm) 2
  • 3. T he Data Revolution Confidential  Public-use data sets containing information on all births in the United States for each year from 1985 to 2009 are available to download: http://www.cdc.gov/nchs/data_access/Vitalstatsonline.htm  “These natality files are gigantic; they’re approximately 3.1 GB uncompressed. That’s a little larger than R can easily process” – Joseph Adler, R in a Nutshell 3
  • 4. T he U.S . B irth Data (c ontinued) Revolution Confidential  Data for each year are contained in a compressed, fixed-format, text files  Typically 3 to 4 million records per file  Variables and structure of the files sometimes change from year to year, with birth certificates revised in1989 and 2003. Warnings: NOTE: THE RECORD LAYOUT OF THIS FILE HAS CHANGED SUBSTANTIALLY. USERS SHOULD READ THIS DOCUMENT CAREFULLY.  Reporting can differ state-to-state for any given year 4
  • 5. T he P roc es s Revolution Confidential  Basic “life cycle of analysis”  Import and combine data  25 years  Over 100 million obs.  Check and clean data  Basic variable summaries  Big data logistic/glm regressions  All on my laptop  Option to distribute computations to cluster 5
  • 6. T he Ques tion Revolution Confidential CDC Report in Jan. 2012 6
  • 7. T he Ques tion Revolution Confidential What accounts for the increase in multiple births in the United States? Can we separate out effects of mother’s and father’s ages, race/ethnicity? Examine time trends by sub-group, assumed to be associated with fertility treatment CDC finding: The older age of women at childbirth accounts for only about 1/3 of the rise in twinning over 30 years (but this mixes in the increased rate of “twinning” for older women) 7
  • 8. Revolution Confidential Importing the U.S . B irth Data for Us e in R 8
  • 9. E xample of Differenc es for Different Years Revolution Confidential To create common variables across years, use common names and new factor levels for ‘colInfo’ in rxImport function. For example:  For 1985: SEX = list(type="factor", start=35, width=1, levels=c("1", "2"), newLevels = c("Male", "Female"), description = "Sex of Infant“)  For 2003: SEX = list(type="factor", start=436, width=1, levels=c("M", "F"), newLevels = c("Male", "Female"), description = "Sex of Infant”) 9
  • 10. C reating Trans formed Variables on Import Revolution Confidential Use standard R syntax to create transformed variables.  For example, create a factor for Mom’s Age using the imported MAGER integer variable: MomAgeR7 = cut(MAGER, breaks =c(0, 19, 24, 29, 34, 39, 98, 99), labels = c("Under 20", "20-24", "25-29", "30-34", "35-39", "Over 39", "Missing"))  Create binary variable for “IsMultiple” IsMultiple = DPLURAL_REC != 'Single' 10
  • 11. S teps for C omplete Import Revolution Confidential  Lists for column information and transforms are created for 3 base years: 1985, 1989, 2003 when there were very large changes in the structure of the input files  Changes to these lists are made where appropriate for in-between years  A test script is run, importing only 1000 observations per year for a subset of years  Full script is run, importing each year, sorting according to key demographic characteristics, and appending to a master .xdf file 11
  • 12. Revolution Confidential E xamining and C leaning the B ig Data F ile 12
  • 13. E xamining B as ic Information Revolution Confidential  Basic file information >rxGetInfo(birthAll) File name: C:RevolutionDataCDCBirthUS85to09S.xdf Number of observations: 100672041 Number of variables: 50 Number of blocks: 215 Use rxSummary to compute summary statistics for continuous data and category counts for each of the factors (about 4 minutes on my laptop) rxSummary(~., data=birthAll, blocksPerRead = 10) 13
  • 14. E xample of S ummary S tatis tic s Revolution Confidential MomAgeR7 Counts DadAgeR8 Counts Under 20 11918891 Under 20 3226602 20-24 25975642 20-24 15304803 25-29 28701398 25-29 23805056 30-34 22341530 30-34 23179418 35-39 9788753 35-39 13289015 Over 39 1945827 40-44 4984146 Missing 0 Over 44 2140207 Missing 14742794 14
  • 15. His tograms by Year Revolution Confidential Easily check for basic errors in data import (e.g. wrong position in file) by creating histograms by year – very fast (just seconds on my laptop)  Example: Distribution of mother’s age by year. Use F() to have the integer year treated as a factor. rxHistogram(~MAGER| F(DOB_YY), data=birthAll, blocksPerRead = 10, layout=c(5,5)) 15
  • 16. A ge of Mother Over Time Revolution Confidential 16
  • 17. Drill Down and E xtrac t S ubs amples Revolution Confidential  Take a quick look at “older” fathers: Dad’s rxSummary(~F(UFAGECOMB), Age Counts data=birthAll, 80 141 blocksPerRead = 10) 81 108 82 81  What’s going on with 89-year old 83 74 Dads? Extract a data frame: 84 56 dad89 <- rxDataStep( 85 43 inData = birthAll, 86 43 rowSelection = UFAGECOMB == 89, 87 26 varsToKeep = c("DOB_YY", "MAGER", 88 27 "MAR", "STATENAT", "FRACEREC"), 89 3327 blocksPerRead = 10) 17
  • 18. Year and S tate for 89-Year-Old F athers Revolution Confidential rxCube(~F(DOB_YY):STATENAT, data=dad89, removeZeroCounts=TRUE) F_DOB_YY STATENAT Counts 1990 California 1 1999 California 1 2000 California 1 1996 Hawaii 1 1997 Louisiana 1 1986 New Jersey 1 1995 New Jersey 1 1996 Ohio 1 1989 Texas 3316 1990 Texas 1 2001 Texas 1 1985 Washington 1 18
  • 19. Revolution Confidential Demographic s of Multiple B irths 1985-2009 19
  • 20. Unit of Obs ervation: Delivery Revolution Confidential  Appropriate unit of observation is the delivery (resulting in 1 or more live births) rather than the individual birth.  Use 1/Plurality as probability weight  Alternative: Look at nearby records to compute the “Reported Delivery Birth Order” (RDPO), then select on only the 1st born in the delivery for the analysis 20
  • 24. Revolution Confidential Multiple B irths L ogis tic R egres s ion & P redic tions 24
  • 25. L ogis tic R egres s ion Revolution Confidential Logistic Regression Results for: IsMultiple ~ DadAgeR8 + MomAgeR7 + FRaceEthn + MRaceEthn + DadAgeR8:FRaceEthn:MNTHS_SINCE_JAN85 + MomAgeR7:MRaceEthn:MNTHS_SINCE_JAN85 File name: C:RevolutionDataCDCBirthUS85to09R.xdf Probability weights: PluralWeight Dependent variable(s): IsMultiple Total independent variables: 118 (Including number dropped: 12) Number of valid observations: 100672041 Number of missing observations: 0 -2*LogLikelihood: 14555447.8514 (Residual deviance on 100671935 degrees of freedom) About 6 minutes on my laptop 25
  • 26. C ounts of Deliveries by all Demographic C ombinations by Year Revolution Confidential rxCube(~DadAgeR8:MomAgeR7:FRaceEthn:MRaceEthn:F(DOB_YY), data = birthAllR, pweights = "PluralWeight", blocksPerRead = 10)  Under 10 seconds to compute on my laptop  Resulting data frame has 50,400 rows representing all the demographic combinations for each of the 25 years  44,661 have counts <= 1000  Provides input data for predictions and weights for aggregating predictions 26
  • 27. P redic t and A ggregate by Year Revolution Confidential predOut <- rxPredict(logitObj, data = catAllDF)  Create predictions for each detailed demographic group  Aggregate for each year using population percentages for each detailed group for each year  Compare with actuals  Perform “What if?” scenarios  Scenario 1: Only change in demographic; no “fertility-treatment” time trand 27
  • 29. Drill Down: L ook at P redic tions for S pec ific G roups Revolution Confidential  Select out predicted values from prediction data frame for detailed groups:  Age of mother and father  Race/ethnicity of mother and father 29
  • 32. Revolution Confidential C onc lus ions from L ogis tic R egres s ion Revolution Confidential  Dads matter: Use of fertility treatment associated with older dads as well as older Mom’s  Hispanics show relatively small increase in multiples for both younger and older couples  Asian pattern similar to whites, but lower for all age groups  Black show similar increase in multiples for both younger and older couples  Other Factors Related to Multiples? (Diet, high BMI, other genetic factors) 32
  • 33. Revolution Confidential B ut How Many B abies ? Revolution Confidential 33
  • 34. Revolution Confidential G L M Tweedie Model: How Many B abies ? Revolution Confidential  When power parameter is between 1 and 2, Tweedie is a compound Poisson distribution  Appropriate for data with positive data that also includes a ‘clump’ of exact zeros  Dependent variable: Number of additional babies (Plurality – 1)  Same independent variables 34
  • 35. Revolution Confidential Revolution Confidential 35
  • 36. Revolution Confidential F urther R es earc h Revolution Confidential  Data management  Further cleaning of data  Import more variables  Import more years (additional use of weights)  Multiple Births Analysis  More variables (e.g., proxies for fertility treatment trends)  Investigation of sub-groups (e.g., young blacks)  Improved computation of number of births per delivery  Other analyses with birth data 36
  • 37. Revolution Confidential S ummary of A pproac h Revolution Confidential  “Unpredictability” of multiple births requires large data set to have the power to capture effects  Significant challenges in importing and cleaning the data – using R and .xdf files makes it possible  Even with a huge data, “cells” of tables looking at multiple factors can be small  Using combined.xdf file, we can use individual- level analysis to examine conditional effects of a variety of factors 37
  • 38. Revolution Confidential R eferenc es Revolution Confidential  Martin JA, Hamilton BE, Osterman MJK. Three decades of twin births in the United States, 1980- 2009. NCHS data brief, no. 80. Hyattsville, MD: National Center for Health Statistics. 2012.  Blondel B, Kaminiski, M. Trends in the Occurrence, Determinants, and Consequences of Multiple Births. Seminars in Perinatology. 26(4):239-49, 2002.  Vahratian, Anjel. Utilization of Fertility-Related Services in the United States. Fertil Steril. 2008 October; 90(4):1317-1319. 38
  • 39. Revolution Confidential T hank you! Revolution Confidential  R-Core Team  R Package Developers  R Community  Revolution R Enterprise Customers and Beta Testers  Colleagues at Revolution Analytics Contact: sue@revolutionanalytics.com 39