STAT 656 – Spring 2011

Shootout

Group #1 (Team SATURN)




                Storm Impacts on
                Infectious Disease
                   Propagation


Prepared By:

Charles Gordon, Abigail Green, Uday Hejmadi, Prabu
Krishnamurthy, Bhargava Lakkaraju, Deepthi Uppalapati,
Li Yun Zhang


May 9, 2011
Executive Summary:

Even in the 21st century, weather plays a large role in people’s individual lives as well as in the
community in which they live. Weather events can affect the propagation of infectious
diseases by causing people to spend more time indoors in close contact with each other. Given
data on the number of weekly health facility admissions for infectious-disease Diagnostic Related
Groups (DRGs) in various area codes over several years, along with storm and weather data for the
preceding weeks and the minimum and maximum incubation periods for all specified DRGs, a model
was built to find relationships between the predictor variables and
the number of admits. The analysis shows that some storms (Flood, Cold and Wind) can cause
a statistically significant increase in the number of admissions, while other storms (Winter and
Thunderstorms) do not play a decisive role. However, the storm and weather data have
meaningful interactions with Age Groups (which correspond to varying phases of biological
development), as well as with the area’s population and the different DRGs themselves. Using the
selected model, healthcare management can forecast variation in healthcare usage and plan
accordingly.

Appendix A details the necessary steps to import the attached SAS Enterprise Miner .xml
diagram and repeat the analysis.

Introduction / Problem Statement:

Many factors impact the spread of disease. In this project, we analyze the impact of storms on
the incidence of infectious disease. Weather is believed to have direct impacts such as injuries,
drowning, freezing, exhaustion and dehydration as well as indirect impacts when people’s
behaviors are changed. For example, weather patterns affect how people congregate and as a
result the storms affect the rate of propagation of an infection through a community. The
presumption is that the more time people spend indoors near each other, the more likely a
disease is to spread.

First, we will determine the types of storms (Wind, Thunderstorm, Flood, Winter storm, Cold)
that have a statistically significant impact on certain infectious diseases. Second, we will
propose a model that predicts the incidence of certain infectious diseases based on Diagnostic
Related Group (DRG), age-group, area code and week. The model could then be used to help
healthcare providers prepare for fluctuations in patient needs throughout the different weeks
of the year based on predicted patient numbers and diseases.

Data Preparation:

Several datasets were provided. An admits dataset provided the number of people admitted
for a specific infectious disease by area code, age-group, and week. The dataset was assumed
to be complete. The admits table (after changing the column title from “DRG_code” to
“DRG24” for compatibility) was joined to the DRG table, which provided the minimum and
maximum weeks required for the incubation of a given disease. In the description of several
DRGs there was a specified age-range for the DRG, so only populations in the described age-
range could be diagnosed with the DRG. The tables were joined based on DRG code. Then two
new columns were created, the minimum and maximum allowable storm weeks, so that
storm data could be joined to the admits data. The minimum allowable storm week was
calculated by subtracting the maximum incubation period from the week that people were
admitted, and the maximum allowable storm week was calculated by subtracting the minimum
incubation period.
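
For illustration, a DATA step along the following lines could derive the allowable storm-week
window. The incubation-period column names (MinIncubation and MaxIncubation) are assumed here,
since the DRG table’s exact column names are not reproduced in this report.

   data work.admits_window;
      set work.admits_drg;                 /* admits already joined to the DRG table by DRG24 */
      /* earliest week a storm could fall inside the incubation period */
      min_storm_week = week - MaxIncubation;
      /* latest week a storm could fall inside the incubation period */
      max_storm_week = week - MinIncubation;
   run;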

These datasets were then joined to the storm data, which provided the number of storms of a
given type (Wind, Thunderstorm, Flood, Winter, and Cold) that occurred in a given area code
and week. Storms were joined to admits by counting each type of storm that occurred during
the DRG-specific incubation period prior to the observed week.
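
A PROC SQL join of roughly the following form could perform that count. The storm-count column
names are those that appear later in the variable-importance table; the layout of the raw storm
table, the admits column name, and the working dataset names are assumptions.

   proc sql;
      create table work.admits_storms as
      select a.area_code, a.DRG24, a.age_group, a.week, a.admits,
             coalesce(sum(s.FloodStormCount),   0) as FloodStormCount,
             coalesce(sum(s.coldStormCount),    0) as coldStormCount,
             coalesce(sum(s.WindStormCount),    0) as WindStormCount,
             coalesce(sum(s.WinterStormCount),  0) as WinterStormCount,
             coalesce(sum(s.ThunderStormCount), 0) as ThunderStormCount
      from work.admits_window as a
           left join shootout.sas2011_storm as s
             on  a.area_code = s.area_code
             and s.week between a.min_storm_week and a.max_storm_week
      group by a.area_code, a.DRG24, a.age_group, a.week, a.admits;
   quit;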

The weather table was joined to the storm table to try to determine whether any storms might be
missing from the storm dataset. First, several of the area codes had multiple
observations in one week. These observations were very similar, so the mean of all the
observations was used to condense the multiple observations into one per week. The storm
and weather datasets were joined on area code and week. There were very few storm
observations with a count of zero, so great care would need to be taken when modeling
additional storms. The weather data was filtered to remove values of week > 418 as no
response data was available for the number of admits on weeks beginning with 419, and
missing values of the predictor variables were imputed with the Tree method for interval
variables. No indicator variables were created. The number of missing values ranged from 17
to 42 observations.
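
A sketch of the weekly averaging and filtering follows, assuming the weather column names implied
by the imputed-variable names in Figure 5 (the Tree imputation itself was done later in the
Enterprise Miner Impute node, not in code):

   proc means data=shootout.sas2011_weather (where=(week <= 418)) noprint nway;
      class area_code week;                /* one summary row per area code and week */
      var AvgHighT AvgLowT MaxHighT MinLowT prcp snow Daylight;
      output out=work.weather_weekly (drop=_type_ _freq_) mean=;
   run;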

The two datasets “sas2011_population” and “sas2011_population_2005” were appended to
create the entire population data set, and the “age” column was changed to “age_group” with
more clearly defined values that matched the descriptions in the admits table. The population
data was merged with the admits data (that already contains DRG and storm data) by area-
code, age-group and week.
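
A minimal sketch of the append and rename (the recoding of the age values to match the admits
descriptions is not shown):

   data work.population_all;
      set shootout.sas2011_population
          shootout.sas2011_population_2005;   /* stack the two population extracts */
      rename age = age_group;                 /* align the key name with the admits table */
   run;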

The calendar dataset was filtered to remove values of week > 418 and then merged to the
admits data set that was filtered to remove missing responses. It was merged by week, area-
code and age group. The calendar data provides the number of workdays and schooldays in a
given calendar week. The number of workdays and schooldays may affect a storm’s influence on the
spread of disease, since it is assumed that school and work increase human contact and thereby
encourage transmission. The calendar data at the time
of the storm impacts the potential for the spread of an infectious disease, and its impact may
vary by age group.
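
A hedged sketch of this final merge and filtering is shown below; the calendar fields are defined
per week, so the join is written on week alone, and the dataset and column names other than
Workdays and Schooldays are assumptions.

   proc sql;
      create table work.model_input as
      select a.*, c.Workdays, c.Schooldays
      from work.admits_pop as a               /* admits + DRG + storm + population data */
           left join shootout.calendar_data as c
             on a.week = c.week
      where a.week <= 418                     /* drop weeks with no response data */
        and a.admits is not null;             /* remove missing responses */
   quit;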

Data Exploration

This problem includes data on 23 separate diagnostic related group (DRG) codes, although
some of them are related. For example, there are three separate DRG codes for pneumonia
with the differences being with or without “CC” and 0-17 versus 17+. The minimum
incubation period for all DRGs is 1 week, with maximum incubation periods of 2-4 weeks.

The fictitious admits data has a separate record for each week number depending on the area
code, DRG code and age_group along with the response value of number of admits. Over 73%
of the records had 0 admits during that week with an average of 0.289. Only 0.4% of the
records had more than 2 admits during the week, with a maximum value of 14 that occurred on
two separate occasions. A quick glance at the data sorted by number of admits shows that the
same combination of predictors (Area_Code = URQ80YY, DRG24 = 89, and age_group = 65+) is
heavily concentrated at the high end of the number of admits.
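
These distribution figures can be reproduced with a quick frequency and summary pass; the column
name admits is an assumption:

   proc freq data=shootout.sas2011_admits;
      tables admits / nocum;                  /* share of records with 0, 1, 2, ... admits */
   run;

   proc means data=shootout.sas2011_admits n mean median max;
      var admits;
   run;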

The population data is heavily right-skewed: 90% of the age_group-specific populations in a
given area code are less than 5,000, although some groups do contain values of greater than
20,000. The median is just over 500 while the mean is over 1,500.

The storm dataset not only breaks down the types of storms in a given week for each area
code, but it also indicates how many storms occurred that week. Having multiple severe
storms, or even a larger storm system that contained more than one type of storm (e.g. Cold
and Flood in the same week), could play a large role in determining the number of admits with
an associated DRG. After viewing the weather data and seeing the range of Snow within a single
week, it is clear that some storms are major and others are minor relative to one another.

The weather data set gives more data including high and low average temperatures for the
week in addition to the lowest and highest temperature of the entire week. Precipitation and
Snow are contained in separate columns, and Snow must not be a subset of Precipitation as its
values can be greater than Precipitation. The hours of daylight are also given. An initial
hypothesis would be that more hours of daylight would lead to people venturing outdoors more
and contributing less to the spread of an infectious disease. Given the large range of
temperatures, the data not only spans entire years but also encompasses a large variety of
locations. The weather data is highly correlated due to each area_code having a separate
record for each week. Adjacent weeks are expected to be highly correlated as well as having a
seasonal factor throughout the year. Some data was missing, but it was very minimal.

The calendar data set captures the date of the Sunday of each week, and the number of
workdays and schooldays indicate how much interaction people in the community may be
experiencing that week depending on their age. The summer weeks are easily spotted by
having 0 schooldays.

The area code will not be used as a predictor variable. We may want to generalize over all area
codes and not just the ones given in this data set. Additionally, the information contained by
the area code’s location and climate zone should be captured by the storm and weather data
sets.
The score data set contains information on area_code, DRG24, age_group and week number.
The number of admits is blank and must be modeled. The weather and storm historical data
should be carried through to help make inferences on the score data set, and it will be merged
by the area_code and week.

Data Mining

Once all of the data sets were added to the SAS Enterprise Miner diagram, with the necessary
coding to append and merge them along with some filtering and imputation, a 20% sample of the
data was taken to build the model against. This still included
hundreds of thousands of observations but allowed for quicker analytical processing time.
Additionally, the predictors did not have any categorical variables that occurred very
infrequently, so not too much information was lost. The sample was stratified using
proportional criteria to ensure that a representative sample was chosen. This was important
because some of the response variable levels with a high number of admits occurred only
infrequently. A minimum strata size of five was applied. The model was built, validated and
tested against this sample, but the rules of the ultimately chosen model are applied in
whole to the scoring data set. Figure 1 shows the Enterprise Miner Process Flow Diagram. The
data mining begins after the final data merge.
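
Outside Enterprise Miner, a comparable stratified proportional sample could be drawn as sketched
below; note that the Sample node’s minimum stratum size of five is not reproduced here, and the
input dataset name is a placeholder.

   proc surveyselect data=work.model_input out=work.sample_20pct
                     method=srs samprate=0.20 seed=12345;
      strata admits;         /* proportional allocation: 20% drawn within each admit level */
   run;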

Figure 1: Enterprise Miner Process Flow Diagram
Using the sample of the merged data sets, the data was partitioned using a Data Partition node
into 60% Training, 20% Validation and 20% Test. Using the default of the node, the partitions
were stratified on the Response variable of number of admits. While each data set will have
different records, this stratification will keep them as close as randomly possible. The default
random seed number 12345 was used in this node as well as all preceding and following nodes.
The subsequent models will be built on the Training data set per the rules specified in those
models. The Validation data set will be used to find the optimal number of steps or iterations in
the model based on the chosen criteria. This keeps the model from overfitting the data. If only
the Training data set was used, each additional rule of a model could increase its apparent fit to
the data, but this model and all of its rules may not be applicable to future data sets. Applying
this model is a large part of the expected outcome of this exercise. The Test data set is used
separately as it is not involved in either the building of the model or the selection of the best
model. The Test data set gives additional independent records for an unbiased measurement
of the results of the model.
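
For reference, a plain 60/20/20 random split can be sketched as below; the Enterprise Miner Data
Partition node additionally stratifies the split on the response, which this simple DATA step
does not do.

   data work.train work.validate work.test;
      set work.sample_20pct;
      _u = ranuni(12345);                     /* same default seed as the diagram */
      if _u < 0.60 then output work.train;
      else if _u < 0.80 then output work.validate;
      else output work.test;
      drop _u;
   run;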

The output from the Data Partition node was fed into model nodes including a Decision Tree
analysis using Misclassification criteria, a Decision Tree analysis using Average Square Error and
a Gradient Boosting model.

The first Decision Tree selected its final model based on the Misclassification rate of the
Validation data set. The decision tree in general determines its model not based on a
mathematical equation but instead on a set of splitting rules that determines the most likely
outcome of a record given all of the available predictors. First all of the data is grouped
together and a rule is developed that splits the group into two sub-groups such that the
difference between the groups is maximized. The maximization occurs by ultimately choosing
certain values of one variable and putting them onto one side of the tree while placing the
remaining observations on the other side of the tree after the calculations are performed on all
variables and split locations. This process is then repeated for each sub-tree on the Training
data set until one of the stopping rules is reached. Our response variable was number of
admits and it was treated as a continuous variable. Therefore, the exact number of admits had
to be modeled for the record to be properly identified, although only integers greater than or
equal to 0 were valid choices. Because the response variable was interval, ProbF was used as
the interval splitting rule criterion. The decision tree is able to handle missing values without
eliminating the record by placing them all in the left side of the branch and modifying its
splitting rules appropriately. The maximum branch size was 2 and the maximum branch depth
was 6; these are both defaults of the Decision Tree node.
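
For reference, the ProbF criterion scores each candidate split of an interval target with a
one-way ANOVA F test. In a standard formulation (not quoted from the SAS documentation), a split
of a node with n cases into branches B_1, ..., B_k is scored as

   \[
   F \;=\; \frac{\sum_{j=1}^{k} n_j(\bar y_j - \bar y)^2 / (k-1)}
                {\sum_{j=1}^{k}\sum_{i \in B_j} (y_i - \bar y_j)^2 / (n-k)},
   \qquad \text{logworth} = -\log_{10} p(F),
   \]

and the split with the largest logworth (smallest p-value) is preferred; with binary splits, k = 2.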

The second Decision Tree selected its final model based on the Average Square Error of the
Validation data set. This assessment criterion seems more appropriate because the larger problem
is concerned both with determining which storms have the greatest impact on the number of admits
and with the model’s ability to make predictions. Choosing the exact number of admits for a given
week is less important than making the most accurate predictions available. The final model
will be selected using Average Square Error as the assessment criterion, but a Misclassification
tree was also included to see how it compares. Because the response variable was interval,
ProbF was used as the interval splitting rule criterion. The decision tree is able to handle
missing values without eliminating the record by placing them all in the left side of the branch
and modifying its splitting rules appropriately. The maximum branch size was 2 and the
maximum branch depth was 6; these are both defaults of the Decision Tree node.

The Gradient Boosting node uses tree boosting to create a series of decision trees that together
form a single predictive model. A tree in the series is fit to the residual of the prediction from
the earlier trees in the series where the residual is defined in terms of the derivative of a loss
function. Boosting is a classification technique whereby the estimated probabilities are
adjusted by weight estimates, and the weight estimates are increased when the previous model
misclassified the response. The Gradient Boosting model in this diagram uses 50 iterations with
a Shrinkage value of 0.10 to scale down each tree’s contribution and a Training proportion of
60% where a different training sample is taken for each iteration. The other defaults of the
Gradient Boosting node were kept with the assessment measure being Average Square Error on
the Validation data set.
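
In symbols, using a standard squared-error formulation of gradient tree boosting (the node’s
exact loss may differ), each iteration fits a small tree h_m to the current residuals and adds a
shrunken copy of it to the ensemble:

   \[
   F_0(x) = \bar y, \qquad
   r_i^{(m)} = y_i - F_{m-1}(x_i), \qquad
   F_m(x) = F_{m-1}(x) + \nu\, h_m(x), \quad m = 1, \dots, M,
   \]

with shrinkage \nu = 0.10 and M = 50 iterations here, each tree trained on a fresh 60%
subsample; the final prediction is F_M(x).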

Results:

The Misclassification decision tree had nearly 500,000 degrees of freedom. The inputs included
18 interval variables and 2 nominal variables. The final selected model had only four terminal
leaves meaning there were only three splits. The first split was on Flood Storm count and the
second split was on Cold Storm count. The third and final split was on Highest Temperature of
the Week, with temperatures above 52.5 associated with higher admit rates (and the group size
of the leaf equal to the minimum value of 5). Further iterations improved the misclassification
rate of the training data set, but the misclassification rate of the validation data set reached its
lowest value at four iterations and remained at the same rate with further iterations.
Therefore, the simplest model (i.e. the one with the fewest iterations) with the best
misclassification rate was chosen as the final model. The cumulative lift chart for this model is
shown in Figure 2 below. Considering the top 10% of ordered cases, a cumulative lift of 10
indicates that the cases the model ranks in that top decile average about 10 times the overall
mean number of admits.

Figure 2: Cumulative Lift Chart for Misclassification Decision Tree
The Average Square Error decision tree had an identical set-up to the Misclassification decision
tree, except it had a different model-selecting criterion. The average square error was slightly
lower than that of the Misclassification decision tree and therefore preferable, although the
two trees had the
same misclassification rate on the validation data set. The final selected model had nine
terminal leaves. Figure 3 shows that the best model was chosen after 9 iterations, and the
decision would have been different if using misclassification rate. Figure 4 shows the Lift Chart;
the entire lift benefit is achieved by the 10th percentile on the Training data set. The cumulative
lift chart is nearly identical to that shown for the Misclassification decision tree, with a
cumulative lift value of 10 at the 10th percentile.

The first two splits in the Average Square Error decision tree matched those for the
Misclassification decision tree: Flood Storm count and then Cold Storm count. For the areas
with a higher flood storm count, larger populations above 3048.5 were much more likely to
have a larger number of admits. Highest temperature of the week was again used as a splitting
variable before Week Number was split on twice. Including the week number in the model may
not contribute much to future predictions, but it could indicate trends as to whether a certain
area is becoming more or less prone to the transfer of infectious diseases. Further down the
tree model, additional rules were again made on splitting on the Flood Storm and Cold Storm
data. These particular storms seem especially correlated with the number of admits in
subsequent weeks.
Figure 3: Model Iteration Plots for Average Square Error Decision Tree




Figure 4: Lift Chart for Average Square Error Decision Tree




The final model selected from the Gradient Boosting node was chosen based on average square
error of the Validation data set as opposed to the default profit criterion. This model
performed very similarly to the two decision tree models with a nearly identical cumulative lift
chart, but the predictor variables were considerably different. The first and third most
important variables were the DRG code and the Age Group, respectively. This indicates that
there is much more involved than just the type of storm. There are strong interaction effects
that determine the number of admits. Additionally, the only storm count that incurred any
splitting was Wind Storm.

Figure 5: Variable Importance for Gradient Boosting model

Variable Name        Variable Label                                   Splitting   Importance   Validation   Ratio of Valid.   Interaction
                                                                      Rules                    Importance   to Train. Imp.    Importance
DRG24                DRG24                                                  317   1            1            1                 0.05705945
WindStormCount                                                              620   0.90436482   0.68292926   0.755147979       0.05235151
age_group                                                                   150   0.54379129   0.31783064   0.584471741       0.03751741
_NODEID_                                                                     39   0.23469854   0.24760786   1.055003845       0.00276443
week                 week                                                   374   0.43223949   0.20742442   0.479883087       0.00319923
Schooldays           Schooldays                                              72   0.30115671   0.18227595   0.605252845       0.01177374
IMP_MinLowT          Imputed: Lowest temperature of the week                 47   0.34987899   0.16313048   0.466248272       0.00187398
IMP_AvgLowT          Imputed: Average low temperature of the week            13   0.10240737   0.05075145   0.495583996       NaN
IMP_MaxHighT         Imputed: Highest temperature of the week                72   0.15980206   0.04996011   0.312637462       NaN
IMP_week             Imputed: week                                           79   0.18864154   0.01674544   0.088768586       NaN
IMP_Daylight         Imputed: Daylight of the week                            0   0            0            NaN               NaN
Workdays             Workdays                                                 0   0            0            NaN               NaN
coldStormCount                                                                0   0            0            NaN               NaN
ThunderStormCount                                                             0   0            0            NaN               NaN
IMP_AvgHighT         Imputed: Average high temperature of the week            0   0            0            NaN               NaN
WinterStormCount                                                              0   0            0            NaN               NaN
IMP_snow             Imputed: Inches of snow                                  0   0            0            NaN               NaN
FloodStormCount                                                               0   0            0            NaN               NaN
population                                                                    0   0            0            NaN               NaN
IMP_prcp             Imputed: Precipitation                                   0   0            0            NaN               NaN


The model comparison node selected the Gradient Boosting model as it had the lowest
validation average square error value of 0.0246. The Average Square Error Decision Tree was
second with a validation average square error value of 0.0249, and the Misclassification
Decision Tree was not far behind at 0.0250. Timeout errors were received when trying to fit
regression and neural node models; the root cause is unknown.
Conclusions:

After meticulous merging of the given data set with all available weather, storm, population,
calendar and DRG parameters, the developed models achieved a low average square error
value of less than 0.0250 using the number of admits as the response variable. The
misclassification rate of all of the models was primarily based on the prevalence of the weeks
with admits equal to 0; however, the low average square error of the selected model lends
some confidence to making predictions.

The Gradient Boosting model was chosen over the two Decision Tree models, although they all
had similar response characteristics. The Wind, Flood and Cold storms were involved in the
models, but Thunderstorm and Winter storms were not. Storms that affect people’s behavior
and keep them indoors in close proximity to others are more likely to contribute to the spread
of infectious diseases. In that light, wind storms can knock out electricity and cause severe
damage, impairing travel and normal school and business functions. Floods and cold storms
can likewise be major disruptions. On the other hand, people are accustomed to
thunderstorms and winter storms and have methods of dealing with them while not greatly
altering their normal lifestyle.

The DRG code as well as the Age Group played significant roles in the chosen model. The
interactions between these variables and the weather data are evident; certain events will
affect some groups and not others. Additionally, some of the DRGs only apply to certain Age
Groups, so the relationship is not surprising. Respiratory illnesses, including bronchitis,
pneumonia and respiratory infections, were most positively correlated with an increase in cold
storms, while viral illnesses and fevers actually had a negative correlation. For
floods, viral illness and fever as well as otitis media and URI had a positive correlation with an
increased storm incidence. These relationships are built into the model, so it can be used both
for prediction and for scenario analysis if planners want to see what their admission needs
might be under certain conditions.
Appendix A: SAS Enterprise Miner 6.2 .XML Diagram Instructions

In order to run the attached “Shootout_Team_SATURN.xml” file, open Enterprise Miner and
create a new project. In the Project Start Code, add a libname titled “shootout” with a path to
a directory containing the SAS2011_STORM dataset. Add the following 7 data sources to the
project: CALENDAR_DATA (created from Calendar.xls), DRG_LIST_DATA (created from
DRG_list.xls), SAS2011_ADMITS, SAS2011_POPULATION, SAS2011_POPULATION_2005,
SAS2011_WEATHER, SCORE_DATA. All of the datasets were provided by the SAS Shootout 2011
project. Import the Shootout_Team_SATURN.xml diagram and run all nodes.
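
The Project Start Code needs only a single libname statement; the path below is a placeholder
for the local directory that holds the provided datasets:

   libname shootout "C:\path\to\shootout_data";   /* directory containing SAS2011_STORM and the other files */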
