SAS Shootout Team Report
STAT 656 – Spring 2011
Shootout
Group #1 (Team SATURN)
Storm Impacts on
Infectious Disease
Propagation
Prepared By:
Charles Gordon, Abigail Green, Uday Hejmadi, Prabu
Krishnamurthy, Bhargava Lakkaraju, Deepthi Uppalapati,
Li Yun Zhang
May 9, 2011
Executive Summary:
Even in the 21st century, weather plays a large role in people’s individual lives as well as in the
community in which they live. Weather events can affect the propagation of infectious
diseases by causing people to spend more time indoors in close contact with each other. Given
data on the number of health facility admissions by Diagnostic Related Group (DRG) infectious
diseases each week in various area codes for several years along with storm and weather data
for the preceding weeks, while factoring in the minimum and maximum incubation periods for
all specified DRGs, a model was built to find relationships between the predictor variables and
the number of admits. The analysis shows that some storm types (Flood, Cold, and Wind) can
cause a statistically significant increase in the number of admissions, while others (Winter and
Thunderstorm) do not play a decisive role. However, the storm and weather data have
meaningful interactions with Age Groups, which correspond to varying phases of biological
development, as well as with the area's population and the different DRGs themselves. Using the
selected model, healthcare management can forecast variation in healthcare usage and plan
accordingly.
Appendix A details the necessary steps to import the attached SAS Enterprise Miner .xml
diagram and repeat the analysis.
Introduction / Problem Statement:
Many factors impact the spread of disease. In this project, we analyze the impact of storms on
the incidence of infectious disease. Weather is believed to have direct impacts such as injuries,
drowning, freezing, exhaustion and dehydration as well as indirect impacts when people’s
behaviors are changed. For example, weather patterns affect how people congregate and as a
result the storms affect the rate of propagation of an infection through a community. The
presumption is that the more time people spend indoors near each other, the more likely a
disease is to spread.
First, we will determine the types of storms (Wind, Thunderstorm, Flood, Winter storm, Cold)
that have a statistically significant impact on certain infectious diseases. Second, we will
propose a model that predicts the incidence of certain infectious diseases based on Diagnostic
Related Group (DRG), age-group, area code and week. The model could then be used to help
healthcare providers prepare for fluctuations in patient needs throughout the different weeks
of the year based on predicted patient numbers and diseases.
Data Preparation:
Several datasets were provided. An admits dataset provided the number of people admitted
for a specific infectious disease by area code, age-group, and week. The dataset was assumed
to be complete. The admits table (after changing the column title from “DRG_code” to
“DRG24” for compatibility) was joined to the DRG table, which provided the minimum and
maximum weeks required for the incubation of a given disease. The description of several
DRGs specified an age range, so only populations in that age range could be diagnosed with
the DRG. The tables were joined on DRG code. Then two new columns were created, minimum
and maximum allowable storm week, so that storm data could be joined to the admits data.
The minimum allowable storm week was
calculated by subtracting the maximum incubation period from the week that people were
admitted, and the maximum allowable storm week was calculated by subtracting the minimum
incubation period.
These datasets were then joined to the storm data, which provided the number of storms of a
given type (Wind, Thunderstorm, Flood, Winter, and Cold) that occurred in a given area code
and week. Storms were joined to admits by counting each type of storm that occurred during
the DRG-specific incubation period prior to the observed week.
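The window-based join was implemented in SAS Enterprise Miner; purely as an illustration, the same counting logic can be sketched in pandas. The column names (`min_storm_week`, `max_storm_week`) and all values here are hypothetical:

```python
import pandas as pd

# Hypothetical admits rows: admit week plus the DRG-specific storm window
admits = pd.DataFrame({
    "area_code": ["A", "A"],
    "week": [10, 12],
    "min_storm_week": [6, 8],   # admit week minus maximum incubation
    "max_storm_week": [9, 11],  # admit week minus minimum incubation
})

# Hypothetical storm counts by area code and week
storms = pd.DataFrame({
    "area_code": ["A", "A", "A"],
    "week": [7, 9, 11],
    "FloodStormCount": [1, 2, 1],
})

def storms_in_window(row):
    """Sum storm counts that fall inside this admit's allowable window."""
    mask = (
        (storms["area_code"] == row["area_code"])
        & storms["week"].between(row["min_storm_week"], row["max_storm_week"])
    )
    return storms.loc[mask, "FloodStormCount"].sum()

admits["FloodStormCount"] = admits.apply(storms_in_window, axis=1)
```

Each admit record ends up carrying the number of storms that occurred during its own DRG-specific incubation window.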
The weather table was joined to the storm table to determine whether any storms might be
missing from the storm dataset. Several of the area codes had multiple observations in one
week; these observations were very similar, so their mean was used to condense them into one
observation per week. The storm and weather datasets were joined on area code and week.
There were very few storm observations with a count of zero, so great care would need to be
taken when modeling
additional storms. The weather data was filtered to remove values of week > 418, as no
response data on the number of admits was available from week 419 onward, and missing
values of the predictor variables were imputed with the Tree method for interval
variables. No indicator variables were created. The number of missing values ranged from 17
to 42 observations.
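The condensing and filtering steps can be sketched in pandas (hypothetical column names and values; the actual work was done in SAS Enterprise Miner):

```python
import pandas as pd

# Hypothetical weather rows: one area code with duplicate observations in week 5
weather = pd.DataFrame({
    "area_code": ["A", "A", "A", "A"],
    "week": [5, 5, 6, 419],
    "AvgHighT": [40.0, 42.0, 38.0, 55.0],
})

# Condense duplicate weekly observations to their mean, one row per area code and week
weekly = weather.groupby(["area_code", "week"], as_index=False).mean()

# Drop weeks past 418, for which no admits response exists
weekly = weekly[weekly["week"] <= 418]
```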
The two datasets “sas2011_population” and “sas2011_population_2005” were appended to
create the entire population data set, and the “age” column was changed to “age_group” with
more clearly defined values that matched the descriptions in the admits table. The population
data was merged with the admits data (that already contains DRG and storm data) by area-
code, age-group and week.
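A minimal pandas sketch of the append-and-merge step, again with made-up values:

```python
import pandas as pd

# Hypothetical fragments of the two population tables
pop_a = pd.DataFrame({"area_code": ["A"], "age_group": ["0-17"], "week": [5], "population": [1200]})
pop_b = pd.DataFrame({"area_code": ["A"], "age_group": ["65+"], "week": [5], "population": [800]})

# Append the two population tables into one
population = pd.concat([pop_a, pop_b], ignore_index=True)

# Merge population onto the admits data by area code, age group, and week
admits = pd.DataFrame({"area_code": ["A"], "age_group": ["65+"], "week": [5], "admits": [2]})
merged = admits.merge(population, on=["area_code", "age_group", "week"], how="left")
```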
The calendar dataset was filtered to remove values of week > 418 and then merged to the
admits data set that was filtered to remove missing responses. It was merged by week, area-
code and age group. The calendar data provides the number of workdays and schooldays in a
given calendar week. The number of workdays and schooldays may impact the ability of a
storm to impact the spread of disease because it is assumed that there may be more human
contact at school and work to encourage the spread of disease. The calendar data at the time
of the storm impacts the potential for the spread of an infectious disease, and its impact may
vary by age group.
Data Exploration
This problem includes data on 23 separate diagnostic related group (DRG) codes, although
some of them are related. For example, there are three separate DRG codes for pneumonia,
with the differences being with or without "CC" and also 0-17 versus 17+. The minimum
incubation period for all DRGs is 1 week, with a maximum incubation period of 2-4 weeks.
The fictitious admits data has a separate record for each week number depending on the area
code, DRG code and age_group along with the response value of number of admits. Over 73%
of the records had 0 admits during that week with an average of 0.289. Only 0.4% of the
records had more than 2 admits during the week, with a maximum value of 14 that occurred on
two separate occasions. With a quick glance at the data sorted by number of admits, the same
predictors of Area_Code = URQ80YY, DRG24 = 89, and age_group = 65+ are heavily
concentrated at the high range of number of admits.
The population data is heavily right-skewed: 90% of the age_group-specific populations in a
given area code are less than 5,000, although some groups do contain values greater than
20,000. The median is just over 500 while the mean is over 1,500.
The storm dataset not only breaks down the types of storms in a given week for each area
code, but it also indicates how many storms occurred that week. Having multiple severe
storms, or even a larger storm system that contained more than one type of storm (e.g. Cold
and Flood in the same week), could play a large role in determining the number of admits with
an associated DRG. It is clear that some storms are major and others are minor relative to one
another after viewing the weather data and seeing the range of Snow in one week.
The weather data set gives more data including high and low average temperatures for the
week in addition to the lowest and highest temperature of the entire week. Precipitation and
Snow are contained in separate columns, and Snow must not be a subset of Precipitation as its
values can be greater than Precipitation. The hours of daylight are also given. An initial
hypothesis would be that more hours of sunlight would lead to people venturing outdoors
more, contributing less to the spread of an infectious disease. Given the large range of
temperatures, the data not only spans entire years but also encompasses a large variety of
locations. The weather data is highly correlated due to each area_code having a separate
record for each week. Adjacent weeks are expected to be highly correlated as well as having a
seasonal factor throughout the year. Some data was missing, but it was very minimal.
The calendar data set captures the date of the Sunday of each week, and the number of
workdays and schooldays indicate how much interaction people in the community may be
experiencing that week depending on their age. The summer weeks are easily spotted by
having 0 schooldays.
The area code will not be used as a predictor variable. We may want to generalize over all area
codes and not just the ones given in this data set. Additionally, the information contained by
the area code’s location and climate zone should be captured by the storm and weather data
sets.
The score data set contains information on area_code, DRG24, age_group and week number.
The number of admits is blank and must be modeled. The weather and storm historical data
should be carried through to help make inferences on the score data set, and it will be merged
by the area_code and week.
Data Mining
Once all of the data sets were properly added to the SAS Enterprise Miner diagram with the
necessary coding to append and merge where necessary along with some filtering and
imputation, a 20% sample of the data was taken to build the model against. This still included
hundreds of thousands of observations but allowed for quicker analytical processing time.
Additionally, the predictors did not have any categorical variables that occurred very
infrequently, so not too much information was lost. The sample was stratified using
proportional criteria to ensure that a representative sample was chosen. This was important
due to the fact that some of the response variable levels with a high number of admits did
occur infrequently. A minimum strata size of five was applied. The model was built, validated,
and tested against this sample, but the rules of the ultimately chosen model are applied in
whole to the scoring data set. Figure 1 shows the Enterprise Miner Process Flow Diagram. The
data mining begins after the final data merge.
Figure 1: Enterprise Miner Process Flow Diagram
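A rough illustration of proportional stratified sampling with a minimum stratum size, using pandas with made-up data (the actual sampling was done by the Enterprise Miner Sample node, whose exact algorithm may differ):

```python
import pandas as pd

# Hypothetical modeling table with a rare high-admits stratum
df = pd.DataFrame({"admits": [0] * 80 + [1] * 15 + [5] * 5})

MIN_STRATA = 5   # minimum stratum size, as in the Sample node settings
FRACTION = 0.20  # 20% proportional sample

def take(stratum):
    # Proportional allocation, but never fewer than MIN_STRATA rows per stratum
    n = max(MIN_STRATA, int(round(len(stratum) * FRACTION)))
    return stratum.sample(n=min(n, len(stratum)), random_state=12345)

sample = df.groupby("admits", group_keys=False).apply(take)
```

The minimum stratum size keeps the rare high-admits records represented in the sample even though a pure 20% draw would include only one or two of them.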
Using the sample of the merged data sets, the data was partitioned using a Data Partition node
into 60% Training, 20% Validation and 20% Test. Using the default of the node, the partitions
were stratified on the Response variable of number of admits. While each partition will have
different records, this stratification keeps their response distributions as similar as random
assignment allows. The default random seed of 12345 was used in this node as well as in all
preceding and following nodes.
The subsequent models will be built on the Training data set per the rules specified in those
models. The Validation data set will be used to find the optimal number of steps or iterations in
the model based on the chosen criteria. This keeps the model from overfitting the data. If only
the Training data set was used, each additional rule of a model could increase its apparent fit to
the data, but this model and all of its rules may not be applicable to future data sets. Applying
this model is a large part of the expected outcome of this exercise. The Test data set is used
separately as it is not involved in either the building of the model or the selection of the best
model. The Test data set gives additional independent records for an unbiased measurement
of the results of the model.
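The 60/20/20 stratified partition can be sketched with scikit-learn as a stand-in for the Data Partition node (hypothetical data; two chained splits approximate the three-way partition):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical modeling table with the response levels to stratify on
df = pd.DataFrame({"x": range(100), "admits": [0] * 70 + [1] * 20 + [2] * 10})

# First carve off 60% Training, stratified on the response
train, rest = train_test_split(
    df, train_size=0.60, stratify=df["admits"], random_state=12345
)
# Split the remainder evenly into 20% Validation and 20% Test
valid, test = train_test_split(
    rest, train_size=0.50, stratify=rest["admits"], random_state=12345
)
```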
The output from the Data Partition node was fed into model nodes including a Decision Tree
analysis using Misclassification criteria, a Decision Tree analysis using Average Square Error and
a Gradient Boosting model.
The first Decision Tree selected its final model based on the Misclassification rate of the
Validation data set. The decision tree in general determines its model not based on a
mathematical equation but instead on a set of splitting rules that determines the most likely
outcome of a record given all of the available predictors. First all of the data is grouped
together and a rule is developed that splits the group into two sub-groups such that the
difference between the groups is maximized. The maximization occurs by ultimately choosing
certain values of one variable and putting them onto one side of the tree while placing the
remaining observations on the other side of the tree after the calculations are performed on all
variables and split locations. This process is then repeated for each sub-tree on the Training
data set until one of the stopping rules is reached. Our response variable was number of
admits and it was treated as a continuous variable. Therefore, the exact number of admits had
to be modeled for the record to be properly identified, although only integers greater than or
equal to 0 were valid choices. Because the response variable was interval, ProbF was used as
the interval splitting rule criterion. The decision tree is able to handle missing values without
eliminating the record by placing them all in the left side of the branch and modifying its
splitting rules appropriately. The maximum branch size was 2 and the maximum branch depth
was 6; these are both defaults of the Decision Tree node.
The second Decision Tree selected its final model based on the Average Square Error of the
Validation data set. This assessment criterion seems more appropriate as the larger problem is
interested in determining which storms have the greatest impact on the number of admits with
the ability of the model to make predictions. Choosing the exact number of admits for a given
week is less important than making the most accurate predictions available. The final model
will be selected using Average Square Error as the assessment criterion, but a Misclassification
tree was also included to see how it compares. As with the first tree, ProbF was used as the
interval splitting rule criterion, missing values were handled without eliminating any records,
and the default maximum branch size of 2 and maximum branch depth of 6 were kept.
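As an illustration of fitting a binary regression tree of depth 6 to an interval target, here is a scikit-learn sketch with simulated data (scikit-learn splits on squared error rather than EM's ProbF criterion, so this is an analogy, not a reproduction of the node):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(12345)
# Hypothetical predictors: two storm counts, each 0-3 per week
X = rng.integers(0, 4, size=(500, 2)).astype(float)
# Simulated admits rise with both storm counts, plus a little noise
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.1, 500)

# Binary splits to depth 6 mirror the Decision Tree node defaults
# (maximum branch size 2, maximum depth 6)
tree = DecisionTreeRegressor(max_depth=6, random_state=12345).fit(X, y)
```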
The Gradient Boosting node uses tree boosting to create a series of decision trees that together
form a single predictive model. A tree in the series is fit to the residual of the prediction from
the earlier trees in the series where the residual is defined in terms of the derivative of a loss
function. In effect, observations that the earlier trees predicted poorly receive more weight
when fitting later trees in the series. The Gradient Boosting model in this diagram uses 50
iterations with a Shrinkage value of 0.10 to scale down the contribution of each tree and a
Training proportion of 60%, where a different training sample is drawn for each iteration. The
other defaults of the
Gradient Boosting node were kept with the assessment measure being Average Square Error on
the Validation data set.
Results:
The Misclassification decision tree had nearly 500,000 degrees of freedom. The inputs included
18 interval variables and 2 nominal variables. The final selected model had only four terminal
leaves meaning there were only three splits. The first split was on Flood Storm count and the
second split was on Cold Storm count. The third and final split was on Highest Temperature of
the Week, with temperatures above 52.5 associated with higher admit rates (and the group size
of the leaf equal to the minimum value of 5). Further iterations improved the misclassification
rate of the training data set, but the misclassification rate of the validation data set reached its
lowest value at four iterations and remained at the same rate with further iterations.
Therefore, the simplest model (i.e. the one with the fewest iterations) with the best
misclassification rate was chosen as the final model. The cumulative lift chart for this model is
shown in Figure 2 below. Considering the top 10% of ordered cases, a lift of 10 indicates that
the cases the model ranks highest contain ten times as many admits as a random sample of
the same size would.
Figure 2: Cumulative Lift Chart for Misclassification Decision Tree
The Average Square Error decision tree had an identical set-up to the Misclassification decision
tree, except for its model-selection criterion. Its average square error was slightly lower than
that of the Misclassification decision tree, although the two trees had the same
misclassification rate on the validation data set. The final selected model had nine
terminal leaves. Figure 3 shows that the best model was chosen after 9 iterations, and the
decision would have been different if using misclassification rate. Figure 4 shows the Lift Chart;
the entire lift benefit is achieved by the 10th percentile on the Training data set. The cumulative
lift chart is nearly identical to that shown for the Misclassification decision tree, with a
cumulative lift value of 10 at the 10th percentile.
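Cumulative lift at a given depth is simply the response rate among the top-ranked cases divided by the overall response rate; a minimal sketch with made-up scores:

```python
import numpy as np

# Hypothetical predicted scores and actual admits for 100 cases
rng = np.random.default_rng(12345)
actual = np.zeros(100)
actual[:10] = 1                              # 10% of cases have an admit
scores = actual + rng.normal(0, 0.01, 100)   # a model that ranks admits first

# Cumulative lift at the top decile: response rate in the top 10%
# of ranked cases divided by the overall response rate
order = np.argsort(scores)[::-1]
top_decile = actual[order[:10]]
lift = top_decile.mean() / actual.mean()
```

When every admit falls in the top decile, as here, the lift is 10, matching the maximum possible value seen in the charts above.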
The first two splits in the Average Square Error decision tree matched those for the
Misclassification decision tree: Flood Storm count and then Cold Storm count. For the areas
with a higher flood storm count, larger populations above 3048.5 were much more likely to
have a larger number of admits. Highest temperature of the week was again used as a splitting
variable before Week Number was split on twice. Including the week number in the model may
not contribute much to future predictions, but it could indicate trends as to whether a certain
area is becoming more or less prone to the transfer of infectious diseases. Further down the
tree model, additional rules were again made on splitting on the Flood Storm and Cold Storm
data. These particular storms seem especially correlated with the number of admits in
subsequent weeks.
Figure 3: Model Iteration Plots for Average Square Error Decision Tree
Figure 4: Lift Chart for Average Square Error Decision Tree
The final model selected from the Gradient Boosting node was chosen based on average square
error of the Validation data set as opposed to the default profit criterion. This model
performed very similarly to the two decision tree models with a nearly identical cumulative lift
chart, but the predictor variables were considerably different. The first and third most
important variables were the DRG code and the Age Group, respectively. This indicates that
there is much more involved than just the type of storm. There are strong interaction effects
that determine the number of admits. Additionally, the only storm count that incurred any
splitting was Wind Storm.
Figure 5: Variable Importance for Gradient Boosting model
Variable Name      Variable Label                                 Splitting  Importance  Validation  Validation/  Interaction
                                                                  Rules                  Importance  Training     Importance
                                                                                                     Importance
DRG24              DRG24                                            317      1           1           1            0.05705945
WindStormCount                                                      620      0.90436482  0.68292926  0.755147979  0.05235151
age_group                                                           150      0.54379129  0.31783064  0.584471741  0.03751741
_NODEID_                                                             39      0.23469854  0.24760786  1.055003845  0.00276443
week               week                                             374      0.43223949  0.20742442  0.479883087  0.00319923
Schooldays         Schooldays                                        72      0.30115671  0.18227595  0.605252845  0.01177374
IMP_MinLowT        Imputed: Lowest temperature of the week           47      0.34987899  0.16313048  0.466248272  0.00187398
IMP_AvgLowT        Imputed: Average low temperature of the week      13      0.10240737  0.05075145  0.495583996  NaN
IMP_MaxHighT       Imputed: Highest temperature of the week          72      0.15980206  0.04996011  0.312637462  NaN
IMP_week           Imputed: week                                     79      0.18864154  0.01674544  0.088768586  NaN
IMP_Daylight       Imputed: Daylight of the week                      0      0           0           NaN          NaN
Workdays           Workdays                                           0      0           0           NaN          NaN
coldStormCount                                                        0      0           0           NaN          NaN
ThunderStormCount                                                     0      0           0           NaN          NaN
IMP_AvgHighT       Imputed: Average high temperature of the week      0      0           0           NaN          NaN
WinterStormCount                                                      0      0           0           NaN          NaN
IMP_snow           Imputed: Inches of snow                            0      0           0           NaN          NaN
FloodStormCount                                                       0      0           0           NaN          NaN
population                                                            0      0           0           NaN          NaN
IMP_prcp           Imputed: Precipitation                             0      0           0           NaN          NaN
The model comparison node selected the Gradient Boosting model as it had the lowest
validation average square error value of 0.0246. The Average Square Error Decision Tree was
second with a validation average square error value of 0.0249, and the Misclassification
Decision Tree was not far behind at 0.0250. Timeout errors were received when trying to fit
regression and neural node models; the root cause is unknown.
Conclusions:
After meticulous merging of the given data set with all available weather, storm, population,
calendar and DRG parameters, the developed models achieved a low average square error
value of less than 0.0250 using the number of admits as the response variable. The
misclassification rate of all of the models was primarily based on the prevalence of the weeks
with admits equal to 0; however, the low average square error of the selected model lends
some confidence to making predictions.
The Gradient Boosting model was chosen over the two Decision Tree models, although they all
had similar response characteristics. The Wind, Flood and Cold storms were involved in the
models, but Thunderstorm and Winter storms were not. Storms that affect people’s behavior
and keep them indoors in close proximity to others are more likely to contribute to the spread
of infectious diseases. In that light, wind storms can knock out electricity and cause severe
damage, impairing travel and normal school and business functions. Floods and cold storms
can likewise be major disruptions. On the other hand, people are accustomed to
thunderstorms and winter storms and have methods of dealing with them while not greatly
altering their normal lifestyle.
The DRG code as well as the Age Group played significant roles in the chosen model. The
interactions between these variables and the weather data are evident; certain events will
affect some groups and not others. Additionally, some of the DRGs only apply to certain Age
Groups, so the relationship is not surprising. For cold storms, the respiratory illnesses including
bronchitis, pneumonia and respiratory infections were most positively correlated with an
increase in cold storms, while viral illnesses and fevers actually had a negative correlation. For
floods, viral illness and fever as well as otitis media and URI were positively correlated with
increased storm incidence. These relationships are built into the model, so it can be used both
for prediction and for scenario planning if planners want to see what their admission needs
might be under given storm conditions.
Appendix A: SAS Enterprise Miner 6.2 .XML Diagram Instructions
In order to run the attached “Shootout_Team_SATURN.xml” file, open Enterprise Miner and
create a new project. In the Project Start Code, add a libname titled “shootout” with a path to
a directory containing the SAS2011_STORM dataset. Add the following 7 data sources to the
project: CALENDAR_DATA (created from Calendar.xls), DRG_LIST_DATA (created from
DRG_list.xls), SAS2011_ADMITS, SAS2011_POPULATION, SAS2011_POPULATION_2005,
SAS2011_WEATHER, SCORE_DATA. All of the datasets were provided by the SAS Shootout 2011
project. Import the Shootout_Team_SATURN.xml diagram and run all nodes.