Readmission of Diabetes Patients
Project Report
by
Rahmawati Nusantari
Maria D. Marroquin
Essenam Kakpo
Hong Lu
Team 10
INSY 5339 – Th. 7 pm-9:50 pm
May 12, 2016
Table of Contents
Problem Domain
Data Summary
Encounters
Features
Target Variable
Prediction
Data Cleaning Process
Data Cleaning Tools
Missing Values
Irrelevant Data
Data Imbalance
Past Cleaning Efforts
Discretization
Various SMOTE Percentages
Algorithms Utilized
Classifiers
Comparison of Bayes Classifiers
Factor Experimental Design
Number of Attributes
Noise
Experiments
Combination Sets with Each Classifier
Summary of Results
Analysis and Conclusion
ROC Curves
Additional Analysis
Overall Observations
References
Problem Domain
Dataset Summary
The dataset was obtained from the UCI Machine Learning Repository, where it is listed under the name
Diabetes 130 – US Hospitals. According to the dataset description, the data was prepared to
analyze factors related to readmission as well as other outcomes pertaining to patients with
diabetes. The dataset represents 10 years (1999-2008) of clinical care at 130 U.S. hospitals and
integrated delivery networks. It contains 101,766 unique inpatient encounters
(instances) with 50 attributes, for a total of 5,088,300 cells.
Encounters (Records)
As stated on the UCI’s dataset information page, the dataset contains encounters that satisfied the
following criteria:
• It is an inpatient encounter (a hospital admission).
• It is a diabetic encounter, that is, one during which any kind of diabetes was entered into
the system as a diagnosis.
• The length of stay was at least 1 day and at most 14 days.
• Laboratory tests were performed during the encounter.
• Medications were administered during the encounter.
Features (Attributes)
The attributes represent patient and hospital outcomes. This dataset mostly contains nominal
attributes such as medical specialty and gender, but it also includes a few ordinal attributes such as
age and weight, and continuous attributes such as time (days) in hospital and number of medications.
The following table lists each attribute, its description, and the percentage of missing values
for that attribute.
Attributes and Target Variable Table

• Encounter ID (Numeric; 0% missing): Unique identifier of an encounter
• Patient number (Numeric; 0% missing): Unique identifier of a patient
• Race (Nominal; 2% missing): Values: Caucasian, Asian, African American, Hispanic, and other
• Gender (Nominal; 0% missing): Values: male, female, and unknown/invalid
• Age (Nominal; 0% missing): Grouped in 10-year intervals: [0, 10), [10, 20), …, [90, 100)
• Weight (Numeric; 97% missing): Weight in pounds
• Admission type (Nominal; 0% missing): Integer identifier corresponding to 9 distinct values, for example, emergency, urgent, elective, newborn, and not available
• Discharge disposition (Nominal; 0% missing): Integer identifier corresponding to 29 distinct values, for example, discharged to home, expired, and not available
• Admission source (Nominal; 0% missing): Integer identifier corresponding to 21 distinct values, for example, physician referral, emergency room, and transfer from a hospital
• Time in hospital (Numeric; 0% missing): Integer number of days between admission and discharge
• Payer code (Nominal; 52% missing): Integer identifier corresponding to 23 distinct values, for example, Blue Cross/Blue Shield, Medicare, and self-pay
• Medical specialty (Nominal; 53% missing): Integer identifier of the specialty of the admitting physician, corresponding to 84 distinct values, for example, cardiology, internal medicine, family/general practice, and surgeon
• Number of lab procedures (Numeric; 0% missing): Number of lab tests performed during the encounter
• Number of procedures (Numeric; 0% missing): Number of procedures (other than lab tests) performed during the encounter
• Number of medications (Numeric; 0% missing): Number of distinct generic names administered during the encounter
• Number of outpatient visits (Numeric; 0% missing): Number of outpatient visits of the patient in the year preceding the encounter
• Number of emergency visits (Numeric; 0% missing): Number of emergency visits of the patient in the year preceding the encounter
• Number of inpatient visits (Numeric; 0% missing): Number of inpatient visits of the patient in the year preceding the encounter
• Diagnosis 1 (Nominal; 0% missing): The primary diagnosis (coded as first three digits of ICD9); 848 distinct values
• Diagnosis 2 (Nominal; 0% missing): Secondary diagnosis (coded as first three digits of ICD9); 923 distinct values
• Diagnosis 3 (Nominal; 1% missing): Additional secondary diagnosis (coded as first three digits of ICD9); 954 distinct values
• Number of diagnoses (Numeric; 0% missing): Number of diagnoses entered into the system
• Glucose serum test result (Nominal; 0% missing): Indicates the range of the result or if the test was not taken. Values: ">200," ">300," "normal," and "none" if not measured
• A1c test result (Nominal; 0% missing): Indicates the range of the result or if the test was not taken. Values: ">8" if the result was greater than 8%, ">7" if the result was greater than 7% but less than 8%, "normal" if the result was less than 7%, and "none" if not measured
• Change of medications (Nominal; 0% missing): Indicates if there was a change in diabetic medications (either dosage or generic name). Values: "change" and "no change"
• Diabetes medications (Nominal; 0% missing): Indicates if there was any diabetic medication prescribed. Values: "yes" and "no"
• 24 features for medications (Nominal; 0% missing): For the generic names metformin, repaglinide, nateglinide, chlorpropamide, glimepiride, acetohexamide, glipizide, glyburide, tolbutamide, pioglitazone, rosiglitazone, acarbose, miglitol, troglitazone, tolazamide, examide, sitagliptin, insulin, glyburide-metformin, glipizide-metformin, glimepiride-pioglitazone, metformin-rosiglitazone, and metformin-pioglitazone, the feature indicates whether the drug was prescribed or there was a change in the dosage. Values: "up" if the dosage was increased during the encounter, "down" if the dosage was decreased, "steady" if the dosage did not change, and "no" if the drug was not prescribed
• Readmitted (Nominal; 0% missing): Days to inpatient readmission. Values: "<30" if the patient was readmitted in less than 30 days, ">30" if the patient was readmitted in more than 30 days, and "No" for no record of readmission
Target Variable
The last attribute in the previous table is the class attribute, which in this case is Readmission.
The distribution of the class attribute is as follows:
• Encounters of patients who were not readmitted (No) to the hospital: 54,864 encounters.
• Encounters of patients who were readmitted to the hospital more than 30 days after discharge
(>30): 35,545 encounters.
• Encounters of patients who were readmitted to the hospital within 30 days of discharge
(<30): 11,357 encounters.
Prediction
We want to predict whether or when diabetes patients will be readmitted to the hospital based on
several factors (attributes).
[Diagram: Readmission? Possible outcomes: <30, >30, No]
Data Cleaning Process
Data cleaning is commonly defined as the process of detecting and correcting corrupt or inaccurate
records in a dataset, table, or database. [1] Data quality is an important component of any data
mining effort; for this reason, many data scientists spend 50% to 80% of their time preparing
and cleaning their data before it can be mined for insights. [2]
There are four broad categories of data quality problems: missing data, abnormal data (outliers),
departure from models, and goodness-of-fit. [3] For our project, our team mainly dealt with missing
data. We also addressed the imbalance in the class variable using SMOTE.
Data Cleaning Tools
Our team used Microsoft Excel to perform the data cleaning. To guide our understanding of the
variables and the meaning of the data, we consulted the research article that introduced the
dataset: "Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical
Database Patient Records" by Beata Strack et al.
Missing Values
The article identified three attributes with most of their records missing: weight (97%),
payer code (52%), and medical specialty (53%). Weight was not consistently recorded because the data
was collected prior to the HITECH provisions of the American Recovery and
Reinvestment Act of 2009, while payer code was deemed irrelevant by the researchers. As a result,
these three attributes were deleted.
There were also 23 medication attributes, such as metformin and other generic medications, whose
value indicated "not prescribed" in 79% to 99% of their records. As a result, all 23 of these
attributes were deleted. Insulin was the only medication attribute retained, since it carried
informative values in more than 50% of its records and is prevalent in diabetic patient cases.
Irrelevant Data
The class attribute indicates whether a patient was readmitted to the hospital within 30 days, after
more than 30 days, or not readmitted at all. The discharge disposition attribute takes 29 distinct
values indicating, for example, that the patient was discharged to home or to another hospital, sent
to hospice as a terminally ill patient, or had passed away.
[1] https://en.wikipedia.org/wiki/Data_cleansing
[2] Steve Lohr, The New York Times, August 17, 2014, For Big-Data Scientists, 'Janitor Work' is Key Hurdle to Insights.
[3] Tamraparni Dasu and Theodore Johnson, Exploratory Data Mining and Data Quality, Wiley, 2004.
To include only patients who were alive and not in hospice, we removed records with
Discharge Disposition codes of 11, 13, 14, 19, 20, and 21. These discharge codes matched the
instances of patients who were deceased or sent to hospice. This cleaning step removed 2,423
instances.
Data Imbalance
SMOTE (Synthetic Minority Oversampling Technique) is a filter that samples the data and alters
the class distribution. It can be used to adjust the relative frequency between the minority and
majority classes in the data. SMOTE does not under-sample the majority classes. Instead, it
oversamples the minority class by creating synthetic instances using a k-nearest-neighbor
approach. The user can specify the oversampling percentage and the number of neighbors to use
when creating synthetic instances. [4]
Our team applied SMOTE in different combinations and ultimately decided to apply a 200%
synthetic minority oversample with 3-nearest-neighbors as shown below.
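The core idea of this oversampling step can be sketched in a few lines of Python. This is a minimal illustration of SMOTE (interpolating between a minority instance and one of its 3 nearest neighbors), not the exact WEKA implementation, and the sample data below is made up:

```python
import numpy as np

def smote(minority, percentage=200, k=3, seed=0):
    """Create `percentage`% synthetic minority samples by interpolating
    between a minority instance and one of its k nearest neighbors."""
    rng = np.random.default_rng(seed)
    n_new = int(len(minority) * percentage / 100)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x = minority[i]
        d = np.linalg.norm(minority - x, axis=1)  # distances to all samples
        d[i] = np.inf                             # exclude the sample itself
        neighbors = np.argsort(d)[:k]             # indices of k nearest neighbors
        nn = minority[rng.choice(neighbors)]
        gap = rng.random()                        # random interpolation point
        synthetic.append(x + gap * (nn - x))
    return np.array(synthetic)

# Toy minority class (<30) with two numeric features
minority = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2], [0.9, 1.9]])
new = smote(minority, percentage=200, k=3)
print(new.shape)  # 200% of 4 instances -> (8, 2)
```

Because each synthetic instance lies on a segment between two real minority instances, the new points stay inside the region the minority class already occupies.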
SMOTE filter in WEKA
[4] Ian H. Witten, Eibe Frank, Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd edition, Elsevier, 2011.
The following graphs and matrices represent the comparison of the data before the SMOTE and
after the 200% SMOTE applied to the minority class (<30).
Class Distribution Graphs Before and After 200% SMOTE
Confusion Matrices Before and After 200% SMOTE (original data vs. 200% SMOTE, using J48 and BayesNet)
Past Cleaning Efforts
Discretization
As part of our initial data cleaning efforts, we converted several nominal attributes that were
coded as integer identifiers into meaningful categories. Those attributes are diagnosis 1,
diagnosis 2, diagnosis 3, admission type, discharge disposition, and admission source. The first
three (the diagnoses) were coded based on ICD9 (the International Statistical Classification of
Diseases and Related Health Problems). For example, codes 390-459 and 785 denote diseases of the
circulatory system. After converting all the integer identifiers into nominal values, however, the
results did not show significant improvement.
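The grouping described above can be sketched as a small lookup function. Only the circulatory-system range comes from the text; the fallback category and the handling of non-numeric codes such as "V57" are our own simplifying assumptions:

```python
def icd9_group(code):
    """Map an ICD9 code (first three characters of the diagnosis) to a
    broad disease category. Only the circulatory range from the text
    (390-459 and 785) is real; the fallback grouping is a placeholder."""
    try:
        value = float(code)        # codes like 'V57' or 'E909' are non-numeric
    except ValueError:
        return "other"
    if 390 <= value <= 459 or int(value) == 785:
        return "circulatory"
    return "other"

print(icd9_group("410"))  # -> circulatory
print(icd9_group("V57"))  # -> other
```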
Various SMOTE Percentages
We also applied different SMOTE percentages mainly to the <30 minority class. However, there
was no significant improvement. We ultimately decided to apply a 200% increase on the <30
minority class as mentioned earlier.
Class Distribution Graphs with Different SMOTE Percentages
(250%, 350%, and 500% on <30; and 350% on <30 combined with 50% on >30)
Algorithms Utilized
After data cleaning and pre-processing, we selected the algorithms for our experiments. We chose
three classifiers for the experiment design:
Classifiers
• J48. A decision tree learner that repeatedly selects the attribute that best splits the
data, seeking to maximize prediction accuracy.
• Naïve Bayes. A probabilistic classifier that combines per-attribute likelihoods under the
assumption that the attributes are independent given the class.
• Bayes Net. A probabilistic graphical model that represents a set of random variables
and their conditional dependencies through a directed acyclic graph.
Comparison of Classifiers
Since we selected two Bayes classifiers, we compared the difference between Naïve Bayes and
Bayes Net.
A Naive Bayes classifier is a simple model that describes a particular class of Bayesian
network: one in which all the features are conditionally independent of each other given the class.
Because of this, problems involving correlated features are ones that Naive Bayes cannot model
well. An advantage of Naive Bayes is that it requires only a small amount of training data to
estimate the parameters necessary for classification.
A Bayesian network models relationships between features in a much more general way: it does not
impose the independence assumption, but every dependence in the network has to be modeled
explicitly. If it is known what these relationships are, or there is enough data to derive them,
then it may be appropriate to use a Bayesian network.
The following are two examples that illustrate the differences between these two algorithms.
In the first example, a fruit may be considered to be an apple if it is red, round, and about 10
centimeters in diameter. A Naive Bayes classifier considers each of these features to contribute
independently to the probability that this fruit is an apple, regardless of any possible correlations
between the color, roundness, and diameter features.
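With made-up numbers, the apple example can be worked through directly. The prior and the per-feature likelihoods below are purely illustrative; the point is that Naive Bayes simply multiplies them:

```python
# Hypothetical probabilities for the apple example; none of these
# numbers come from the report.
p_apple, p_other = 0.5, 0.5             # class priors
likelihood_apple = 0.8 * 0.9 * 0.7      # P(red)*P(round)*P(~10cm) given apple
likelihood_other = 0.3 * 0.4 * 0.2      # same features given "other fruit"

# Naive Bayes multiplies the independent per-feature likelihoods,
# then normalizes to get the posterior P(apple | red, round, 10 cm)
score_apple = p_apple * likelihood_apple
score_other = p_other * likelihood_other
posterior = score_apple / (score_apple + score_other)
print(round(posterior, 3))
```

Any correlation between color, roundness, and diameter is ignored; each feature contributes its own independent factor to the product.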
In the second example, presume that there are two events that could cause grass to be wet: either
the sprinkler is on, or it's raining. Also, presume that the rain has a direct effect on the use of the
sprinkler (i.e. when it rains, the sprinkler is usually not turned on). Then the situation can be
modeled with a Bayesian network. All three variables have two possible values, T (for true) and
F (for false). Two attributes (Sprinkler and Rain) are correlated.
Factor Experimental Design
We selected two factors for our experimental design: Number of Attributes and Noise.
Each combination, as shown in the 2-Factor Experimental Design table below, was run
in an experiment with each algorithm using 10 random seeds.
Number of Attributes
Because of the number of attributes this dataset contained, both originally and after cleaning, we
decided to analyze the effect of decreasing the number of attributes. We compared the results of
experimental runs on the full, cleaned dataset versus a reduced, cleaned dataset.
We used a Weka tool called InfoGain, which evaluates the worth of each attribute by measuring the
information gain with respect to the class. The output ranked all attributes, and we selected the
top 10. We then compared experiment results from the dataset containing 22 attributes versus the
same dataset containing only the top 10 attributes.
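The quantity InfoGain ranks by can be sketched directly from its definition, H(class) - H(class | attribute). The toy attribute and class values below are hypothetical:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(attribute_values, labels):
    """Worth of an attribute as H(class) - H(class | attribute)."""
    n = len(labels)
    conditional = 0.0
    for v in set(attribute_values):
        subset = [y for a, y in zip(attribute_values, labels) if a == v]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional

# Toy data: this attribute perfectly separates the two class values
attr = ["yes", "yes", "no", "no"]
cls = ["<30", "<30", "NO", "NO"]
print(info_gain(attr, cls))  # -> 1.0 (one full bit of information)
```

An attribute that carries no information about the class scores 0, so ranking attributes by this value and keeping the top 10 is exactly the selection step described above.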
Noise
Noise refers to the corruption of original values, analogous to distortion on a phone call or
fuzziness on a screen. We wanted to observe the effect of added noise on classification
performance, so noise was selected as our second factor in the experiment design. We carefully
added 10% noise to the target variable only, ran the experiments, and compared the results to
those on the dataset without noise.
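A minimal sketch of injecting such class noise, assuming "adding 10% noise" means flipping a fraction of labels to a different class value chosen at random:

```python
import random

def add_class_noise(labels, fraction=0.10, seed=42):
    """Flip `fraction` of the class labels to a different class value
    chosen at random (our reading of adding 10% target-variable noise)."""
    rng = random.Random(seed)
    classes = sorted(set(labels))
    noisy = list(labels)
    for i in rng.sample(range(len(labels)), int(len(labels) * fraction)):
        noisy[i] = rng.choice([c for c in classes if c != noisy[i]])
    return noisy

labels = ["NO"] * 50 + [">30"] * 30 + ["<30"] * 20
noisy = add_class_noise(labels, fraction=0.10)
print(sum(a != b for a, b in zip(labels, noisy)))  # -> 10 of 100 labels changed
```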
2-Factor Experimental Design Table

              All Attributes                   Selected Attributes
No Noise      C1: All Attributes & No Noise    C3: Selected Attributes & No Noise
Noise (10%)   C2: All Attributes & Noise       C4: Selected Attributes & Noise
Experiments
As previously mentioned, our experiment design was composed of two factors (Selected
Attributes and Noise), giving us four different sets of experiments to run:
• C1: All Attributes & No Noise
• C2: All Attributes & Noise
• C3: Selected Attributes & No Noise
• C4: Selected Attributes & Noise
Combination Sets with Each Classifier
C1
• E1: Performance of J48 for All Attributes, No Noise
• E2: Performance of Naïve Bayes for All Attributes, No Noise
• E3: Performance of Bayes Net for All Attributes, No Noise
C2
• E4: Performance of J48 for All Attributes, 10% Noise
• E5: Performance of Naïve Bayes for All Attributes, 10% Noise
• E6: Performance of Bayes Net for All Attributes, 10% Noise
C3
• E7: Performance of J48 for Selected Attributes, No Noise
• E8: Performance of Naïve Bayes for Selected Attributes, No Noise
• E9: Performance of Bayes Net for Selected Attributes, No Noise
C4
• E10: Performance of J48 for Selected Attributes, 10% Noise
• E11: Performance of Naïve Bayes for Selected Attributes, 10% Noise
• E12: Performance of Bayes Net for Selected Attributes, 10% Noise
Each of the experiments E1, E2,…, E12 was run 10 separate times with a different seed each
time, ensuring that the algorithm would use a slightly different training data set each time. For
each of the experiments, the percentage split was 66% training and 34% testing. For each C1-C4,
we use 3 different algorithms:
• Experiments E1, E4, E7, and E10 use the J48 algorithm.
• Experiments E2, E5, E8, and E11 use the Naïve Bayes algorithm.
• Experiments E3, E6, E9, and E12 use the Bayes Net algorithm.
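The seeded 66/34 percentage split described above can be sketched as follows. This is a simplification of Weka's behavior, using toy data:

```python
import random

def percentage_split(data, train_fraction=0.66, seed=1):
    """Shuffle with the given seed, then split into training and testing
    portions (a simplified stand-in for Weka's randomized percentage split)."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))
splits = [percentage_split(data, 0.66, seed) for seed in range(1, 11)]  # 10 seeds
train, test = splits[0]
print(len(train), len(test))  # -> 66 34
```

Each seed yields a different shuffle, so every run trains on a slightly different 66% of the instances, which is the point of repeating each experiment 10 times.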
The following tables are the results of the experiments conducted:
Summary of Results
Results of Experiments Graph
From the graph shown above, we can infer that C1 (All Attributes & No Noise) is the best
experiment set, since it yields the highest accuracy across all algorithms. Next is C3 (Selected
Attributes & No Noise), which yields slightly lower results, though still significantly
higher than those obtained from C2 (third best) and C4 (fourth best).
When it comes to the accuracy of the algorithms, Bayes Net leads by a significant margin over
J48 and Naïve Bayes across all four experiment sets. J48 comes in second, performing up to
1.7 percentage points higher than Naïve Bayes across all experiments.
We can say the Selected Attributes considered in C3 and C4 are the most relevant ones because,
controlling for noise, accuracy declines by less than 1.5 percentage points when switching from
all attributes to only the Selected Attributes.
Accuracy Averages (%)

              C1          C2          C3          C4
Naïve Bayes   56.74272    52.97854    55.50037    51.7537
J48           57.66216    53.49725    57.19291    52.9832
Bayes Net     64.22257    59.50229    63.85893    59.1443
Standard Errors Graph
Looking at the standard errors across experiments, we noticed that C1 stands out with
relatively high values and the highest mean standard error among all four experiment sets. The
mean standard errors for C2, C3, and C4 are roughly the same.
The line graph of the standard errors shows a different trend for each algorithm. The area under
the J48 curve from C1 to C4 is roughly similar to that of Naïve Bayes, and both are relatively
high. Bayes Net, on the other hand, shows a slightly lower standard error in general, with a
smaller area under its curve. We can infer that Bayes Net has the smallest standard error among
the three algorithms used.
[Line graph of standard errors (0 to 0.35) for NaiveBayes, J48, and BayesNet across C1-C4]
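The standard error behind these comparisons is simply the sample standard deviation of the 10 run accuracies divided by the square root of the number of runs. The accuracies below are hypothetical, not the report's values:

```python
from math import sqrt
from statistics import mean, stdev

def mean_and_se(accuracies):
    """Average accuracy and its standard error over repeated runs:
    SE = sample standard deviation / sqrt(number of runs)."""
    return mean(accuracies), stdev(accuracies) / sqrt(len(accuracies))

# Hypothetical accuracies from 10 seeded runs (not the report's numbers)
runs = [56.1, 57.0, 56.5, 57.3, 56.8, 56.2, 57.1, 56.9, 56.4, 56.7]
avg, se = mean_and_se(runs)
print(round(avg, 2), round(se, 3))
```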
Analysis and Conclusion
In order to evaluate the performance of our algorithms, we will be using ROC curves.
ROC Curves
A receiver operating characteristic (ROC) curve is a graphical plot that shows the performance of
a binary classifier as its discrimination threshold is varied. The curve is created by plotting
the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.
The true positive rate is also known as sensitivity, or recall in machine learning. The false
positive rate is also known as the fall-out and can be calculated as (1 - specificity).
To determine which algorithm performs better, we look at the shape of each curve: the closer the
curve follows the y-axis and then the top border, the larger the area under it, and the more
accurate the test.
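The (FPR, TPR) pairs that make up such a curve can be computed directly from scores and labels. The scores and labels below are invented for illustration:

```python
def roc_points(scores, labels, thresholds):
    """(FPR, TPR) pairs for a binary classifier: predict positive
    whenever the score is at or above the threshold."""
    points = []
    for t in thresholds:
        tp = sum(s >= t and y == 1 for s, y in zip(scores, labels))
        fp = sum(s >= t and y == 0 for s, y in zip(scores, labels))
        fn = sum(s < t and y == 1 for s, y in zip(scores, labels))
        tn = sum(s < t and y == 0 for s, y in zip(scores, labels))
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

# Invented predicted probabilities for the positive ("NO") class
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
print(roc_points(scores, labels, [0.5]))  # -> [(0.25, 0.75)]
```

Sweeping the threshold from 1 down to 0 traces the curve from (0, 0) to (1, 1); a classifier whose curve hugs the y-axis and the top border has a larger area under it.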
In order to get a plot of the curves, we made use of Weka’s Knowledge Flow tool. We loaded the
following workflow through which we derived the curves:
Knowledge Flow
The above Knowledge Flow loads the specified file, assigns which attribute is the class, chooses
which class value to plot the curve for, and lets us select a percentage split. The three
algorithms are then run on the dataset with the selected parameters, their performance is
recorded, and the results are used to plot the ROC curves.
The following are ROC curves for each experiment set when the class value is NO.
ROC Curve Graphs When Readmission is NO
C1
C2
ROC Curve Graphs When Readmission is NO
C3
C4
From the above graphs, we observe that the area under the curve is greater for Naïve Bayes, in
each of the four experiment sets, C1, C2, C3 and C4. We can therefore conclude that Naïve
Bayes is more accurate than the other algorithms, according to these ROC curves.
Additional Analysis
Let's take a look at the confusion matrices we obtained from C1 and C3, which give us the most
relevant models.
Confusion Matrix Tables for C1 and C3
We can conclude from these confusion matrices that Bayes Net gives a higher percentage of
True Positives across class values <30, >30 and NO. Bayes Net therefore appears to be the best
predictor from the point of view of our confusion matrices.
Overall Observations
Considering the average accuracy, average standard deviation, ROC curves, attributes, and
classifier evaluation, we recommend the following for the Readmission of Diabetes Patients
dataset:
• Class balancing: SMOTE increased overall model accuracy (see the SMOTE Comparison Matrices
below).
• Classifier: Bayes Net gives the highest accuracy. The Naive Bayes classifier assumes
independence between all attributes given the class, which is seldom true; that is why it is
called "naïve." In contrast, a Bayesian network can model the problem in more detail through
several layers of dependencies: it can track cause-effect relationships among the attributes and
the class while calculating and drawing the probabilistic graph.
• Attributes factor: Using all attributes instead of the top 10 gives the highest accuracy.
SMOTE Comparison Matrices
• Original Data using J48
=== Confusion Matrix ===
a b c <-- classified as
15914 2664 117 | a = NO
8222 3728 141 | b = >30
2396 1371 47 | c = <30
• After SMOTE 200% using J48
=== Confusion Matrix ===
a b c <-- classified as
13827 2683 1231 | a = NO
7596 3282 1331 | b = >30
3427 1433 6660 | c = <30
• Original Data using Naïve Bayes
=== Confusion Matrix ===
a b c <-- classified as
16009 2138 548 | a = NO
8230 3168 693 | b = >30
2430 927 457 | c = <30
• After SMOTE 200% using Naïve Bayes
=== Confusion Matrix ===
a b c <-- classified as
15294 1439 1008 | a = NO
8445 2092 1672 | b = >30
4721 818 5981 | c = <30
• Original Data using BayesNet
=== Confusion Matrix ===
a b c <-- classified as
13302 4867 526 | a = NO
5138 6440 513 | b = >30
1662 1800 352 | c = <30
• After SMOTE 200% using BayesNet
=== Confusion Matrix ===
a b c <-- classified as
12746 4934 61 | a = NO
5802 6312 95 | b = >30
1714 2063 7743 | c = <30
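As a sanity check on how these matrices are read, overall accuracy is the diagonal (correctly classified instances) divided by the total number of instances. For example, using the J48 original-data matrix above:

```python
def accuracy(matrix):
    """Overall accuracy: correctly classified instances (the diagonal)
    divided by the total number of instances."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# J48 on the original (pre-SMOTE) data, copied from the matrix above
j48_original = [
    [15914, 2664, 117],  # a = NO
    [8222, 3728, 141],   # b = >30
    [2396, 1371, 47],    # c = <30
]
print(round(accuracy(j48_original), 4))  # -> 0.569
```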
References
• Beata Strack et al., "Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records."
• Hindawi Publishing Corporation - http://www.hindawi.com/journals/bmri/2014/781670/tab1/
• Ian H. Witten, Eibe Frank, Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd edition, Elsevier, 2011.
• UCI Machine Learning Repository - https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008#
• Naïve Bayes for Dummies - http://blog.aylien.com/post/120703930533/naive-bayes-for-dummies-a-simple-explanation
• Steve Lohr, The New York Times, August 17, 2014, For Big-Data Scientists, 'Janitor Work' is Key Hurdle to Insights.
• Tamraparni Dasu and Theodore Johnson, Exploratory Data Mining and Data Quality, Wiley, 2004.
• Wikipedia:
  https://en.wikipedia.org/wiki/Data_cleansing
  https://en.wikipedia.org/wiki/Receiver_operating_characteristic
  https://en.wikipedia.org/wiki/Bayesian_network