Readmission of Diabetes Patients
Project Report
by
Rahmawati Nusantari
Maria D. Marroquin
Essenam Kakpo
Hong Lu
Team 10
INSY 5339 – Th. 7 pm-9:50 pm
May 12, 2016
Table of Contents
Problem Domain
Dataset Summary
Encounters (Records)
Features (Attributes)
Target Variable
Prediction
Data Cleaning Process
Data Cleaning Tools
Missing Values
Irrelevant Data
Data Imbalance
Past Cleaning Efforts
Discretization
Various SMOTE Percentages
Algorithms Utilized
Classifiers
Comparison of Bayes Classifiers
Factor Experimental Design
Number of Attributes
Noise
Experiments
Combination Sets with Each Classifier
Summary of Results
Analysis and Conclusion
ROC Curves
Additional Analysis
Overall Observations
References
Problem Domain
Dataset Summary
The dataset was obtained from the UCI Machine Learning Repository. It is listed under the name
Diabetes 130 – US Hospitals. According to the dataset description, the data has been prepared to
analyze factors related to readmission as well as other outcomes pertaining to patients with
diabetes. The dataset represents 10 years (1999-2008) of clinical care at 130 U.S. hospitals and
integrated delivery networks. The dataset contains 101,766 unique inpatient encounters
(instances) with 50 attributes, for a total of 5,088,300 cells.
Encounters (Records)
As stated on the UCI’s dataset information page, the dataset contains encounters that satisfied the
following criteria:
•   It is an inpatient encounter (a hospital admission).
•   It is a diabetic encounter, that is, one during which any kind of diabetes was entered to
the system as a diagnosis.
•   The length of stay was at least 1 day and at most 14 days.
•   Laboratory tests were performed during the encounter.
•   Medications were administered during the encounter.
Features (Attributes)
The attributes represent patient and hospital outcomes. This data set mostly contains nominal
attributes such as medical specialty and gender, but it also includes a few ordinal attributes such as
age and weight, and continuous attributes such as time (days) in hospital and number of medications.
The following table lists each attribute, its description, and the percentage of missing information
for each attribute.
Attributes and Target Variable Table

| Feature name | Type | Description and values | % missing |
|--------------|------|------------------------|-----------|
| Encounter ID | Numeric | Unique identifier of an encounter | 0% |
| Patient number | Numeric | Unique identifier of a patient | 0% |
| Race | Nominal | Values: Caucasian, Asian, African American, Hispanic, and other | 2% |
| Gender | Nominal | Values: male, female, and unknown/invalid | 0% |
| Age | Nominal | Grouped in 10-year intervals: [0, 10), [10, 20), …, [90, 100) | 0% |
| Weight | Numeric | Weight in pounds | 97% |
| Admission type | Nominal | Integer identifier corresponding to 9 distinct values, for example, emergency, urgent, elective, newborn, and not available | 0% |
| Discharge disposition | Nominal | Integer identifier corresponding to 29 distinct values, for example, discharged to home, expired, and not available | 0% |
| Admission source | Nominal | Integer identifier corresponding to 21 distinct values, for example, physician referral, emergency room, and transfer from a hospital | 0% |
| Time in hospital | Numeric | Integer number of days between admission and discharge | 0% |
| Payer code | Nominal | Integer identifier corresponding to 23 distinct values, for example, Blue Cross/Blue Shield, Medicare, and self-pay | 52% |
| Medical specialty | Nominal | Integer identifier of a specialty of the admitting physician, corresponding to 84 distinct values, for example, cardiology, internal medicine, family/general practice, and surgeon | 53% |
| Number of lab procedures | Numeric | Number of lab tests performed during the encounter | 0% |
| Number of procedures | Numeric | Number of procedures (other than lab tests) performed during the encounter | 0% |
| Number of medications | Numeric | Number of distinct generic names administered during the encounter | 0% |
| Number of outpatient visits | Numeric | Number of outpatient visits of the patient in the year preceding the encounter | 0% |
| Number of emergency visits | Numeric | Number of emergency visits of the patient in the year preceding the encounter | 0% |
| Number of inpatient visits | Numeric | Number of inpatient visits of the patient in the year preceding the encounter | 0% |
| Diagnosis 1 | Nominal | The primary diagnosis (coded as first three digits of ICD9); 848 distinct values | 0% |
| Diagnosis 2 | Nominal | Secondary diagnosis (coded as first three digits of ICD9); 923 distinct values | 0% |
| Diagnosis 3 | Nominal | Additional secondary diagnosis (coded as first three digits of ICD9); 954 distinct values | 1% |
| Number of diagnoses | Numeric | Number of diagnoses entered to the system | 0% |
| Glucose serum test result | Nominal | Indicates the range of the result or if the test was not taken. Values: ">200," ">300," "normal," and "none" if not measured | 0% |
| A1c test result | Nominal | Indicates the range of the result or if the test was not taken. Values: ">8" if the result was greater than 8%, ">7" if the result was greater than 7% but less than 8%, "normal" if the result was less than 7%, and "none" if not measured | 0% |
| Change of medications | Nominal | Indicates if there was a change in diabetic medications (either dosage or generic name). Values: "change" and "no change" | 0% |
| Diabetes medications | Nominal | Indicates if there was any diabetic medication prescribed. Values: "yes" and "no" | 0% |
| 24 features for medications | Nominal | For the generic names: metformin, repaglinide, nateglinide, chlorpropamide, glimepiride, acetohexamide, glipizide, glyburide, tolbutamide, pioglitazone, rosiglitazone, acarbose, miglitol, troglitazone, tolazamide, examide, sitagliptin, insulin, glyburide-metformin, glipizide-metformin, glimepiride-pioglitazone, metformin-rosiglitazone, and metformin-pioglitazone, the feature indicates whether the drug was prescribed or there was a change in the dosage. Values: "up" if the dosage was increased during the encounter, "down" if the dosage was decreased, "steady" if the dosage did not change, and "no" if the drug was not prescribed | 0% |
| Readmitted | Nominal | Days to inpatient readmission. Values: "<30" if the patient was readmitted in less than 30 days, ">30" if the patient was readmitted in more than 30 days, and "No" for no record of readmission | 0% |
Target Variable
The last attribute in the previous table is the class attribute, which in this case is Readmission.
The distribution of the class attribute is as follows:
•   Encounters of patients who were not readmitted (No) to the hospital: 54,864 encounters.
•   Encounters of patients who were readmitted to the hospital after 30 days of discharge (>30): 35,545 encounters.
•   Encounters of patients who were readmitted to the hospital within 30 days of discharge (<30): 11,357 encounters.
Note that always predicting the majority class (No) would yield a baseline accuracy of 54,864 / 101,766 ≈ 53.9%.
Prediction
We want to predict whether or when diabetes patients will be readmitted to the hospital based on
several factors (attributes).
Data Cleaning Process
Data cleaning is commonly defined as the process of detecting and correcting corrupt or inaccurate
records from a dataset, table, or database [1]. Data quality is an important component of any data
mining effort. For this reason, many data scientists spend 50% to 80% of their time preparing
and cleaning their data before it can be mined for insights [2].
There are four broad categories of data quality problems: missing data, abnormal data (outliers),
departure from models, and goodness-of-fit [3]. For our project, our team mainly dealt with missing
data. We also addressed the imbalance in the class variable using SMOTE.
Data Cleaning Tools
Our team utilized Microsoft Excel to perform the data cleaning. To understand the variables and the
meaning of the data, we consulted the research article from which the data originates: "Impact
of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database
Patient Records" by Beata Strack et al.
Missing Values
The article identified three attributes with the majority of their records missing: weight
(97%), payer code (52%), and medical specialty (53%). Weight was not consistently recorded because
the data predates the HITECH provisions of the American Recovery and
Reinvestment Act of 2009, while payer code was deemed irrelevant by the researchers. As a result,
these three attributes were deleted.
There were also 23 attributes that had zero values in 79% to 99% of their records. These are
medication features such as metformin and other generics, where a zero value indicates
that the medication was not prescribed to the patient. All 23 of these attributes
were deleted. Insulin was the only medication attribute retained, since more than
50% of its records contained data and insulin is prevalent in diabetic patient cases.
Irrelevant Data
The class attribute determines whether a patient is readmitted to the hospital within 30 days,
after more than 30 days, or not readmitted at all. The discharge disposition attribute takes 29
distinct values that indicate, among other things, whether patients were discharged to home or to
another hospital, sent to hospice (for terminally ill patients), or deceased.
[1] https://en.wikipedia.org/wiki/Data_cleansing
[2] Steve Lohr, "For Big-Data Scientists, 'Janitor Work' Is Key Hurdle to Insights," The New York Times, August 17, 2014.
[3] Tamraparni Dasu and Theodore Johnson, Exploratory Data Mining and Data Quality, Wiley, 2004.
To include only living patients who were not discharged to hospice, we removed records that had
Discharge Disposition codes of 11, 13, 14, 19, 20, and 21. These discharge codes matched the
instances of patients who were deceased or sent to hospice. This cleaning step removed 2,423
instances.
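We performed this step in Excel, but the same filter can be sketched programmatically. Below is a minimal, hypothetical example using the Weka Java API; the file name and the attribute name discharge_disposition_id (which follows the UCI CSV export) are assumptions, not our actual workflow.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import weka.core.Attribute;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RemoveHospiceExpired {
    public static void main(String[] args) throws Exception {
        // Load the dataset (hypothetical file name).
        Instances data = DataSource.read("diabetic_data.arff");

        // Discharge disposition codes for deceased or hospice patients.
        Set<String> drop = new HashSet<>(Arrays.asList("11", "13", "14", "19", "20", "21"));

        // Attribute name as in the UCI CSV export (assumption).
        Attribute disp = data.attribute("discharge_disposition_id");

        // Iterate backwards so deletions do not shift the indices still to be visited.
        for (int i = data.numInstances() - 1; i >= 0; i--) {
            if (drop.contains(data.instance(i).stringValue(disp))) {
                data.delete(i);
            }
        }
        System.out.println("Remaining instances: " + data.numInstances());
    }
}
```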
Data Imbalance
SMOTE (Synthetic Minority Oversampling Technique) is a filter that samples the data and alters
the class distribution. It can be used to adjust the relative frequency between the minority and
majority classes in the data. SMOTE does not under-sample the majority classes; instead, it
oversamples the minority class by creating synthetic instances using a k-nearest-neighbor
approach. The user can specify the oversampling percentage and the number of neighbors to use
when creating synthetic instances [4].
Our team applied SMOTE in different combinations and ultimately decided to apply a 200%
synthetic minority oversample with 3 nearest neighbors, as shown below.
SMOTE filter in WEKA
[Figure: screenshot of the SMOTE filter settings in the Weka Explorer]
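As a programmatic counterpart to the filter settings shown above, here is a minimal sketch using the Weka Java API, assuming the SMOTE package is installed; the ARFF file name is hypothetical.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.instance.SMOTE;

public class ApplySmote {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("diabetic_data_cleaned.arff"); // hypothetical file name
        data.setClassIndex(data.numAttributes() - 1); // Readmitted is the last attribute

        SMOTE smote = new SMOTE();
        smote.setPercentage(200.0);   // 200% synthetic minority oversample
        smote.setNearestNeighbors(3); // 3 nearest neighbors
        // By default the filter auto-detects and targets the minority class (<30 in our data).
        smote.setInputFormat(data);

        Instances balanced = Filter.useFilter(data, smote);
        System.out.println("Before: " + data.numInstances()
                + ", after: " + balanced.numInstances());
    }
}
```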
[4] Ian H. Witten, Eibe Frank, and Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd edition, Elsevier, 2011.
The following graphs and matrices compare the data before SMOTE and after the 200% SMOTE
applied to the minority class (<30).

Class Distribution Graphs Before and After 200% SMOTE
[Figure: class distribution of the original data vs. after 200% SMOTE]

Confusion Matrices Before and After 200% SMOTE
[Figure: confusion matrices on the original data vs. after 200% SMOTE, using J-48 and using BayesNet]
Past Cleaning Efforts
Discretization
As part of our initial data cleaning efforts, we converted several attributes stored as integer
identifiers into nominal values. Those attributes are diagnosis 1, diagnosis 2, diagnosis 3,
admission type, discharge disposition, and admission source. The first three attributes (the
diagnoses) are coded using ICD9 (the International Statistical Classification of Diseases and
Related Health Problems); for example, codes 390-459 and 785 are diseases of the circulatory
system. After converting all the integer identifiers into nominal values, the results did not
show significant improvement.
Various SMOTE Percentages
We also applied different SMOTE percentages mainly to the <30 minority class. However, there
was no significant improvement. We ultimately decided to apply a 200% increase on the <30
minority class as mentioned earlier.
Class Distribution Graphs with Different SMOTE Percentages
[Figure: class distributions after SMOTE at 250% on <30, 350% on <30, 500% on <30, and 350% on <30 with 50% on >30]
Algorithms Utilized
After data cleaning and pre-processing, we selected three classifiers for our experiment design:
Classifiers
•   J48. A decision tree learner that repeatedly splits on the most informative attribute in
order to maximize prediction accuracy.
•   Naïve Bayes. A probabilistic classifier that applies Bayes' theorem under the assumption
that attributes are conditionally independent given the class.
•   Bayes Net. A probabilistic graphical model that represents a set of random variables
and their conditional dependencies through a directed acyclic graph.
Comparison of Bayes Classifiers
Since we selected two Bayes classifiers, we compared Naïve Bayes and Bayes Net.
A Naive Bayes classifier is a simple model that corresponds to a particular class of Bayesian
network: one in which all the features are conditionally independent of each other given the
class. Because of this assumption, there are certain problems Naive Bayes cannot solve. An
advantage of Naive Bayes is that it requires only a small amount of training data to estimate
the parameters necessary for classification.
A Bayesian network models relationships between features in a much more general way: it makes
no such independence assumption, and every dependence among the variables has to be modeled
explicitly. If it is known what these relationships are, or there is enough data to derive them,
then a Bayesian network may be the appropriate choice.
The following are two examples that illustrate the differences between these two algorithms.
In the first example, a fruit may be considered to be an apple if it is red, round, and about 10
centimeters in diameter. A Naive Bayes classifier considers each of these features to contribute
independently to the probability that this fruit is an apple, regardless of any possible correlations
between the color, roundness, and diameter features.
In the second example, presume that there are two events that could cause grass to be wet: either
the sprinkler is on, or it's raining. Also, presume that the rain has a direct effect on the use of the
sprinkler (i.e. when it rains, the sprinkler is usually not turned on). Then the situation can be
modeled with a Bayesian network. All three variables have two possible values, T (for true) and
F (for false). Two attributes (Sprinkler and Rain) are correlated.
Factor Experimental Design
We selected two factors for our experiment design: Number of Attributes and Noise.
Each of the combinations shown in the 2-Factor Experimental Design table below was run
in an experiment with each algorithm, utilizing 10 random seeds.
Number of Attributes
Due to the number of attributes this dataset contained originally, and even after it was cleaned,
we decided to analyze the effect of decreasing the number of attributes. We compared the results
of experimental runs on the full, cleaned data set versus the results on a reduced, cleaned data
set.
We utilized Weka's InfoGain attribute evaluator. This tool evaluates the worth of each attribute
by measuring the information gain with respect to the class. The output ranked all
attributes, and we selected the top 10. We then compared experiment results from the dataset
containing 22 attributes versus the same dataset containing only the top 10 attributes; a sketch
of this ranking step follows.
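A minimal sketch of this selection step via the Weka Java API (the file name is hypothetical):

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TopAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("diabetic_data_smote.arff"); // hypothetical file name
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval()); // information gain w.r.t. the class
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(10); // keep only the 10 top-ranked attributes
        selector.setSearch(ranker);
        selector.SelectAttributes(data);

        // Indices of the selected attributes (the class index is appended last).
        for (int idx : selector.selectedAttributes()) {
            System.out.println(data.attribute(idx).name());
        }
    }
}
```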
Noise
Noise refers to the modification of original values, analogous to distortion in a voice during a
phone call or fuzziness on a computer screen. We wanted to observe the effect on classification
performance of adding noise to our data.
Noise was selected as our second factor in the experiment design. We added 10% noise only to the
target variable, ran the experiments, and compared the results to those on the dataset without
noise; a sketch of this step follows.
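A minimal sketch of this label-noise step with Weka's AddNoise filter, assuming the class is the last attribute; the file name is hypothetical:

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.AddNoise;

public class AddClassNoise {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("diabetic_data_smote.arff"); // hypothetical file name

        AddNoise noise = new AddNoise();
        noise.setAttributeIndex("last"); // the Readmitted class attribute
        noise.setPercent(10);            // randomly alter 10% of the labels
        noise.setRandomSeed(1);
        noise.setInputFormat(data);

        Instances noisy = Filter.useFilter(data, noise);
        System.out.println("Instances with noisy labels: " + noisy.numInstances());
    }
}
```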
2-Factor Experimental Design Table

|             | ALL ATTRIBUTES                | SELECTED ATTRIBUTES                |
|-------------|-------------------------------|------------------------------------|
| NO NOISE    | C1: All Attributes & No Noise | C3: Selected Attributes & No Noise |
| NOISE (10%) | C2: All Attributes & Noise    | C4: Selected Attributes & Noise    |
Experiments
As previously mentioned, our experiment design was composed of two factors (Selected
Attributes and Noise), giving us four different sets of experiments to run:
•   C1: All Attributes & No Noise
•   C2: All Attributes & Noise
•   C3: Selected Attributes & No Noise
•   C4: Selected Attributes & Noise
Combination Sets with Each Classifier

| Set | Experiment | Description |
|-----|------------|-------------|
| C1 | E1 | Performance of J48 for All Attributes, No Noise |
| C1 | E2 | Performance of Naïve Bayes for All Attributes, No Noise |
| C1 | E3 | Performance of Bayes Net for All Attributes, No Noise |
| C2 | E4 | Performance of J48 for All Attributes, 10% Noise |
| C2 | E5 | Performance of Naïve Bayes for All Attributes, 10% Noise |
| C2 | E6 | Performance of Bayes Net for All Attributes, 10% Noise |
| C3 | E7 | Performance of J48 for Selected Attributes, No Noise |
| C3 | E8 | Performance of Naïve Bayes for Selected Attributes, No Noise |
| C3 | E9 | Performance of Bayes Net for Selected Attributes, No Noise |
| C4 | E10 | Performance of J48 for Selected Attributes, 10% Noise |
| C4 | E11 | Performance of Naïve Bayes for Selected Attributes, 10% Noise |
| C4 | E12 | Performance of Bayes Net for Selected Attributes, 10% Noise |
Each of the experiments E1, E2, …, E12 was run 10 separate times with a different seed each
time, ensuring that the algorithm would use a slightly different training data set on each run.
For each experiment, the percentage split was 66% training and 34% testing. For each of C1-C4,
we used three different algorithms, as the sketch after this list illustrates:
•   Experiments E1, E4, E7, and E10 use the J48 algorithm.
•   Experiments E2, E5, E8, and E11 use the Naïve Bayes algorithm.
•   Experiments E3, E6, E9, and E12 use the Bayes Net algorithm.
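A minimal sketch of this evaluation loop via the Weka Java API, assuming one ARFF file per C-set (the file name is hypothetical; only one set is shown):

```java
import java.util.Random;

import weka.classifiers.AbstractClassifier;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.BayesNet;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RunExperiments {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("c1_all_attributes_no_noise.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] classifiers = { new J48(), new NaiveBayes(), new BayesNet() };
        for (Classifier c : classifiers) {
            for (int seed = 1; seed <= 10; seed++) {
                // 66/34 percentage split, shuffled with the current seed.
                Instances shuffled = new Instances(data);
                shuffled.randomize(new Random(seed));
                int trainSize = (int) Math.round(shuffled.numInstances() * 0.66);
                Instances train = new Instances(shuffled, 0, trainSize);
                Instances test = new Instances(shuffled, trainSize,
                        shuffled.numInstances() - trainSize);

                // Fresh copy so runs do not reuse a previously built model.
                Classifier copy = AbstractClassifier.makeCopy(c);
                copy.buildClassifier(train);
                Evaluation eval = new Evaluation(train);
                eval.evaluateModel(copy, test);
                System.out.printf("%s seed %d: %.4f%%%n",
                        c.getClass().getSimpleName(), seed, eval.pctCorrect());
            }
        }
    }
}
```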
The following tables are the results of the experiments conducted:
Results Tables

| Run/Seed | E1 (J48) | E2 (Naïve Bayes) | E3 (Bayes Net) | E4 (J48) |
|----------|----------|------------------|----------------|----------|
| 1 | 57.3161 | 56.3468 | 64.6274 | 53.5954 |
| 2 | 57.6682 | 56.4794 | 63.8679 | 53.6364 |
| 3 | 57.5814 | 56.4673 | 64.2633 | 53.2554 |
| 4 | 57.7960 | 56.7085 | 64.4418 | 53.7015 |
| 5 | 57.4174 | 56.6458 | 63.9474 | 53.4338 |
| 6 | 57.7502 | 57.0051 | 64.5816 | 53.2988 |
| 7 | 57.4680 | 56.7519 | 64.2416 | 53.1903 |
| 8 | 57.5838 | 56.7977 | 63.8968 | 53.4893 |
| 9 | 57.9696 | 57.1039 | 63.9354 | 53.7497 |
| 10 | 58.0709 | 57.1208 | 64.4225 | 53.6219 |
| Average | 57.6622 | 56.7427 | 64.2226 | 53.49725 |
| Std Dev | 0.2397 | 0.2704 | 0.2931 | 0.196119 |

| Run/Seed | E5 (Naïve Bayes) | E6 (Bayes Net) | E7 (J48) | E8 (Naïve Bayes) |
|----------|------------------|----------------|----------|------------------|
| 1 | 52.5826 | 59.4695 | 57.415 | 55.2882 |
| 2 | 52.9274 | 59.3369 | 57.4849 | 55.4642 |
| 3 | 52.6405 | 59.5611 | 56.87 | 55.4618 |
| 4 | 52.9708 | 59.8143 | 57.3113 | 55.469 |
| 5 | 53.0769 | 59.5322 | 56.776 | 55.2472 |
| 6 | 53.0311 | 59.5274 | 57.1907 | 56.0043 |
| 7 | 52.9539 | 59.4406 | 57.2631 | 55.6137 |
| 8 | 52.9395 | 59.2669 | 57.2004 | 55.416 |
| 9 | 53.2771 | 59.3851 | 57.2438 | 55.5052 |
| 10 | 53.3856 | 59.6889 | 57.1739 | 55.5341 |
| Average | 52.97854 | 59.50229 | 57.19291 | 55.50037 |
| Std Dev | 0.245655 | 0.162800 | 0.219753 | 0.207621 |

| Run/Seed | E9 (Bayes Net) | E10 (J48) | E11 (Naïve Bayes) | E12 (Bayes Net) |
|----------|----------------|-----------|-------------------|-----------------|
| 1 | 55.2882 | 52.6887 | 51.4082 | 51.4082 |
| 2 | 55.4642 | 53.511 | 51.9701 | 51.9701 |
| 3 | 55.4618 | 52.9853 | 51.8519 | 51.8519 |
| 4 | 55.469 | 53.1769 | 51.9471 | 51.9471 |
| 5 | 55.2472 | 52.7538 | 51.6639 | 51.6639 |
| 6 | 56.0043 | 52.9829 | 51.6831 | 51.6831 |
| 7 | 55.6137 | 52.7213 | 51.6772 | 51.6772 |
| 8 | 55.416 | 52.7646 | 51.6674 | 51.6674 |
| 9 | 55.5052 | 53.1457 | 51.7206 | 51.7206 |
| 10 | 55.5341 | 53.1022 | 51.9471 | 51.9471 |
| Average | 55.50037 | 52.9832 | 51.7537 | 51.7537 |
| Std Dev | 0.207621 | 0.2475 | 0.1668 | 0.1668 |
Summary of Results

Results of Experiments Graph: Accuracy Averages (%)

| Classifier | C1 | C2 | C3 | C4 |
|------------|----|----|----|----|
| Naïve Bayes | 56.74272 | 52.97854 | 55.50037 | 51.7537 |
| J48 | 57.66216 | 53.49725 | 57.19291 | 52.9832 |
| Bayes Net | 64.22257 | 59.50229 | 63.85893 | 59.1443 |

From the results shown above, we can infer that C1 (All Attributes & No Noise) is the best
experiment, since it gives the highest results across all algorithms. Next is C3 (Selected
Attributes & No Noise), which gives slightly lower results that are still significantly
higher than those obtained from C2 (third best) and C4 (fourth best).
When it comes to the accuracy of the algorithms, Bayes Net leads with a significant margin over
J48 and Naïve Bayes across all four experiments. J48 comes in second, performing up to
1.7 percentage points higher than Naïve Bayes across all experiments.
We can say the Selected Attributes used in C3 and C4 are the most relevant because, controlling
for noise, the accuracy of the algorithms declines by less than 1.5 percentage points when
switching from all attributes to only the Selected Attributes.
Standard Errors Graph
[Figure: line graph of standard errors for each classifier across C1-C4]
Looking at the standard errors across experiments, we noticed that C1 stands out with
relatively high values and the highest mean standard error among all four experiments. The mean
standard errors for C2, C3, and C4 are roughly the same.
The line graph of the standard errors shows a different trend for each algorithm.
The area under the J48 curve from C1 to C4 is roughly similar to that of Naïve Bayes, and
both are relatively high. BayesNet, on the other hand, seems to have a slightly
lower standard error in general, with a smaller area under its curve. We can infer that Bayes Net
has the smallest standard error among the three algorithms used.
Analysis and Conclusion
In order to evaluate the performance of our algorithms, we will be using ROC curves.
ROC Curves
A receiver operating characteristic (ROC) curve is a graphical plot that shows the performance
of a binary classifier system as its discrimination threshold is varied. The curve is created by
plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold
settings. The true positive rate is also known as sensitivity, or recall in machine learning.
The false positive rate is also known as the fall-out and can be calculated as (1 - specificity).
To determine which algorithm performs better, we look at the shape of each curve: the closer the
curve follows the Y-axis and then the top border of the plot, the larger the area under it, and
the more accurate the test.
To plot the curves, we made use of Weka's Knowledge Flow tool, loading the following workflow:

Knowledge Flow
[Figure: Weka Knowledge Flow layout used to generate the ROC curves]

The Knowledge Flow loads the specified file, assigns which attribute is the class, chooses
which class value to plot the curve for, and lets us select a percentage split. The three
algorithms are then run on the dataset with the selected parameters, their performance is
recorded, and the results are used to plot the ROC curve. The area under each curve can also be
computed directly, as in the sketch below.
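A minimal sketch of the equivalent AUC computation via the Weka Java API (file name hypothetical; Naïve Bayes shown, class value "NO" as in our dataset):

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RocAuc {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("c1_all_attributes_no_noise.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);
        data.randomize(new Random(1));

        // 66/34 percentage split, matching the experiments.
        int trainSize = (int) Math.round(data.numInstances() * 0.66);
        Instances train = new Instances(data, 0, trainSize);
        Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);

        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(nb, test);

        // Area under the ROC curve for the "NO" class value.
        int noIndex = data.classAttribute().indexOfValue("NO");
        System.out.println("AUC (class NO): " + eval.areaUnderROC(noIndex));
    }
}
```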
The following are the ROC curves for each experiment set when the class value is NO.

ROC Curve Graphs When Readmission is NO
[Figures: ROC curves for C1, C2, C3, and C4]
From the above graphs, we observe that the area under the curve is greater for Naïve Bayes, in
each of the four experiment sets, C1, C2, C3 and C4. We can therefore conclude that Naïve
Bayes is more accurate than the other algorithms, according to these ROC curves.
Additional Analysis
Let's take a look at the confusion matrices we obtained from C1 and C3, which gave us the most
relevant models.

Confusion Matrix Tables for C1 and C3
[Figure: confusion matrices for C1 and C3 with each classifier]
We can conclude from these confusion matrices that Bayes Net gives a higher percentage of
True Positives across class values <30, >30 and NO. Bayes Net therefore appears to be the best
predictor from the point of view of our confusion matrices.
Overall Observations
Considering the average accuracy, average standard deviation, ROC curves, attributes, and
classifier evaluation, we recommend the following for the Readmission of Diabetes Patients
dataset:
•   Class balancing: SMOTE increased overall model accuracy (see the SMOTE Comparison Matrices
below).
•   Classifier: Bayes Net gives the highest accuracy. The Naive Bayes classifier assumes
independence between all attributes given the class, which is seldom true; that is why it is
called "naïve." In contrast, a Bayesian network can model the problem in more detail using
several layers of dependencies. It can track cause-effect relationships among the attributes and
the class, and at the same time calculate and draw the probabilistic graph.
•   Attributes factor: Using all attributes instead of the top 10 gives the highest accuracy.
SMOTE Comparison Matrices
u   Original Data using J48
=== Confusion Matrix ===
a b c <-- classified as
15914 2664 117 | a = NO
8222 3728 141 | b = >30
2396 1371 47 | c = <30
u   After SMOTE 200% using J48
=== Confusion Matrix ===
a b c <-- classified as
13827 2683 1231 | a = NO
7596 3282 1331 | b = >30
3427 1433 6660 | c = <30
u   Original Data using Naives Bayes
=== Confusion Matrix ===
a b c <-- classified as
16009 2138 548 | a = NO
8230 3168 693 | b = >30
2430 927 457 | c = <30
u   After SMOTE 200% using Naives Bayes
=== Confusion Matrix ===
a b c <-- classified as
15294 1439 1008 | a = NO
8445 2092 1672 | b = >30
4721 818 5981 | c = <30
•   Original Data using BayesNet
=== Confusion Matrix ===
a b c <-- classified as
13302 4867 526 | a = NO
5138 6440 513 | b = >30
1662 1800 352 | c = <30
u   After SMOTE 200% using BayesNet
=== Confusion Matrix ===
a b c <-- classified as
12746 4934 61 | a = NO
5802 6312 95 | b = >30
1714 2063 7743 | c = <30
References
•   Beata Strack et al., "Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records," Hindawi Publishing Corporation. http://www.hindawi.com/journals/bmri/2014/781670/tab1/
•   Ian H. Witten, Eibe Frank, and Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd edition, Elsevier, 2011.
•   UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008#
•   Naïve Bayes for Dummies. http://blog.aylien.com/post/120703930533/naive-bayes-for-dummies-a-simple-explanation
•   Steve Lohr, "For Big-Data Scientists, 'Janitor Work' Is Key Hurdle to Insights," The New York Times, August 17, 2014.
•   Tamraparni Dasu and Theodore Johnson, Exploratory Data Mining and Data Quality, Wiley, 2004.
•   Wikipedia: https://en.wikipedia.org/wiki/Data_cleansing, https://en.wikipedia.org/wiki/Receiver_operating_characteristic, https://en.wikipedia.org/wiki/Bayesian_network

Contenu connexe

Tendances

Therapeutic Drug Monitoring
Therapeutic Drug MonitoringTherapeutic Drug Monitoring
Therapeutic Drug MonitoringPravin Prasad
 
Roles & Responsibilities of EC Dr Pramod.pptx
Roles & Responsibilities of EC Dr Pramod.pptxRoles & Responsibilities of EC Dr Pramod.pptx
Roles & Responsibilities of EC Dr Pramod.pptxDrPramod Kumar
 
Detection, reporting and monitoring of ad rs final ppt
Detection, reporting and monitoring of ad rs final pptDetection, reporting and monitoring of ad rs final ppt
Detection, reporting and monitoring of ad rs final pptSabeena Choudhary
 
pharmacogenomics by vaiibhavi
pharmacogenomics by vaiibhavipharmacogenomics by vaiibhavi
pharmacogenomics by vaiibhavishaikhazaroddin
 
7.Safety Monitoring in Clinical Trails.pptx
7.Safety Monitoring in Clinical Trails.pptx7.Safety Monitoring in Clinical Trails.pptx
7.Safety Monitoring in Clinical Trails.pptxbrahmaiahmph
 
FUAT – A Fuzzy Clustering Analysis Tool
FUAT – A Fuzzy Clustering Analysis ToolFUAT – A Fuzzy Clustering Analysis Tool
FUAT – A Fuzzy Clustering Analysis ToolSelman Bozkır
 
Final ethics presentation novoice
Final ethics presentation novoiceFinal ethics presentation novoice
Final ethics presentation novoiceJaclynCNicholson
 
Database Designing in Clinical Data Management
Database Designing in Clinical Data ManagementDatabase Designing in Clinical Data Management
Database Designing in Clinical Data ManagementClinosolIndia
 

Tendances (13)

Medical Billing
Medical BillingMedical Billing
Medical Billing
 
Therapeutic Drug Monitoring
Therapeutic Drug MonitoringTherapeutic Drug Monitoring
Therapeutic Drug Monitoring
 
A researcher's guide to understanding clinical trials
A researcher's guide to understanding clinical trialsA researcher's guide to understanding clinical trials
A researcher's guide to understanding clinical trials
 
Site qualification visit
Site qualification visitSite qualification visit
Site qualification visit
 
Roles & Responsibilities of EC Dr Pramod.pptx
Roles & Responsibilities of EC Dr Pramod.pptxRoles & Responsibilities of EC Dr Pramod.pptx
Roles & Responsibilities of EC Dr Pramod.pptx
 
Detection, reporting and monitoring of ad rs final ppt
Detection, reporting and monitoring of ad rs final pptDetection, reporting and monitoring of ad rs final ppt
Detection, reporting and monitoring of ad rs final ppt
 
pharmacogenomics by vaiibhavi
pharmacogenomics by vaiibhavipharmacogenomics by vaiibhavi
pharmacogenomics by vaiibhavi
 
7.Safety Monitoring in Clinical Trails.pptx
7.Safety Monitoring in Clinical Trails.pptx7.Safety Monitoring in Clinical Trails.pptx
7.Safety Monitoring in Clinical Trails.pptx
 
FUAT – A Fuzzy Clustering Analysis Tool
FUAT – A Fuzzy Clustering Analysis ToolFUAT – A Fuzzy Clustering Analysis Tool
FUAT – A Fuzzy Clustering Analysis Tool
 
Final ethics presentation novoice
Final ethics presentation novoiceFinal ethics presentation novoice
Final ethics presentation novoice
 
Database Designing in Clinical Data Management
Database Designing in Clinical Data ManagementDatabase Designing in Clinical Data Management
Database Designing in Clinical Data Management
 
CDISC-CDASH
CDISC-CDASHCDISC-CDASH
CDISC-CDASH
 
Spontaneous reporting
Spontaneous reporting Spontaneous reporting
Spontaneous reporting
 

En vedette

Group6SDFinal
Group6SDFinalGroup6SDFinal
Group6SDFinalHong Lu
 
KLBConsultingServicesSOHHPaperFINAL
KLBConsultingServicesSOHHPaperFINALKLBConsultingServicesSOHHPaperFINAL
KLBConsultingServicesSOHHPaperFINALHong Lu
 
Classifying Readmissions of Diabetic Patient Encounters
Classifying Readmissions of Diabetic Patient EncountersClassifying Readmissions of Diabetic Patient Encounters
Classifying Readmissions of Diabetic Patient EncountersMayur Srinivasan
 
Health care data - survivial analysis, draft
Health care data - survivial analysis, draftHealth care data - survivial analysis, draft
Health care data - survivial analysis, draftHabet Madoyan
 
5 Vital Tips to Help Reduce Readmissions in Hospitals
5 Vital Tips to Help Reduce Readmissions in Hospitals5 Vital Tips to Help Reduce Readmissions in Hospitals
5 Vital Tips to Help Reduce Readmissions in HospitalsJuran Global
 
Presentation at UHC Annual Meeting
Presentation at UHC  Annual MeetingPresentation at UHC  Annual Meeting
Presentation at UHC Annual MeetingJoel Saltz
 

En vedette (8)

Group6SDFinal
Group6SDFinalGroup6SDFinal
Group6SDFinal
 
KLBConsultingServicesSOHHPaperFINAL
KLBConsultingServicesSOHHPaperFINALKLBConsultingServicesSOHHPaperFINAL
KLBConsultingServicesSOHHPaperFINAL
 
Classifying Readmissions of Diabetic Patient Encounters
Classifying Readmissions of Diabetic Patient EncountersClassifying Readmissions of Diabetic Patient Encounters
Classifying Readmissions of Diabetic Patient Encounters
 
Health care data - survivial analysis, draft
Health care data - survivial analysis, draftHealth care data - survivial analysis, draft
Health care data - survivial analysis, draft
 
5 Vital Tips to Help Reduce Readmissions in Hospitals
5 Vital Tips to Help Reduce Readmissions in Hospitals5 Vital Tips to Help Reduce Readmissions in Hospitals
5 Vital Tips to Help Reduce Readmissions in Hospitals
 
Presentation at UHC Annual Meeting
Presentation at UHC  Annual MeetingPresentation at UHC  Annual Meeting
Presentation at UHC Annual Meeting
 
Does metadata matter?
Does metadata matter?Does metadata matter?
Does metadata matter?
 
Inpatient Management of Hyperglycemia
Inpatient Management of HyperglycemiaInpatient Management of Hyperglycemia
Inpatient Management of Hyperglycemia
 

Similaire à Readmission of Diabetes Patients Report

Predicting Diabetic Readmission Rates: Moving Beyond HbA1c
Predicting Diabetic Readmission Rates: Moving Beyond HbA1cPredicting Diabetic Readmission Rates: Moving Beyond HbA1c
Predicting Diabetic Readmission Rates: Moving Beyond HbA1cDamian R. Mingle, MBA
 
statistics in pharmaceutical sciences
statistics in pharmaceutical sciencesstatistics in pharmaceutical sciences
statistics in pharmaceutical sciencesTechmasi
 
Predictive Modeling: White Paper
Predictive Modeling: White PaperPredictive Modeling: White Paper
Predictive Modeling: White PaperYashi Sarbhai
 
Measurement and Modeling Issues with Adherence to Pharmacotherapy
Measurement and Modeling Issues with Adherence to PharmacotherapyMeasurement and Modeling Issues with Adherence to Pharmacotherapy
Measurement and Modeling Issues with Adherence to PharmacotherapyM. Christopher Roebuck
 
Statistics Introduction In Pharmacy
Statistics Introduction In PharmacyStatistics Introduction In Pharmacy
Statistics Introduction In PharmacyPharmacy Universe
 
MedicalResearch.com: Medical Research Interviews
MedicalResearch.com:  Medical Research InterviewsMedicalResearch.com:  Medical Research Interviews
MedicalResearch.com: Medical Research InterviewsMarie Benz MD FAAD
 
introductoin to Biostatistics ( 1st and 2nd lec ).ppt
introductoin to Biostatistics ( 1st and 2nd lec ).pptintroductoin to Biostatistics ( 1st and 2nd lec ).ppt
introductoin to Biostatistics ( 1st and 2nd lec ).pptDr.Venkata Suresh Ponnuru
 
Using real-world evidence to investigate clinical research questions
Using real-world evidence to investigate clinical research questionsUsing real-world evidence to investigate clinical research questions
Using real-world evidence to investigate clinical research questionsKarin Verspoor
 
Clinical Research Informatics (CRI) Year-in-Review 2014
Clinical Research Informatics (CRI) Year-in-Review 2014Clinical Research Informatics (CRI) Year-in-Review 2014
Clinical Research Informatics (CRI) Year-in-Review 2014Peter Embi
 
Normal and Referance Range
Normal and Referance RangeNormal and Referance Range
Normal and Referance RangePrakash Mishra
 
David Madigan MedicReS World Congress 2014
David Madigan MedicReS World Congress 2014David Madigan MedicReS World Congress 2014
David Madigan MedicReS World Congress 2014MedicReS
 
Therapeutic_Innovation_&_Regulatory_Science-2015-Tantsyura
Therapeutic_Innovation_&_Regulatory_Science-2015-TantsyuraTherapeutic_Innovation_&_Regulatory_Science-2015-Tantsyura
Therapeutic_Innovation_&_Regulatory_Science-2015-TantsyuraVadim Tantsyura
 
Data mining for diabetes readmission final
Data mining for diabetes readmission finalData mining for diabetes readmission final
Data mining for diabetes readmission finalXiayu (Carol) Zeng
 
EXAMINING THE EFFECT OF FEATURE SELECTION ON IMPROVING PATIENT DETERIORATION ...
EXAMINING THE EFFECT OF FEATURE SELECTION ON IMPROVING PATIENT DETERIORATION ...EXAMINING THE EFFECT OF FEATURE SELECTION ON IMPROVING PATIENT DETERIORATION ...
EXAMINING THE EFFECT OF FEATURE SELECTION ON IMPROVING PATIENT DETERIORATION ...IJDKP
 
Confronting Diagnostic Error-Employer
Confronting Diagnostic Error-EmployerConfronting Diagnostic Error-Employer
Confronting Diagnostic Error-EmployerMelissa Kay Palardy
 
EMR as a highly powerful European RWD source
EMR as a highly powerful European RWD sourceEMR as a highly powerful European RWD source
EMR as a highly powerful European RWD sourceIMSHealthRWES
 
Preprint review article letter to all pharmacist 2016 pharmaceutical care la...
Preprint review article letter to all pharmacist 2016 pharmaceutical care  la...Preprint review article letter to all pharmacist 2016 pharmaceutical care  la...
Preprint review article letter to all pharmacist 2016 pharmaceutical care la...M. Luisetto Pharm.D.Spec. Pharmacology
 
(Critical Appraisal Tools Worksheet Template)Evalua.docx
 (Critical Appraisal Tools Worksheet Template)Evalua.docx (Critical Appraisal Tools Worksheet Template)Evalua.docx
(Critical Appraisal Tools Worksheet Template)Evalua.docxShiraPrater50
 

Similaire à Readmission of Diabetes Patients Report (20)

Predicting Diabetic Readmission Rates: Moving Beyond HbA1c
Predicting Diabetic Readmission Rates: Moving Beyond HbA1cPredicting Diabetic Readmission Rates: Moving Beyond HbA1c
Predicting Diabetic Readmission Rates: Moving Beyond HbA1c
 
statistics in pharmaceutical sciences
statistics in pharmaceutical sciencesstatistics in pharmaceutical sciences
statistics in pharmaceutical sciences
 
Predictive Modeling: White Paper
Predictive Modeling: White PaperPredictive Modeling: White Paper
Predictive Modeling: White Paper
 
Measurement and Modeling Issues with Adherence to Pharmacotherapy
Measurement and Modeling Issues with Adherence to PharmacotherapyMeasurement and Modeling Issues with Adherence to Pharmacotherapy
Measurement and Modeling Issues with Adherence to Pharmacotherapy
 
Statistics Introduction In Pharmacy
Statistics Introduction In PharmacyStatistics Introduction In Pharmacy
Statistics Introduction In Pharmacy
 
MedicalResearch.com: Medical Research Interviews
MedicalResearch.com:  Medical Research InterviewsMedicalResearch.com:  Medical Research Interviews
MedicalResearch.com: Medical Research Interviews
 
introductoin to Biostatistics ( 1st and 2nd lec ).ppt
introductoin to Biostatistics ( 1st and 2nd lec ).pptintroductoin to Biostatistics ( 1st and 2nd lec ).ppt
introductoin to Biostatistics ( 1st and 2nd lec ).ppt
 
Using real-world evidence to investigate clinical research questions
Using real-world evidence to investigate clinical research questionsUsing real-world evidence to investigate clinical research questions
Using real-world evidence to investigate clinical research questions
 
Clinical Research Informatics (CRI) Year-in-Review 2014
Clinical Research Informatics (CRI) Year-in-Review 2014Clinical Research Informatics (CRI) Year-in-Review 2014
Clinical Research Informatics (CRI) Year-in-Review 2014
 
American Journal of Emergency & Critical Care Medicine
American Journal of Emergency & Critical Care MedicineAmerican Journal of Emergency & Critical Care Medicine
American Journal of Emergency & Critical Care Medicine
 
Normal and Referance Range
Normal and Referance RangeNormal and Referance Range
Normal and Referance Range
 
David Madigan MedicReS World Congress 2014
David Madigan MedicReS World Congress 2014David Madigan MedicReS World Congress 2014
David Madigan MedicReS World Congress 2014
 
Predictive Medicine
Predictive Medicine Predictive Medicine
Predictive Medicine
 
Therapeutic_Innovation_&_Regulatory_Science-2015-Tantsyura
Therapeutic_Innovation_&_Regulatory_Science-2015-TantsyuraTherapeutic_Innovation_&_Regulatory_Science-2015-Tantsyura
Therapeutic_Innovation_&_Regulatory_Science-2015-Tantsyura
 
Data mining for diabetes readmission final
Data mining for diabetes readmission finalData mining for diabetes readmission final
Data mining for diabetes readmission final
 
EXAMINING THE EFFECT OF FEATURE SELECTION ON IMPROVING PATIENT DETERIORATION ...
EXAMINING THE EFFECT OF FEATURE SELECTION ON IMPROVING PATIENT DETERIORATION ...EXAMINING THE EFFECT OF FEATURE SELECTION ON IMPROVING PATIENT DETERIORATION ...
EXAMINING THE EFFECT OF FEATURE SELECTION ON IMPROVING PATIENT DETERIORATION ...
 
Confronting Diagnostic Error-Employer
Confronting Diagnostic Error-EmployerConfronting Diagnostic Error-Employer
Confronting Diagnostic Error-Employer
 
EMR as a highly powerful European RWD source
EMR as a highly powerful European RWD sourceEMR as a highly powerful European RWD source
EMR as a highly powerful European RWD source
 
Preprint review article letter to all pharmacist 2016 pharmaceutical care la...
Preprint review article letter to all pharmacist 2016 pharmaceutical care  la...Preprint review article letter to all pharmacist 2016 pharmaceutical care  la...
Preprint review article letter to all pharmacist 2016 pharmaceutical care la...
 
(Critical Appraisal Tools Worksheet Template)Evalua.docx
 (Critical Appraisal Tools Worksheet Template)Evalua.docx (Critical Appraisal Tools Worksheet Template)Evalua.docx
(Critical Appraisal Tools Worksheet Template)Evalua.docx
 

Readmission of Diabetes Patients Report

  • 1. Readmission of Diabetes Patients Project Report by Rahmawati Nusantari Maria D. Marroquin Essenam Kakpo Hong Lu Team 10 INSY 5339 – Th. 7 pm-9:50 pm May 12, 2016
  • 2.   2   Table of Contents Problem Domain 3 Data Summary 3 Encounters 3 Features 3 Target Variable 6 Prediction 6 Data Cleaning Process 7 Data Cleaning Tools 7 Missing Values 7 Irrelevant Data 7 Data Imbalance 8 Past Cleaning Efforts 10 Discretization 10 Various SMOTE Percentages 10 Algorithms Utilized 11 Classifiers 11 Comparison of Bayes Classifiers 11 Factor Experimental Design 13 Number of Attributes 13 Noise 13 Experiments 14 Combination Sets with Each Classifier 14 Summary of Results 18 Analysis and Conclusion 20 ROC Curves 20 Additional Analysis 23 Overall Observations 23 References 26
  • 3.   3   Problem Domain Dataset Summary The dataset was obtained from the UCI Machine Learning Repository. It is listed under the name Diabetes 130 – US Hospitals. According to the dataset description, the data has been prepared to analyze factors related to readmission as well as other outcomes pertaining to patients with diabetes. The dataset represents 10 years (1999-2008) of clinical care at 130 U.S. hospitals and integrated delivery networks. This dataset contains 101,766 unique inpatients encounters (instances) with 50 attributes, making the size of this dataset a total of 5,088,300 cells. Encounters (Records) As stated on the UCI’s dataset information page, the dataset contains encounters that satisfied the following criteria: •   It is an inpatient encounter (a hospital admission). •   It is a diabetic encounter, that is, one during which any kind of diabetes was entered to the system as a diagnosis. •   The length of stay was at least 1 day and at most 14 days. •   Laboratory tests were performed during the encounter. •   Medications were administered during the encounter. Features (Attributes) The attributes represent patient and hospital outcomes. This data set mostly contains nominal attributes such as medical specialty and gender, but also includes a few ordinal attributes such as age and weight and continues attributes such as time(days) in hospital and number of medications. The following table list each attribute, its description, and the percentage of missing information pertaining to each attribute.
  • 4.   4   Attributes and Target Variable Table Feature name Type Description and values % missing Encounter ID Numeric Unique identifier of an encounter 0% Patient number Numeric Unique identifier of a patient 0% Race Nominal Values: Caucasian, Asian, African American, Hispanic, and other 2% Gender Nominal Values: male, female, and unknown/invalid 0% Age Nominal Grouped in 10-year intervals: 0, 10), 10, 20), …, 90, 100) 0% Weight Numeric Weight in pounds. 97% Admission type Nominal Integer identifier corresponding to 9 distinct values, for example, emergency, urgent, elective, newborn, and not available 0% Discharge disposition Nominal Integer identifier corresponding to 29 distinct values, for example, discharged to home, expired, and not available 0% Admission source Nominal Integer identifier corresponding to 21 distinct values, for example, physician referral, emergency room, and transfer from a hospital 0% Time in hospital Numeric Integer number of days between admission and discharge 0% Payer code Nominal Integer identifier corresponding to 23 distinct values, for example, Blue Cross/Blue Shield, Medicare, and self-pay 52% Medical specialty Nominal Integer identifier of a specialty of the admitting physician, corresponding to 84 distinct values, for example, cardiology, internal medicine, family/general practice, and surgeon 53% Number of lab procedures Numeric Number of lab tests performed during the encounter 0% Number of procedures Numeric Number of procedures (other than lab tests) performed during the encounter 0% Number of medications Numeric Number of distinct generic names administered during the encounter 0% Number of outpatient visits Numeric Number of outpatient visits of the patient in the year preceding the encounter 0% Number of emergency visits Numeric Number of emergency visits of the patient in the year preceding the encounter 0%
  • 5.   5   Feature name Type Description and values % missing Number of inpatient visits Numeric Number of inpatient visits of the patient in the year preceding the encounter 0% Diagnosis 1 Nominal The primary diagnosis (coded as first three digits of ICD9); 848 distinct values 0% Diagnosis 2 Nominal Secondary diagnosis (coded as first three digits of ICD9); 923 distinct values 0% Diagnosis 3 Nominal Additional secondary diagnosis (coded as first three digits of ICD9); 954 distinct values 1% Number of diagnoses Numeric Number of diagnoses entered to the system 0% Glucose serum test result Nominal Indicates the range of the result or if the test was not taken. Values: “>200,” “>300,” “normal,” and “none” if not measured 0% A1c test result Nominal Indicates the range of the result or if the test was not taken. Values: “>8” if the result was greater than 8%, “>7” if the result was greater than 7% but less than 8%, “normal” if the result was less than 7%, and “none” if not measured. 0% Change of medications Nominal Indicates if there was a change in diabetic medications (either dosage or generic name). Values: “change” and “no change” 0% Diabetes medications Nominal Indicates if there was any diabetic medication prescribed. Values: “yes” and “no” 0% 24 features for medications Nominal For the generic names: metformin, repaglinide, nateglinide, chlorpropamide, glimepiride, acetohexamide, glipizide, glyburide, tolbutamide, pioglitazone, rosiglitazone, acarbose, miglitol, troglitazone, tolazamide, examide, sitagliptin, insulin, glyburide-metformin, glipizide-metformin, glimepiride-pioglitazone, metformin-rosiglitazone, and metformin-pioglitazone, the feature indicates whether the drug was prescribed or there was a change in the dosage. Values: “up” if the dosage was increased during the encounter, “down” if the dosage was decreased, “steady” if the dosage did not change, and “no” if the drug was not prescribed 0% Readmitted Nominal Days to inpatient readmission. Values: “<30” if the patient was readmitted in less than 30 days, “>30” if the patient was readmitted in more than 30 days, and “No” for no record of readmission. 0%
  • 6.   6   Target Variable The last attribute in the previous table is the class attribute, which in this case is Readmission. The distribution of the class attribute is as follows: •   Encounters of patients who were not readmitted (No) to the hospital. There are 54, 864 of such encounters. •   Encounters of patients who were readmitted to the hospital after 30 days of discharge (>30). There are 35,545 of such encounters. •   Encounters of patients who were readmitted to the hospital within 30 days of discharge (<30). There are 11, 357 of such encounters. Prediction We want to predict whether or when diabetes patients will be readmitted to the hospital based on several factors (attributes). >30 No <30 Readmission ?
  • 7.   7   Data Cleaning Process Data cleaning is commonly defined as the process of detecting and correcting corrupt or inaccurate records from a dataset, table, or database.1 Data quality is an important component in any data mining efforts. For this reason, many data scientists spend from 50% to 80% of their time preparing and cleaning their data before it can be mined for insights.2 There are four broad categories of data quality problems: missing data, abnormal data (outliers), departure from models, and goodness-of-fit.3 For our project, our team mainly dealt with missing data. Our team will also address the imbalance in the class variable using SMOTE. Data Cleaning Tools Our team utilized Microsoft Excel to perform the data cleaning. As our guidance to understand the variables and meaning of the data, we consulted the research article that owned the data: “Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records” by Beata Strack et al. Missing Values The journal identified three attributes with the majority of their records missing such as weight (97%), payer code (52%), and medical specialty (53%). Weight was not properly recorded since this experiment was done prior to the HITECH legislation of the American Reinvestment and Recovery Act in 2009, while payer code was deemed irrelevant by the researchers. As a result, these 3 attributes were deleted. There were also 23 attributes that had zero values in 79% to 99% of their records. Those are medications features such as metformin and other generic medications. The zero value indicated that the type of medication was not prescribed to the patient. As a result, all these 23 attributes were deleted. However, insulin was the only medication attribute retained since it had more than 50% of data in its records, and it is considered prevalent in diabetic patient cases. Irrelevant Data The class attribute determines whether a patient is readmitted in the hospital within 30 days, over 30 days, or not readmitted at all. The attribute, discharge disposition, corresponds to 29 distinct values that indicate patients are discharged to home or another hospital, to hospice for terminally- ill patients, or indicate that the patients have passed away.                                                                                                                 1 https://en.wikipedia.org/wiki/Data_cleansing 2 Steve Lohr, The New York Times, August 17, 2014, For Big-Data Scientists, ‘Janitor Work’ is Key Hurdle to Insights. 3 Tamraparni Dasu and Theodore Johnson, Exploratory Data Mining and Data Quality, Wiley, 2004
  • 8.   8   To correctly include only active (alive) patients and not in hospice, we removed records that had Discharge Disposition codes of 11, 13, 14, 19, 20, and 21. These discharge codes matched the instances of patients who were deceased or sent to hospice. This cleaning process removed 2,423 instances. Data Imbalance SMOTE (Synthetic Minority Oversampling Technique) is a filter that samples the data and alters the class distribution. It can be used to adjust the relative frequency between the minority and majority classes in the data. SMOTE does not under-sample the majority classes. Instead, it oversamples the minority class by creating synthetic instances using a K-Nearest-Neighbor approach. The user can specify the oversampling percentage and the number of neighbors to use when creating synthetic instances.4 Our team applied SMOTE in different combinations and ultimately decided to apply a 200% synthetic minority oversample with 3-nearest-neighbors as shown below. SMOTE filter in WEKA                                                                                                                 4 Ian H. Witten, Eibe Frank, Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd edition, Elsevier, 2011
  • 9.   9   The following graphs and matrices represent the comparison of the data before the SMOTE and after the 200% SMOTE applied to the minority class (<30). Class Distribution Graphs Before and After 200% SMOTE Confusion Matrices Before and After 200% SMOTE Using J-48 Using BayesNet     Original  Data   SMOTE  200%   SMOTE  200%   SMOTE  200%  Original  Data   Original  Data  
  • 10.   10   Past Cleaning Efforts Discretization As part of our initial data cleaning efforts, we discretized several nominal attributes in integer identifiers. Those attributes are diagnosis1, diagnosis2, diagnosis3, admission type, discharge disposition, and admission source. The first three attributes (diagnoses) were coded based on ICD9 (International Statistical Code of Diseases and Related Health Problems). For example, code IDs 390-459 and 785 are diseases of the circulatory system. After converting all the integer identifiers into nominal values, the results did not show significant improvement. Various SMOTE Percentages We also applied different SMOTE percentages mainly to the <30 minority class. However, there was no significant improvement. We ultimately decided to apply a 200% increase on the <30 minority class as mentioned earlier. Class Distribution Graphs with Different SMOTE Percentages 350%,  <30   350%  on  <30,  50%  on  >30   500%,  <30  250%,  <30  
Algorithms Utilized

After data cleaning and pre-processing, we selected the algorithms for our experiments. We chose three classifiers for the experiment design:

Classifiers

•   J48. Weka's implementation of the C4.5 decision tree learner; it builds a tree by repeatedly selecting the attribute that best splits the data, with the aim of maximizing prediction accuracy.
•   Naïve Bayes. A probabilistic classifier that applies Bayes' theorem under the assumption that attributes are conditionally independent given the class.
•   Bayes Net. A probabilistic graphical model that represents a set of random variables and their conditional dependencies through a directed acyclic graph.

Comparison of Bayes Classifiers

Since we selected two Bayes classifiers, we compared Naïve Bayes and Bayes Net. A Naïve Bayes classifier is a simple model corresponding to a particular class of Bayesian network: one in which all the features are conditionally independent of each other given the class. Because of this assumption, there are certain problems Naïve Bayes cannot model. An advantage of Naïve Bayes is that it requires only a small amount of training data to estimate the parameters necessary for classification. A Bayesian network, in contrast, models relationships between features in a very general way and carries no such independence assumption; all the dependencies have to be modeled explicitly. If these relationships are known, or there is enough data to derive them, a Bayesian network may be the more appropriate choice. The following two examples illustrate the differences between the algorithms.

In the first example, a fruit may be considered to be an apple if it is red, round, and about 10 centimeters in diameter. A Naïve Bayes classifier considers each of these features to contribute independently to the probability that the fruit is an apple, regardless of any possible correlations between the color, roundness, and diameter features.
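In symbols, Naïve Bayes scores the apple class simply by multiplying independent per-feature likelihoods:

    P(apple | red, round, 10 cm)  ∝  P(apple) · P(red | apple) · P(round | apple) · P(10 cm | apple)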
In the second example, presume there are two events that could cause the grass to be wet: either the sprinkler is on, or it is raining. Presume also that rain has a direct effect on the use of the sprinkler (when it rains, the sprinkler is usually not turned on). This situation can be modeled with a Bayesian network in which all three variables have two possible values, T (true) and F (false), and two of the variables (Sprinkler and Rain) are correlated.
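The network encodes the joint distribution as the product of each node's conditional probability given its parents, and can then answer diagnostic queries. As a worked illustration, using the probabilities from the Wikipedia Bayesian network article cited in the References (P(R=T) = 0.2; P(S=T|R=T) = 0.01, P(S=T|R=F) = 0.4; P(G=T|S,R) = 0.99, 0.9, 0.8, 0.0 for (S,R) = (T,T), (T,F), (F,T), (F,F)):

    P(G, S, R) = P(R) · P(S | R) · P(G | S, R)

    P(R=T | G=T) = P(G=T, R=T) / P(G=T)
                 = 0.2 × (0.01 × 0.99 + 0.99 × 0.8)
                   / [0.2 × (0.01 × 0.99 + 0.99 × 0.8) + 0.8 × (0.4 × 0.9 + 0.6 × 0.0)]
                 = 0.16038 / 0.44838 ≈ 0.358

Given only that the grass is wet, the network infers roughly a 36% chance that it rained. This is the kind of query that depends on the Sprinkler-Rain dependency, which Naïve Bayes has no way to represent.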
Factor Experimental Design

We selected two factors for our experiment design: number of attributes and noise. Each combination in the 2-factor experimental design table below was run with each algorithm, using 10 random seeds.

Number of Attributes

Because of the large number of attributes this dataset contains, both originally and after cleaning, we decided to analyze the effect of reducing the attribute count. We compared experimental runs on the full, cleaned dataset against runs on a reduced, cleaned dataset. We used Weka's InfoGain attribute evaluator, which scores each attribute by the information gain it provides with respect to the class. The evaluator ranked all attributes, and we selected the top 10; we then compared results for the dataset with all 22 attributes against the same dataset restricted to those top 10 attributes.

Noise

Noise refers to the modification of original values, in the way a phone call may carry voice distortion or a computer screen may be too fuzzy to read. We wanted to observe the effect of noise on classification performance, so noise was our second factor. We added 10% noise to the target variable only (randomly changing the class label of 10% of the instances), ran the experiments, and compared the results to the dataset without noise. A sketch of both factor manipulations follows the table below.

2-Factor Experimental Design Table

                 ALL ATTRIBUTES                   SELECTED ATTRIBUTES
NO NOISE         C1: All Attributes & No Noise    C3: Selected Attributes & No Noise
NOISE (10%)      C2: All Attributes & Noise       C4: Selected Attributes & Noise
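Both manipulations were done in Weka (the InfoGain evaluator and the AddNoise filter). A rough Python equivalent is sketched below, assuming an integer-encoded feature matrix X_enc, a label array y, and a feature_names list; mutual information is the same quantity InfoGain measures.

    import numpy as np
    from sklearn.feature_selection import mutual_info_classif

    # Rank attributes by information gain with respect to the class
    gains = mutual_info_classif(X_enc, y, discrete_features=True, random_state=1)
    ranking = sorted(zip(gains, feature_names), reverse=True)
    top10 = [name for _, name in ranking[:10]]

    # Inject 10% class noise by reassigning random labels
    rng = np.random.default_rng(1)
    flip = rng.choice(len(y), size=int(0.10 * len(y)), replace=False)
    labels = np.array(["NO", ">30", "<30"])
    y_noisy = np.array(y, copy=True)
    # Weka's AddNoise filter randomizes the chosen labels; this simple
    # sketch may occasionally reassign an instance its original label.
    y_noisy[flip] = rng.choice(labels, size=len(flip))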
Experiments

As previously mentioned, our experiment design was composed of two factors (attribute selection and noise), giving us four sets of experiments to run:

•   C1: All Attributes & No Noise
•   C2: All Attributes & Noise
•   C3: Selected Attributes & No Noise
•   C4: Selected Attributes & Noise

Combination Sets with Each Classifier

C1
E1   Performance of J48 for All Attributes, No Noise
E2   Performance of Naïve Bayes for All Attributes, No Noise
E3   Performance of Bayes Net for All Attributes, No Noise

C2
E4   Performance of J48 for All Attributes, 10% Noise
E5   Performance of Naïve Bayes for All Attributes, 10% Noise
E6   Performance of Bayes Net for All Attributes, 10% Noise

C3
E7   Performance of J48 for Selected Attributes, No Noise
E8   Performance of Naïve Bayes for Selected Attributes, No Noise
E9   Performance of Bayes Net for Selected Attributes, No Noise

C4
E10  Performance of J48 for Selected Attributes, 10% Noise
E11  Performance of Naïve Bayes for Selected Attributes, 10% Noise
E12  Performance of Bayes Net for Selected Attributes, 10% Noise

Each of the experiments E1 through E12 was run 10 separate times with a different seed each time, ensuring that the algorithm trained on a slightly different training set in each run. Every experiment used a percentage split of 66% training and 34% testing. Across C1-C4, the three algorithms were assigned as follows:

•   Experiments E1, E4, E7, and E10 use the J48 algorithm.
•   Experiments E2, E5, E8, and E11 use the Naïve Bayes algorithm.
•   Experiments E3, E6, E9, and E12 use the Bayes Net algorithm.

A sketch of this experimental loop appears below.
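The experiments were run in Weka's Experimenter. As an illustration only, the seed-and-split loop could be reproduced in Python with scikit-learn roughly as follows; X_enc is assumed to be an ordinal-encoded feature matrix with non-negative integer codes, and y the label array. scikit-learn offers no general Bayesian network classifier, so only decision-tree and naïve Bayes analogues are shown; Weka itself (or a library such as pgmpy) would be needed for Bayes Net.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import CategoricalNB
    from sklearn.metrics import accuracy_score

    classifiers = {
        "J48-like decision tree": DecisionTreeClassifier(),
        # min_categories guards against category codes absent from a split
        "Naive Bayes": CategoricalNB(min_categories=int(X_enc.max()) + 1),
    }
    results = {name: [] for name in classifiers}
    for seed in range(1, 11):  # 10 runs, seeds 1 through 10
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_enc, y, test_size=0.34, random_state=seed)  # 66/34 split
        for name, clf in classifiers.items():
            clf.fit(X_tr, y_tr)
            results[name].append(accuracy_score(y_te, clf.predict(X_te)))
    for name, accs in results.items():
        print(f"{name}: mean={np.mean(accs):.4f}, std={np.std(accs, ddof=1):.4f}")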
The results of the experiments are shown in the following tables.

Results Tables (accuracy %; the run number equals the random seed)

Seed       E1 (J48)   E2 (Naïve Bayes)   E3 (Bayes Net)   E4 (J48)
1          57.3161    56.3468            64.6274          53.5954
2          57.6682    56.4794            63.8679          53.6364
3          57.5814    56.4673            64.2633          53.2554
4          57.7960    56.7085            64.4418          53.7015
5          57.4174    56.6458            63.9474          53.4338
6          57.7502    57.0051            64.5816          53.2988
7          57.4680    56.7519            64.2416          53.1903
8          57.5838    56.7977            63.8968          53.4893
9          57.9696    57.1039            63.9354          53.7497
10         58.0709    57.1208            64.4225          53.6219
Average    57.6622    56.7427            64.2226          53.49725
Std Dev    0.2397     0.2704             0.2931           0.196119
Results Tables (continued)

Seed       E5 (Naïve Bayes)   E6 (Bayes Net)   E7 (J48)      E8 (Naïve Bayes)
1          52.5826            59.4695          57.415        55.2882
2          52.9274            59.3369          57.4849       55.4642
3          52.6405            59.5611          56.87         55.4618
4          52.9708            59.8143          57.3113       55.469
5          53.0769            59.5322          56.776        55.2472
6          53.0311            59.5274          57.1907       56.0043
7          52.9539            59.4406          57.2631       55.6137
8          52.9395            59.2669          57.2004       55.416
9          53.2771            59.3851          57.2438       55.5052
10         53.3856            59.6889          57.1739       55.5341
Average    52.97854           59.50229         57.19291      55.50037
Std Dev    0.245655392        0.162799778      0.219752874   0.207621349
Results Tables (continued)

Seed       E9 (Bayes Net)   E10 (J48)   E11 (Naïve Bayes)   E12 (Bayes Net)
1          55.2882          52.6887     51.4082             51.4082
2          55.4642          53.511      51.9701             51.9701
3          55.4618          52.9853     51.8519             51.8519
4          55.469           53.1769     51.9471             51.9471
5          55.2472          52.7538     51.6639             51.6639
6          56.0043          52.9829     51.6831             51.6831
7          55.6137          52.7213     51.6772             51.6772
8          55.416           52.7646     51.6674             51.6674
9          55.5052          53.1457     51.7206             51.7206
10         55.5341          53.1022     51.9471             51.9471
Average    55.50037         52.9832     51.7537             51.7537
Std Dev    0.207621349      0.2475      0.1668              0.1668
Summary of Results

Results of Experiments Graph

[Figure: bar graph of average accuracy (%) per combination; the underlying values are tabulated below]

              C1         C2         C3         C4
Naïve Bayes   56.74272   52.97854   55.50037   51.7537
J48           57.66216   53.49725   57.19291   52.9832
Bayes Net     64.22257   59.50229   63.85893   59.1443

From the accuracy averages above, C1 (All Attributes & No Noise) is the best combination, giving the highest results across all algorithms. C3 (Selected Attributes & No Noise) comes next with slightly lower accuracy, but still markedly higher than C2 (third best) and C4 (fourth best). As for the algorithms themselves, Bayes Net leads by a significant margin over J48 and Naïve Bayes across all four combinations. J48 comes in second, performing up to 1.7 percentage points higher than Naïve Bayes. The top 10 Selected Attributes used in C3 and C4 appear to capture most of the predictive signal: holding the noise level constant, accuracy declines by less than 1.5 percentage points when moving from all attributes to the selected ones.
Standard Errors Graph

[Figure: line graph of standard errors per combination (C1-C4) for each classifier]

Looking at the standard errors across the four combinations, C1 stands out with relatively high values and the highest mean standard error. The mean standard errors for C2, C3, and C4 are roughly the same. The line for each algorithm shows a different trend: the area under the J48 curve from C1 to C4 is roughly similar to that of Naïve Bayes, and both are relatively high, while Bayes Net sits somewhat lower overall, with a smaller area under its curve. We infer that Bayes Net has the smallest standard error of the three algorithms used.
Analysis and Conclusion

To evaluate the performance of our algorithms, we use ROC curves.

ROC Curves

A receiver operating characteristic (ROC) curve is a graphical plot that shows the performance of a binary classifier as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true positive rate is also known as sensitivity, or recall in machine learning; the false positive rate is also known as the fall-out and can be calculated as (1 - specificity). To determine which algorithm performs better, we look at the shape of each curve: the closer the curve follows the Y-axis and then the top border of the plot, the larger the area under it, and the more accurate the test. To plot the curves, we used Weka's Knowledge Flow tool with the following workflow:

Knowledge Flow

[Figure: Weka Knowledge Flow workflow used to generate the ROC curves]

The workflow loads the specified file, assigns which attribute is treated as the class, chooses which class value to plot the curve for, and applies a percentage split. The three algorithms are then run on the dataset with the selected parameters, their performance is recorded, and the results are used to plot the ROC curve.
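Outside Weka, the same one-vs-rest curve for the NO class can be sketched with scikit-learn and matplotlib. This assumes a fitted classifier clf that supports predict_proba and a held-out split X_te, y_te (with y_te as a NumPy array of labels); none of these names come from our Weka workflow.

    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_curve, auc

    # Score each test instance by its predicted probability of class "NO"
    class_idx = list(clf.classes_).index("NO")
    scores = clf.predict_proba(X_te)[:, class_idx]

    # One-vs-rest: treat "NO" as the positive class
    fpr, tpr, _ = roc_curve(y_te == "NO", scores)
    plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.3f}")
    plt.plot([0, 1], [0, 1], "k--")  # chance diagonal
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
    plt.show()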
The following are the ROC curves for each experiment set when the class value is NO.

ROC Curve Graphs When Readmission is NO

[Figure: ROC curves for C1 (All Attributes & No Noise)]

[Figure: ROC curves for C2 (All Attributes & Noise)]
ROC Curve Graphs When Readmission is NO

[Figure: ROC curves for C3 (Selected Attributes & No Noise)]

[Figure: ROC curves for C4 (Selected Attributes & Noise)]

From these graphs, we observe that the area under the curve is greatest for Naïve Bayes in each of the four experiment sets, C1 through C4. According to these ROC curves, therefore, Naïve Bayes is the most accurate of the three algorithms for the NO class.
Additional Analysis

Let us take a look at the confusion matrices obtained from C1 and C3, the combinations that gave us the most relevant models.

Confusion Matrix Tables for C1 and C3

[Tables: confusion matrices for C1 and C3 for each classifier]

From these confusion matrices we conclude that Bayes Net gives a higher percentage of true positives across the class values <30, >30, and NO. Bayes Net therefore appears to be the best predictor from the point of view of our confusion matrices.

Overall Observations

Considering the average accuracy, average standard deviation, ROC curves, attribute factor, and classifier evaluation, we recommend the following for the Readmission of Diabetes Patients dataset:

•   Class balancing: SMOTE increased overall model accuracy (see the SMOTE Comparison Matrices below and the recall computation that follows them).
•   Classifier: Bayes Net gives the highest accuracy. A Naïve Bayes classifier (NBC) assumes independence between all attributes given the class, which is seldom true; that is why it is called "naïve." In contrast, a Bayesian network can build a more detailed (and more realistic) model of the problem using several layers of dependencies: it can capture cause-and-effect relationships among the attributes and the class, and represent them in a probabilistic graph.
•   Attributes factor: using all attributes instead of the top 10 yields the highest accuracy.
SMOTE Comparison Matrices

•   Original Data using J48

=== Confusion Matrix ===

     a      b      c    <-- classified as
 15914   2664    117  |  a = NO
  8222   3728    141  |  b = >30
  2396   1371     47  |  c = <30

•   After SMOTE 200% using J48

=== Confusion Matrix ===

     a      b      c    <-- classified as
 13827   2683   1231  |  a = NO
  7596   3282   1331  |  b = >30
  3427   1433   6660  |  c = <30

•   Original Data using Naïve Bayes

=== Confusion Matrix ===

     a      b      c    <-- classified as
 16009   2138    548  |  a = NO
  8230   3168    693  |  b = >30
  2430    927    457  |  c = <30

•   After SMOTE 200% using Naïve Bayes

=== Confusion Matrix ===

     a      b      c    <-- classified as
 15294   1439   1008  |  a = NO
  8445   2092   1672  |  b = >30
  4721    818   5981  |  c = <30
•   Original Data using BayesNet

=== Confusion Matrix ===

     a      b      c    <-- classified as
 13302   4867    526  |  a = NO
  5138   6440    513  |  b = >30
  1662   1800    352  |  c = <30

•   After SMOTE 200% using BayesNet

=== Confusion Matrix ===

     a      b      c    <-- classified as
 12746   4934     61  |  a = NO
  5802   6312     95  |  b = >30
  1714   2063   7743  |  c = <30
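The benefit of SMOTE is clearest for the <30 class. A short sketch that computes the per-class true-positive rate (recall) from the two BayesNet matrices above illustrates this; the matrices are copied verbatim from the tables, and only NumPy is assumed.

    import numpy as np

    # BayesNet confusion matrices from above (rows = actual NO, >30, <30)
    original = np.array([[13302, 4867,  526],
                         [ 5138, 6440,  513],
                         [ 1662, 1800,  352]])
    smote200 = np.array([[12746, 4934,   61],
                         [ 5802, 6312,   95],
                         [ 1714, 2063, 7743]])

    for name, m in [("original", original), ("SMOTE 200%", smote200)]:
        recall = m.diagonal() / m.sum(axis=1)  # per-class true-positive rate
        print(name, dict(zip(["NO", ">30", "<30"], recall.round(3))))

    # Recall for <30 rises from roughly 0.09 to roughly 0.67 after SMOTE.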
References

•   Beata Strack et al., "Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records," Hindawi Publishing Corporation, http://www.hindawi.com/journals/bmri/2014/781670/tab1/
•   Ian H. Witten, Eibe Frank, and Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd edition, Elsevier, 2011.
•   UCI Machine Learning Repository, Diabetes 130-US hospitals for years 1999-2008, https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008#
•   "Naïve Bayes for Dummies: A Simple Explanation," http://blog.aylien.com/post/120703930533/naive-bayes-for-dummies-a-simple-explanation
•   Steve Lohr, "For Big-Data Scientists, 'Janitor Work' Is Key Hurdle to Insights," The New York Times, August 17, 2014.
•   Tamraparni Dasu and Theodore Johnson, Exploratory Data Mining and Data Quality, Wiley, 2004.
•   Wikipedia: https://en.wikipedia.org/wiki/Data_cleansing; https://en.wikipedia.org/wiki/Receiver_operating_characteristic; https://en.wikipedia.org/wiki/Bayesian_network