1. Heart Data | 1
Christy Lee
Dana Alswyan
Thuan Nguyen
Business Analytics Project: Data on Heart Diseases
EXECUTIVE OVERVIEW
We analyzed two datasets: LA Heart, on which we ran a logistic regression, and Cardiovas, on
which we ran correlation and regression analyses. From the LA Heart data, we found that systolic
blood pressure is significantly associated with having heart disease. The Cardiovas data shows
that the dependent variables hemoglobin A1C and blood glucose each have independent variables
that explain a moderate to high share of their variation. Therefore, to explain the results of
hemoglobin A1C, one must refer to levels of blood glucose and cholesterol, and take into account
the patient’s age and waist size. Likewise, to explain outcomes of blood glucose, one must
consider hemoglobin A1C levels as well as age and weight.
LA HEART - LOGISTIC REGRESSION
The data for LA Heart was recorded in 1950. We analyzed four variables: cholesterol,
diastolic blood pressure, systolic blood pressure, and socioeconomic status, against whether the
patients under analysis are ill in terms of complications found in their cardiovascular system. The
dataset has 200 observations: 171 are healthy patients and 29 are ill patients.
We ran a binary logistic regression to generate a probabilistic classification from the
analyzed variables. The convergence criterion was satisfied. Our dependent variable was whether
the patient was ill (with some sort of heart condition), and the independent variables were
systolic blood pressure (mm Hg), diastolic blood pressure (mm Hg), cholesterol (mg%), and
socioeconomic status (ordinal; 1=high,...,5=low).
The overall test of the model has a p value of < 0.0001, as seen in Figure 1. Since this is
less than 0.05, the model as a whole is statistically significant, which makes its predictions
valuable for practical reference.
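To make the link between log odds and probability concrete, the following is a minimal Python sketch of how a fitted binary logistic model turns a systolic blood pressure reading into a probability of illness. The intercept and coefficient are hypothetical values chosen for illustration; they are not the estimates SAS produced for the LA Heart data, and the function names are assumptions of this sketch.

```python
import math


def probability_from_log_odds(log_odds: float) -> float:
    """Convert log odds to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical coefficients for illustration only (NOT the fitted LA Heart values).
INTERCEPT = -8.0
COEF_SBP = 0.05  # change in log odds per 1 mm Hg of systolic blood pressure


def predicted_probability(sbp_mm_hg: float) -> float:
    """Predicted probability of illness for a given systolic blood pressure."""
    return probability_from_log_odds(INTERCEPT + COEF_SBP * sbp_mm_hg)


if __name__ == "__main__":
    for sbp in (120, 140, 160):
        print(sbp, round(predicted_probability(sbp), 3))
```

With a positive coefficient, each additional mm Hg raises the log odds by a fixed amount, so the predicted probability rises along an S-shaped curve rather than a straight line.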
Figure 1. LA Heart Predictive Power
Our findings are also largely concordant. A pair made up of one ill and one healthy patient is
concordant when the model assigns the ill patient the higher log odds of illness, and most pairs
were ordered correctly, as seen in Figure 2.
Figure 2. LA Heart Concordance
The ROC curve shows an area under the curve of 0.9141, which is classified as excellent (A).
This attests to the accuracy of the model as a whole.
Figure 3. LA Heart ROC Curve
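The concordance and area-under-the-curve figures above are closely linked: the share of ill/healthy pairs that the model ranks correctly, with ties counted as one half, equals the area under the ROC curve. With 29 ill and 171 healthy patients there are 29 × 171 = 4,959 such pairs. The sketch below counts pairs for a handful of hypothetical predicted probabilities; these are not the LA Heart predictions, and the function name is an assumption of this sketch.

```python
from itertools import product


def concordance_statistic(probs_ill, probs_healthy):
    """Return (c, percent_concordant) over all ill/healthy pairs.

    c counts each tied pair as one half and equals the area under the ROC curve.
    """
    concordant = discordant = tied = 0
    for p_ill, p_healthy in product(probs_ill, probs_healthy):
        if p_ill > p_healthy:
            concordant += 1  # ill patient ranked above healthy patient
        elif p_ill < p_healthy:
            discordant += 1  # ill patient ranked below healthy patient
        else:
            tied += 1
    pairs = concordant + discordant + tied
    c = (concordant + 0.5 * tied) / pairs
    return c, 100.0 * concordant / pairs


if __name__ == "__main__":
    # Hypothetical predicted probabilities of illness, for illustration only.
    ill = [0.9, 0.6, 0.4]
    healthy = [0.1, 0.2, 0.4, 0.5]
    c, percent_concordant = concordance_statistic(ill, healthy)
    print(round(c, 4), round(percent_concordant, 1))
```

A value of c near 0.5 would mean the model ranks patients no better than chance, while a value near 1 means ill patients almost always receive higher predicted probabilities than healthy ones.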
LA HEART - KEY FINDINGS
The 95% confidence interval for the SBP_50 odds ratio lies entirely above 1 (1 to 1 odds), so
we are confident that the odds go up with SBP_50; that is, its log odds coefficient is positive.
In Figure 4, SBP_50 is the only variable whose interval excludes 1, so we can infer that as
SBP_50 goes up, the odds of having heart problems also increase. The 95% confidence intervals of
DBP_50, SES, and CHOL_50 lie on both sides of the 1 to 1 odds line, so we aren’t confident that
the odds go either up or down with these three variables; their log odds coefficients are not
significant. The confidence interval for SES is especially broad, so it is the variable with the
least certain odds ratio.
Figure 4. LA Heart Odds Ratios
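The reading of Figure 4 rests on a simple rule: an odds ratio is the exponential of a log odds coefficient, and the variable is significant at the 5% level when its 95% interval excludes 1. The sketch below illustrates that rule with a Wald interval (coefficient ± 1.96 standard errors). The coefficients and standard errors are hypothetical, not the LA Heart estimates, and SAS may have been configured to report a different interval type.

```python
import math


def odds_ratio_ci(coef: float, std_err: float, z: float = 1.96):
    """Odds ratio and Wald confidence limits from a log odds coefficient."""
    return (
        math.exp(coef),
        math.exp(coef - z * std_err),
        math.exp(coef + z * std_err),
    )


def excludes_one(lower: float, upper: float) -> bool:
    """True when the whole interval lies on one side of the 1 to 1 odds line."""
    return lower > 1.0 or upper < 1.0


if __name__ == "__main__":
    # Hypothetical (coefficient, standard error) pairs, for illustration only.
    examples = {"narrow_positive": (0.04, 0.012), "wide_uncertain": (-0.10, 0.30)}
    for name, (coef, se) in examples.items():
        ratio, low, high = odds_ratio_ci(coef, se)
        print(name, round(ratio, 3), round(low, 3), round(high, 3), excludes_one(low, high))
```

A large standard error widens the interval on both sides, which is why a variable such as SES can have a broad interval that straddles 1.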
Also, the influence diagnostics show that there are some data points that influence the
confidence limits of each variable. For example, the influence diagnostic graph for SES shows
many data points that are far from 0. Because of this wide spread, the confidence interval for
SES is the widest (least confident), as seen in Figure 5.
Figure 5. LA Heart Influence Diagnostics
LA HEART - RESULTS
Our results show that systolic blood pressure is the variable most strongly related to the
probability of having heart disease. We see this from the influence diagnostics, where the spread
for SBP_50 is tighter than for the other variables. Also, as seen in Figure 4, both 95%
confidence limits for SBP_50 are above 1 and close together, indicating measurable and consistent
results. Socioeconomic status has little to do with the probability of having heart disease, and
the other variables show only weak relationships.
CARDIOVAS - CORRELATION
The Cardiovas dataset consists of cardiovascular risk factor data, with 403 observations.
We first conducted a correlation analysis to see which variables would be most useful in our
linear and multiple regressions. Based on the correlation analysis in Figure 6, we chose
cholesterol, systolic blood pressure, and hemoglobin A1C as dependent variables. We conducted a
multiple regression for each dependent variable, each with 8 independent (explanatory)
variables.
Figure 6. Cardiovas Correlation Analysis
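For readers who want to see what a single cell of a correlation analysis like Figure 6 measures, here is a minimal Python sketch of the Pearson correlation coefficient. The age and blood pressure values are hypothetical and are not drawn from the Cardiovas dataset; the function name is an assumption of this sketch.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(spread_x * spread_y)


if __name__ == "__main__":
    # Hypothetical ages (years) and systolic blood pressures (mm Hg).
    ages = [30, 40, 50, 60, 70]
    systolic = [118, 125, 130, 142, 150]
    print(round(pearson_r(ages, systolic), 3))
```

The coefficient ranges from -1 to 1; values near 0 indicate little linear relationship, which is why weakly correlated variables are less useful as predictors in the later regressions.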
Even a slight increase in blood glucose raises the risk of having heart disease.[1] Excess
cholesterol in the blood builds up in the walls of the arteries, causing what is known as
“atherosclerosis”. There are two forms of cholesterol: low-density lipoprotein (LDL), known as
“bad” cholesterol, and high-density lipoprotein (HDL), known as “good” cholesterol.[2] We did not
include HDL as a factor in the analysis because it is “good” cholesterol, and there was no data
for LDL in the data set we selected. As for age, it is known that as people grow older, the heart
goes through many physiological changes; age could compound heart-related problems if a
cardiovascular disease already exists.[3] According to WebMD, people over the age of 50 have the
highest chance of getting heart disease.
CARDIOVAS - LINEAR REGRESSION
In most people, systolic blood pressure rises steadily with age due to increasing stiffness
of large arteries, long-term build-up of plaque, and increased incidence of cardiac and vascular
disease.[4] Systolic blood pressure as an independent variable could be a strong predictor for
risk of cardiovascular diseases.[5] Age also correlates with systolic blood pressure, with a
44.3% correlation, and the r² value of 19.63% in the linear regression shows that age explains
about a fifth of the variation in systolic blood pressure, as seen in Figure 7. Body weight also
plays a big role in indicating whether a patient is at risk of developing a blockage in the
heart.[6] However, the waist-to-hip ratio has been shown to be a better predictor of
cardiovascular diseases than body-mass index.[7] Signs of heart disease also include a high level
of hemoglobin A1C: a higher concentration of hemoglobin A1C in the blood is associated with an
increased risk of cardiovascular diseases.[8]
[1] MediLexicon International. "Glucose Increases Raise Heart Disease Risk." Medical News Today.
http://www.medicalnewstoday.com/articles/246612.php (accessed December 15, 2013).
[2] WebMD. "Cholesterol and Heart Disease." WebMD.
http://www.webmd.com/heart-disease/guide/heart-disease-lower-cholesterol-risk (accessed December 13, 2013).
[3] "Heart of the matter." Deccan Herald. http://www.deccanherald.com/content/302951/heart-matter.html
(accessed December 15, 2013).
[4] "Understanding Blood Pressure Readings." American Heart Association.
http://www.heart.org/HEARTORG/Conditions/HighBloodPressure/AboutHighBloodPressure/Understanding-Blood-Pressure-Readings_UCM_301764_Article.jsp#
(accessed December 14, 2013).
Figure 7. Cardiovas Linear Regression on Age and Systolic Blood Pressure
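In a linear regression with a single predictor, the r² value is the square of the correlation coefficient, which is why a 44.3% correlation between age and systolic blood pressure goes with an r² of roughly 0.196 (the small gap from the reported 19.63% would come from rounding the correlation). The sketch below checks that arithmetic and shows a least-squares fit on hypothetical data; the data points and function name are assumptions, not values from the Cardiovas dataset.

```python
def simple_linear_regression(xs, ys):
    """Least-squares slope, intercept, and r-squared for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)  # square of the Pearson correlation
    return slope, intercept, r_squared


if __name__ == "__main__":
    # Correlation reported in the text for age and systolic blood pressure.
    reported_r = 0.443
    print(round(reported_r ** 2, 4))  # 0.1962

    # Hypothetical ages (years) and systolic blood pressures (mm Hg).
    ages = [30, 40, 50, 60, 70]
    systolic = [118, 125, 130, 142, 150]
    slope, intercept, r_squared = simple_linear_regression(ages, systolic)
    print(round(slope, 2), round(intercept, 1), round(r_squared, 3))
```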
In general, the higher the hemoglobin A1C, the higher the risk that a person will develop
heart disease. If hemoglobin A1C stays high for a long period of time, the risk for heart
problems is even greater. Our test results show that older people are more likely to have higher
levels of hemoglobin A1C. Diabetesjournals.org reports that hemoglobin A1C levels ≥5.5–6.0% are
associated with incident heart failure in a middle-aged population, suggesting that hemoglobin
A1C in relation to older age contributes to development of heart failure.
[5] U.S. National Library of Medicine. "Elevated systolic blood pressure and risk of cardiovascular
and renal disease." National Center for Biotechnology Information.
http://www.ncbi.nlm.nih.gov/pubmed/10467215 (accessed December 15, 2013).
[6] "Weight & Waistlines: Heart Disease Risk Factors." WebMD.
http://www.webmd.com/heart-disease/features/weight-waistlines-heart-disease-risk (accessed December 15, 2013).
[7] Wang, Z. "Waist Circumference, Body Mass Index, Hip Circumference and Waist-To-Hip Ratio as
Predictors of Cardiovascular Disease in Aboriginal People." The University of Queensland.
http://espace.library.uq.edu.au/eserv.php?pid=UQ:9338&dsID=wh.pdf (accessed December 15, 2013).
[8] U.S. National Library of Medicine. "Association of hemoglobin A1c with cardiovascular disease
and mortality in adults: the European prospective investigation into cancer in Norfolk." National
Center for Biotechnology Information. http://www.ncbi.nlm.nih.gov/pubmed/15381514 (accessed
December 15, 2013).
Figure 8. Cardiovas Correlation of Hemoglobin A1C & Age
Figure 9. Cardiovas Linear Regression of Hemoglobin A1C & Age
CARDIOVAS - MULTIPLE REGRESSION TEST 1: CHOLESTEROL
The independent variables put into SAS against cholesterol are the following: Age, Blood
Glucose, Diastolic Blood Pressure, Hemoglobin A1C, Hip, Systolic Blood Pressure, Waist, and
Weight. We used stepwise selection and found that only 3 independent variables remained
significant: hemoglobin A1C, age, and diastolic blood pressure.
The r² value in the stepwise summary shows how much of the variation in cholesterol the model
explains as each variable is entered. It is 0.0714 with hemoglobin A1C, 0.1002 once age is
added, and 0.1204 once diastolic blood pressure is added. Thus, the three variables together
explain 12.04% of the variation in cholesterol levels.
Figure 10. Cardiovas Cholesterol Stepwise Summary
CARDIOVAS - MULTI. REGR. TEST 2: SYSTOLIC BLOOD PRESSURE
The independent variables included in the model run in SAS against systolic blood pressure
are the following: Age, Blood Glucose, Cholesterol, Diastolic Blood Pressure, Hemoglobin A1C,
Hip, Waist, and Weight. The stepwise selection retained only four independent variables:
diastolic blood pressure, age, hip, and weight. The model r² value rose as each variable was
entered: 0.3686 with diastolic blood pressure alone (36.86% of the variation in systolic blood
pressure explained), 0.5402 once age was added (54.02%), 0.5440 once hip was added (54.40%), and
0.5476 once weight was added (54.76%). Together, the variables entered have a moderate level of
explanatory power for the dependent variable of systolic blood pressure.
Figure 11. Cardiovas Systolic Blood Pressure Stepwise Summary
CARDIOVAS - MULTI. REGR. TEST 3: HEMOGLOBIN A1C
The independent variables in the third test, against hemoglobin A1C, are as follows: Age,
Blood Glucose, Cholesterol, Diastolic Blood Pressure, Hip, Systolic Blood Pressure, Waist, and
Weight. The stepwise selection retained four independent variables: blood glucose, cholesterol,
age, and waist. Together, the variables entered have a moderate to high level of explanatory
power for hemoglobin A1C. The model r² value rose as each variable was entered: 0.5542 with
blood glucose alone (55.42%), 0.5745 once cholesterol was added (57.45%), 0.5838 once age was
added (58.38%), and 0.5869 once waist was added (58.69%).
Figure 12. Cardiovas Hemoglobin A1C Stepwise Summary
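The r² values quoted for this test increase with each variable entered, which is the pattern a stepwise summary shows when it reports the cumulative r² of the model after each step. Under that assumption, the contribution of each variable is the difference between consecutive values. The sketch below applies this to the four values reported above for hemoglobin A1C; the function name is an assumption of this sketch.

```python
def incremental_r_squared(steps):
    """Split cumulative model r-squared values into per-step increments.

    steps is a list of (variable_name, cumulative_r_squared) in entry order.
    """
    increments = []
    previous = 0.0
    for name, cumulative in steps:
        increments.append((name, round(cumulative - previous, 4)))
        previous = cumulative
    return increments


if __name__ == "__main__":
    # Values reported in the text for Test 3, assumed to be cumulative model r-squared.
    hba1c_steps = [
        ("blood glucose", 0.5542),
        ("cholesterol", 0.5745),
        ("age", 0.5838),
        ("waist", 0.5869),
    ]
    for name, gain in incremental_r_squared(hba1c_steps):
        print(f"{name}: {gain:.4f}")
```

Read this way, blood glucose accounts for most of the explained variation in hemoglobin A1C, while cholesterol, age, and waist each add a few percentage points or less.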
CARDIOVAS - MULTI. REGR. TEST 4: BLOOD GLUCOSE
The independent variables included in the model run in SAS against blood glucose are the
following: Age, Cholesterol, Diastolic Blood Pressure, Hemoglobin A1C, Hip, Systolic Blood
Pressure, Waist, and Weight. The stepwise selection retained only three independent variables:
hemoglobin A1C, weight, and age. The model r² value rose as each variable was entered: 0.5542
with hemoglobin A1C alone (55.42% of the variation in blood glucose explained), 0.5579 once
weight was added (55.79%), and 0.5605 once age was added (56.05%). Together, the variables
entered have a moderate to high level of explanatory power for the dependent variable of blood
glucose.
Figure 13. Cardiovas Blood Glucose Stepwise Summary
CARDIOVAS - MULTI. REGR. RESULTS
Of the previous 4 multiple regressions, the independent variables for hemoglobin A1C and for
blood glucose have the strongest (moderate to high) power to explain their dependent variable;
the independent variables for systolic blood pressure have moderate explanatory power; and
cholesterol’s independent variables have the weakest (low) power to explain the outcome of
cholesterol.
Therefore, the independent variables for both hemoglobin A1C and blood glucose should be
referred to in order to explain how the results of each dependent variable came to be. However,
the independent variables for cholesterol, despite their weak explanatory power, should not be
discarded, as they still explain some part of the outcome of their dependent variable, albeit a
rather small portion of it.
Our most valuable tests are Test 3 and Test 4, as the independent variables in each test
can explain at least 50% of the outcome seen in their dependent variable. Accordingly, to explain
the results of hemoglobin A1C, levels of blood glucose and cholesterol and the patient’s age and
waist size should be taken into account. Additionally, to explain what affects blood glucose,
hemoglobin A1C levels as well as age and weight must be inspected.
CONCLUSIONS: LA HEART & CARDIOVAS
From the LA Heart dataset, our logistic regression illustrates that systolic blood pressure
is strongly related to developing heart disease, while the other independent variables (diastolic
blood pressure, cholesterol, and socioeconomic status) have less to do with predicting heart
illnesses. Both 95% confidence limits of the odds ratio for systolic blood pressure are above 1
on the odds line. The influence diagnostics also show that systolic blood pressure has the most
compact graph, which is consistent with its strong relationship with heart illnesses.
The Cardiovas dataset shows us that hemoglobin A1C is explained best by the model in Figure
12, which reaches an r² value of 0.5869 once waist size is entered, a moderate to strong level of
explanatory power. Blood glucose accounts for most of this (an r² value of 0.5542 on its own),
while cholesterol, age, and waist size each add a smaller amount. Additionally, blood glucose is
best explained by the model in Figure 13, which reaches an r² value of 0.5605 once age is
entered. Hemoglobin A1C accounts for most of that explanatory power, with weight and age adding
smaller amounts.