- The document provides an analysis of the Haberman's breast cancer survival dataset which contains information on 306 female patients who underwent breast cancer surgery between 1958-1970.
- Exploratory data analysis techniques like univariate analysis, multivariate analysis, and data visualization using ggplot were used to understand relationships between age, year of operation, lymph node status and survival outcome.
- Key visualizations included bar plots to show survival rates, scatter plots to examine relationships between age and year of operation colored by survival status, and identifying trends over time.
Analysis of Haberman Dataset for Breast Cancer Survival
4TH MAY 2019
Analysis on Haberman Dataset
Authored by: Manju Yadav (BSc Statistics)
Guidance of : Mr Pritesh Tiwari (Sr. DataScientist)
Business Requirements Document
Document Approvals
Source
The data set is available for download from UCI machine learning repository.
https://archive.ics.uci.edu/ml/datasets/Haberman's+Survival
Donor : Tjen-Sien Lim (limt '@' stat.wisc.edu)
Dated : March 4, 1999
References
Haberman, S. J. (1976). “Generalized residuals for log-linear models”, Proceedings of the 9th International Biometrics Conference, Boston, 104-122.
Lichman, M. (2013). “UCI Machine Learning Repository”, Irvine, CA: University of California, School of Information and Computer Science. http://archive.ics.uci.edu/ml
Landwehr, J. M., Pregibon, D., and Shoemaker, A. C. (1984). “Graphical Models for Assessing Logistic Regression Models (with discussion)”, Journal of the American Statistical Association 79: 61-83.
Lo, W.-D. (1993). “Logistic Regression Trees”, PhD thesis, Department of Statistics, University of Wisconsin, Madison, WI.
https://www.breastcancer.org/symptoms/diagnosis/lymph_nodes
https://www.r-studio.com/Unformat_Help/systemrequirements.html
Date          Version Number    Document Changes
03/05/2019    0.1               Initial Draft
Content
Sr. No Title
1. Introduction
2. EDA
3. Technical installation
4. Programming
• Environment Configuration
• Data Preparation
• Data Subsetting
• Data Visualization (using ggplot)
o Histogram
o Density
o Barplot
o Boxplot
o Pairs
o Stack Bar
o Scatter Plot
o Scatter Plot1
5. Conclusion
Introduction
This dataset records the survival of women who underwent breast cancer surgery. The cases come from a
study conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the
survival of patients after surgery.
The aim is to use exploratory data analysis to gain insight into patient survival and the factors
related to it.
The dataset has 306 rows and 4 columns: three predictor attributes and one class label (survival status).
The columns are:
1. Age (age of the patient at the time of operation, from 30 to 83)
2. Year of operation (stored as year minus 1900; values from 58 to 69 in the data)
3. Number of positive axillary nodes (lymph nodes) detected
4. Survival status
o 1 = the patient survived 5 years or longer after surgery
o 2 = the patient died within 5 years of surgery
The dataset available at the UCI repository has no missing values and needs no further cleaning.
Lymph Node: Lymph nodes are small, bean-shaped organs that act as filters along the lymph fluid
channels. As lymph fluid leaves the breast and eventually goes back into the bloodstream, the lymph nodes
try to catch and trap cancer cells before they reach other parts of the body. Having cancer cells in the lymph
nodes under your arm suggests an increased risk of the cancer spreading. In our data, the number of
positive axillary nodes detected per patient ranges from 0 to 52.
The lymph nodes in your armpit (axillary lymph nodes) are the most common place that cancer cells
would lodge, causing those nodes to swell.
In this context, the more positive nodes a patient has, the further the cancer is likely to have spread.
All patients in the dataset are women; breast cancer in men is rare.
If breast cancer spreads, the lymph nodes in the underarm (the axillary lymph nodes) are the first place it’s
likely to go.
Lymph node status is highly related to prognosis.
o Lymph node-negative means the axillary lymph nodes do not contain cancer.
o Lymph node-positive means the axillary lymph nodes contain cancer.
Prognosis is better when cancer has not spread to the lymph nodes (lymph node-negative)
Objectives of the project:
To understand the dataset
To apply exploratory data analysis.
Scope of the project:
To answer questions about the data and to learn the techniques of exploratory data analysis.
To identify the values that matter most for survival.
To identify the inter-relationships between the attributes.
Exploratory Data Analysis:
o Exploratory Data Analysis (EDA) is the process of asking questions of the data and applying statistics
and visualization techniques to answer them, in order to find and understand the hidden
insights in the data.
o Setting the target is the key step in EDA, because it shapes every later idea in the analysis. In the
Haberman dataset, the objective is to predict whether a patient will survive 5 years or longer after
surgery, based on the patient’s age, year of operation and number of positive (lymph) nodes.
Multivariate Analysis:
o Multivariate analysis is performed to understand interactions between different fields in the
dataset.
o The Haberman dataset is multivariate, so we need to understand how its attributes relate to and
depend on one another.
o Dimensionality reduction helps to identify the fields that account for the most
variance between observations and allows a reduced volume of data to be processed.
Univariate Analysis:
o The main purpose of univariate analysis is to describe, summarize and find patterns in a
single feature.
Hardware Requirements:
RAM: Minimum 6 GB (Windows) or 8 GB (Mac); Recommended 8 GB
Storage: Minimum 7200 RPM SATA drive with 20 GB of available space; Recommended SSD with
40 GB of available space
Processor: Minimum Intel Core i3 2.5 GHz; Recommended Intel Core i5
Software Requirements:
R and R Studio
Excel
Word
1. Environment Configuration
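Load the data set into a variable.
getwd()
## [1] "C:/Users/hp/Documents"
setwd("D:/datasets/manju")
getwd()
## [1] "D:/datasets/manju"
# Read the CSV file; the first row holds the column names
hb <- read.csv("D:/manju/data/haberman.csv", header = TRUE, sep = ",")
# Keep a working copy of the data
hb_data <- hb
# Show the first ten rows
head(hb_data, 10)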
# age year_operated p_node survival_status
# 1 30 64 1 1
# 2 30 62 3 1
# 3 30 65 0 1
# 4 31 59 2 1
# 5 31 65 4 1
# 6 33 58 10 1
# 7 33 60 0 1
# 8 34 59 0 2
# 9 34 66 9 2
# 10 34 58 30 1
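For readers without the CSV file, the ten rows printed above can be rebuilt from inline text. This is a minimal sketch, assuming the column names shown in the output; `hb10`, `csv_text` and `status_label` are hypothetical names used only for illustration, and ten rows are not representative of the full 306.

```r
# Rebuild the ten rows printed above from inline text (hb10 is an illustrative name)
csv_text <- "age,year_operated,p_node,survival_status
30,64,1,1
30,62,3,1
30,65,0,1
31,59,2,1
31,65,4,1
33,58,10,1
33,60,0,1
34,59,0,2
34,66,9,2
34,58,30,1"

hb10 <- read.csv(text = csv_text, header = TRUE, sep = ",")

# Label the class column so tables and plots read naturally
hb10$status_label <- factor(hb10$survival_status,
                            levels = c(1, 2),
                            labels = c("survived", "died"))

dim(hb10)
table(hb10$status_label)
```

Converting the status to a labelled factor early is a design choice: it keeps later tables and plot legends readable without changing the underlying 1/2 coding.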
1. Data Preparation
A first look at the loaded dataset.
# Check the class of the dataset.
class(hb)
## [1] "data.frame"
The result shows that hb is a data frame.
2. Understanding the summary
summary(hb)
##       age        year_operated       p_node       survival_status
##  Min.   :30.00   Min.   :58.00   Min.   : 0.000   Min.   :1.000
##  1st Qu.:44.00   1st Qu.:60.00   1st Qu.: 0.000   1st Qu.:1.000
##  Median :52.00   Median :63.00   Median : 1.000   Median :1.000
##  Mean   :52.46   Mean   :62.85   Mean   : 4.026   Mean   :1.265
##  3rd Qu.:60.75   3rd Qu.:65.75   3rd Qu.: 4.000   3rd Qu.:2.000
##  Max.   :83.00   Max.   :69.00   Max.   :52.000   Max.   :2.000
o The age summary shows that the youngest patient who underwent surgery was 30 and the oldest was 83.
o The year of operation runs from 1958 to 1969 in the dataset.
o The number of positive axillary nodes ranges from 0 to 52. A count as high as 52 indicates that the
cancer had spread widely through the lymph nodes.
o The summary also gives the quartiles, mean and median of each attribute.
# Variance-covariance matrix of all attributes.
var(hb)
##                         age year_operated     p_node survival_status
## age             116.7145827   3.142912247 -4.9070824     0.324397300
## year_operated     3.1429122  10.558630665 -0.0879460    -0.006846673
## p_node           -4.9070824  -0.087945998 51.6911175     0.911089682
## survival_status   0.3243973  -0.006846673  0.9110897     0.195274831
o Called on a data frame, var() returns the variance-covariance matrix. The diagonal holds the variance of
each attribute: the sum of squared distances of the values from their mean, divided by n - 1.
o The off-diagonal entries are covariances. A positive value means two attributes tend to rise together; a
negative value means one tends to fall as the other rises.
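The definition of variance can be checked by hand. The sketch below uses only the ten rows printed earlier in the document (not the full dataset), so the numbers differ from the var(hb) output; `manual_var` and `manual_cov` are illustrative names.

```r
# Age and positive nodes of the first ten patients shown earlier in the document
age    <- c(30, 30, 30, 31, 31, 33, 33, 34, 34, 34)
p_node <- c(1, 3, 0, 2, 4, 10, 0, 0, 9, 30)
n <- length(age)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
manual_var <- sum((age - mean(age))^2) / (n - 1)

# Sample covariance: sum of products of paired deviations, same denominator
manual_cov <- sum((age - mean(age)) * (p_node - mean(p_node))) / (n - 1)

# Both agree with the built-in functions
all.equal(manual_var, var(age))
all.equal(manual_cov, cov(age, p_node))
```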
# Overall structure of the dataset.
str(hb)
## 'data.frame': 306 obs. of 4 variables:
## $ age : double 30 30 30 31 31 33 33 34 34 34 ...
## $ year_operated : double 64 62 65 59 65 58 60 59 66 58 ...
## $ p_node : double 1 3 0 2 4 10 0 0 9 30 ...
## $ survival_status : double 1 1 1 1 1 1 1 2 2 1 ...
o str() shows the structure of the Haberman dataset: its class, dimensions, attribute names,
data types and the first few values of each attribute. The dataset contains 306 observations of 4
attributes (3 predictors and the survival status), and every attribute has the “double” datatype.
#Column names of Haberman Dataset
colnames(hb)
## [1] "age" "year_operated" "p_node" "survival_status"
colnames() returns the names of the 4 attributes.
# head function
head(hb)
## age year_operated p_node survival_status
## 1 30 64 1 1
## 2 30 62 3 1
## 3 30 65 0 1
## 4 31 59 2 1
## 5 31 65 4 1
## 6 33 58 10 1
Using head(), we get the first 6 rows of the dataset.
# tail function
tail(hb)
## age year_operated p_node survival_status
## 301 74 63 0 1
## 302 75 62 1 1
## 303 76 67 0 1
## 304 77 65 3 1
## 305 78 65 1 2
## 306 83 58 2 2
Using tail(), we get the last 6 rows of the dataset.
# range() returns the minimum and maximum of a particular attribute
range(hb$age)
## [1] 30 83
range(hb$year_operated)
## [1] 58 69
range(hb$p_node)
## [1] 0 52
These match the minimum and maximum values reported by summary().
# table() function
table(hb$age)
## 30 31 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
## 3 2 2 7 2 2 6 10 6 3 10 9 11 7 9 7 11 7 10 12 6 14 11 13 10
## 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 83
## 7 11 7 8 6 9 7 8 5 10 5 6 2 4 7 1 4 2 2 1 1 1 1 1
table(hb$year_operated)
## 58 59 60 61 62 63 64 65 66 67 68 69
## 36 27 28 26 23 30 31 28 28 25 13 11
table(hb$p_node)
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 136 41 20 20 13 6 7 7 7 6 3 4 2 5 4 3 1 1
## 18 19 20 21 22 23 24 25 28 30 35 46 52
## 1 3 2 1 3 3 1 1 1 1 1 1 1
table(hb$survival_status)
## 1 2
## 225 81
o table(hb$p_node) shows that 136 patients have 0 positive nodes, 41 patients have 1 node, 20
patients have 2 nodes, and so on up to a single patient with 52 nodes.
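Counts are easier to compare as shares. The sketch below converts the table(hb$survival_status) counts shown above into percentages; the counts are typed in by hand here, and `status_counts` and `status_share` are illustrative names.

```r
# Counts taken from the table(hb$survival_status) output above
status_counts <- c("1" = 225, "2" = 81)

# Convert counts to proportions of all patients
status_share <- prop.table(status_counts)

# Percentages rounded to one decimal place
round(100 * status_share, 1)
```

With the full data frame loaded, the same result would come from prop.table(table(hb$survival_status)).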
Data Subsetting :
Subsetting is a way of extracting only the part of a large dataset that the analyst is
interested in.
# Patients who survived 5 years or longer
hb_sub1 <- subset(hb, survival_status == 1)
# Patients who died within 5 years
hb_sub2 <- subset(hb, survival_status == 2)
hb_sub1
hb_sub2
# age year_operated p_node survival_status
# 1 34 59 0 2
# 2 34 66 9 2
# 3 38 69 21 2
# 4 39 66 0 2
# 5 41 60 23 2
# 6 41 64 0 2
# 7 41 67 0 2
# 8 42 69 1 2
# 9 42 59 0 2
# 10 43 58 52 2
# ... and so on
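The same subsetting can be written in more than one way. A minimal sketch on the ten rows printed earlier in the document; `hb10`, `died_a`, `died_b` and `groups` are illustrative names, not part of the original analysis.

```r
# First ten rows of the dataset, as printed earlier in the document
hb10 <- data.frame(
  age             = c(30, 30, 30, 31, 31, 33, 33, 34, 34, 34),
  year_operated   = c(64, 62, 65, 59, 65, 58, 60, 59, 66, 58),
  p_node          = c(1, 3, 0, 2, 4, 10, 0, 0, 9, 30),
  survival_status = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 1)
)

# subset() and logical indexing select the same rows
died_a <- subset(hb10, survival_status == 2)
died_b <- hb10[hb10$survival_status == 2, ]
all.equal(died_a, died_b)

# split() returns both groups at once, as a list keyed by status
groups <- split(hb10, hb10$survival_status)
sapply(groups, nrow)
```

split() is convenient when both groups are needed, since it avoids repeating the condition for each status.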
Application of the Grammar of Graphics (ggplot)
Visualizations help us understand the dataset more intuitively. They also let a reader without a
data background see what is going on in the data, by revealing correlations and suggesting
ways to frame the analysis.
Installation of the packages
install.packages("ggplot2")
library(ggplot2)
#Visualisation using Barplot
# Bar plot of the survival_status of patients; factor() treats the two statuses as discrete groups
chart_hb <- ggplot(data = hb, aes(x = factor(survival_status))) + geom_bar()
# The bars show how many patients survived and how many did not
# Add a title and axis labels
chart_hb1 <- chart_hb + labs(title = "Number of patients who survived vs did not survive", x = "Survival Status", y = "Number of patients")
# Add the count above each bar (after_stat() needs ggplot2 3.3.0 or later)
chart_hb2 <- chart_hb1 + geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.25)
chart_hb2
Observations:
o The plot clearly shows that 225 patients survived and 81 patients died
o The counts are computed by the count statistic, and labs() sets the title and the x and y axis labels
o The plot gives a clear picture of survival among the 306 patients in the data
Visualisation using Scatter_plot
scatter_hb <- ggplot(data = hb_data, aes(x = year_operated, y = age, shape = factor(survival_status), color = factor(survival_status))) + geom_point()
o It shows the age of each patient against the year of operation, marked by whether the patient survived
o The color shows the survival status, with one color per status
o A triangle indicates death within 5 years
o A circle indicates survival for 5 years or longer
o The plot is age v/s year_operated (year of operation minus 1900)
Observations:
o Reading from the plot, in 1958 around 24 women underwent an operation for breast cancer, of whom
5 died within 5 years. Overlapping points make counts read from the plot approximate; the frequency
table above lists 36 operations in 1958
o As the years pass, the death rate appears to decrease
o In 1965, about 10 patients died
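Counts per year are more reliable from a cross-tabulation than from a scatter plot, where identical points overlap. A minimal sketch on the ten rows printed earlier in the document (so the numbers are illustrative only, not the full-data counts); `year_by_status` and `death_share` are hypothetical names.

```r
# Year and status of the first ten patients shown earlier in the document
year_operated   <- c(64, 62, 65, 59, 65, 58, 60, 59, 66, 58)
survival_status <- c(1, 1, 1, 1, 1, 1, 1, 2, 2, 1)

# Rows are years, columns are survival status; each cell is a patient count
year_by_status <- table(year_operated, survival_status)
year_by_status

# Share of patients in each year who died within 5 years (status 2)
death_share <- prop.table(year_by_status, margin = 1)[, "2"]
death_share
```

On the full data frame, table(hb$year_operated, hb$survival_status) would give the exact number of deaths for 1958, 1965 and every other year.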
#Visualisation Using Geom_point
# Shape and color are mapped inside geom_point() so that geom_smooth() fits a single trend line
hb_scatter1 <- ggplot(data = hb, aes(x = year_operated, y = age)) +
  geom_point(aes(shape = factor(survival_status), color = factor(survival_status))) +
  geom_smooth()
hb_scatter1
Observation:
o The blue line is the smoothed trend of patient age across the years of operation. Read together with
the point colors, the plot suggests that the chances of survival increase and the mortality rate decreases in later years
o In context, one possible explanation is that as doctors gained experience with the disease, they
were able to save more patients with breast cancer
#Visualisation Using Stack Bar
hb_stack <- ggplot(data = hb, aes(x = age, y = year_operated, fill = factor(survival_status)))
hb_stack1 <- hb_stack + geom_bar(stat = "identity")
hb_stack2 <- hb_stack + geom_bar(stat = "identity", position = position_dodge())
# Few of the oldest patients survived 5 years, for example the 83-year-old patient
# Few patients younger than 40 died within 5 years
# Most deaths within 5 years occurred among patients between 41 and 67 years of age
o This plot shows year_operated v/s age, with the fill color indicating survival
status
Stack bar (p_node v/s age with indication of survival_status)
hb_stack3 <- ggplot(data = hb, aes(x = age, y = p_node, fill = factor(survival_status)))
hb_stack4 <- hb_stack3 + geom_bar(stat = "identity")
hb_stack5 <- hb_stack3 + geom_bar(stat = "identity", position = position_dodge())
o The plot indicates that the more positive nodes (p_node) a patient has, the lower the survival rate
o A poorer prognosis means a lower chance of survival for the patient
#Visualisation Using Density plot
hb_den <- ggplot(data = hb_data, aes(x = year_operated, color = factor(survival_status))) + geom_density()
# The density of operations is lower between 59.5 and 64.5, rises until 1965 and gradually falls from 1966
o In the pairs plot of the next section, a black circle marks a patient who survived 5 years or longer and
a red circle marks a patient who died within 5 years, across age, year of surgery and positive nodes
o A density plot shows the distribution of a numeric variable
#Visualization using Pairs
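# Scatter-plot matrix of the three predictors, colored by survival status; oma widens the right margin for the legend
pairs(hb_data[, 1:3], col = hb_data[, 4], oma = c(4, 4, 6, 12))
# Allow drawing outside the plot region so the legend can sit in the margin
par(xpd = TRUE)
legend(0.85, 0.6, as.vector(unique(hb_data$survival_status)), fill = c(1, 2))
o pairs() plots the selected attributes (columns 1 to 3) against one another, and col takes the output
column, survival status, to color the points.
o The oma argument sets the outer margins, which reserves space beside the plot.
o par(xpd = TRUE) allows drawing outside the plot region, so the window effectively holds the plot in one
part and the legend in the other.
o legend() takes the position first and then the labels, here the survival status values.
o The plot covers all attributes except survival status and uses base R graphics rather than the ggplot2 package.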
Visualisation Using Histogram
hb_hist <- ggplot(data = hb_data, aes(x = age, color = factor(survival_status))) + geom_histogram(binwidth = 1, fill = "white")
# Print the plot
hb_hist
Observations:
o The histogram shows survival status by age
o The blue histogram shows that fewer patients died
o The red histogram shows that a larger number of patients survived
o There is one 83-year-old patient, who did not survive
Visualization using boxplot
boxplot(hb_data$p_node, main = "Boxplot of Positive Nodes", xlab = "P_Nodes", ylab = "Counts", col = "yellow")
boxplot(hb_sub1[, 1:3], main = "Patients Survived 5 years or longer", las = 2, col = rainbow(3))
boxplot(hb_sub2[, 1:3], main = "Patients have not Survived 5 years", las = 2, col = rainbow(3))
o The 1st boxplot is for positive axillary nodes
o The 2nd and 3rd are for patients who survived 5 years or longer and patients who died within 5 years
o In the first boxplot the minimum is 0 and the mean is 4.026; values above 10 are drawn as outliers, up to the maximum of 52
Observations:
o Outliers are present in p_node: the data ranges from 0 to 52 but the median is only 1, so most
values are small and a few are very large
o The value 52 is the most extreme
o The median is not affected by these extreme values, whereas the mean is pulled upward by them
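The outlier boundary in the first boxplot can be worked out from the quartiles. The sketch below rebuilds the 306 node counts from the table(hb$p_node) output shown earlier and applies the usual 1.5 × IQR rule; it uses quantile(), which matches the hinges that boxplot() uses for this data but can differ slightly in general. All variable names are illustrative.

```r
# Rebuild the 306 node counts from the table(hb$p_node) output shown earlier
node_value <- c(0:25, 28, 30, 35, 46, 52)
node_count <- c(136, 41, 20, 20, 13, 6, 7, 7, 7, 6, 3, 4, 2, 5, 4, 3, 1, 1,
                1, 3, 2, 1, 3, 3, 1, 1, 1, 1, 1, 1, 1)
p_node <- rep(node_value, times = node_count)

# Quartiles and the upper fence: Q3 + 1.5 * IQR
q <- quantile(p_node, c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]
upper_fence <- q[[2]] + 1.5 * iqr
upper_fence

# Number of patients drawn as outliers (values above the fence)
sum(p_node > upper_fence)

# The mean is pulled up by the large values; the median is not
c(mean = mean(p_node), median = median(p_node))
```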
Visualisation using Decision tree:
library(rpart)
# The formula survival_status ~ . models survival_status using all other columns as predictors
model <- rpart(survival_status ~ ., data = hb_data, control = rpart.control(minsplit = 3))
plot(model, compress = TRUE)
text(model, cex = 0.5, use.n = TRUE, fancy = FALSE, all = TRUE)
Observations:
o The decision tree is built with a minimum split size of 3.
o It partitions the whole dataset by survival status
o Because survival_status is numeric, each node shows the mean status of its patients. A value such as
1.808 lies between 1 and 2 and close to 2, which means most patients in that node died within 5 years
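A node value such as 1.808 can be read as a proportion. This is a sketch under the assumption that rpart fitted a regression tree (its default when the response is numeric rather than a factor), so each node reports the mean of survival_status. The leaf composition below is hypothetical, chosen only because it reproduces 1.808.

```r
# Hypothetical leaf: 5 patients with status 1 and 21 patients with status 2
leaf_status <- c(rep(1, 5), rep(2, 21))

# Value a regression tree would report for this leaf
leaf_mean <- mean(leaf_status)

# Fraction of patients in the leaf who died within 5 years
share_died <- mean(leaf_status == 2)

round(leaf_mean, 3)
all.equal(leaf_mean - 1, share_died)
```

Since the status is coded 1 or 2, the node mean minus 1 equals the share of patients with status 2. Converting survival_status to a factor before fitting would make rpart build a classification tree and report classes instead.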
Conclusion
o Although the maximum number of positive lymph nodes observed is 52,
about 75% of the patients have fewer than 5 positive lymph nodes and about
44% of the patients (136 of 306) have no positive lymph nodes.
o About 77% of the patients have 5 or fewer positive lymph nodes
o Of the 306 patients on record, 225 (about 74%) survived 5 years or longer and 81
(about 26%) did not.
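The shares quoted in the conclusion can be checked against the table(hb$p_node) counts shown earlier. A minimal sketch that rebuilds the node counts by hand; `pct` is a hypothetical helper and the other names are illustrative.

```r
# Rebuild the 306 node counts from the table(hb$p_node) output shown earlier
node_value <- c(0:25, 28, 30, 35, 46, 52)
node_count <- c(136, 41, 20, 20, 13, 6, 7, 7, 7, 6, 3, 4, 2, 5, 4, 3, 1, 1,
                1, 3, 2, 1, 3, 3, 1, 1, 1, 1, 1, 1, 1)
p_node <- rep(node_value, times = node_count)

# Percentage of patients meeting a condition, rounded to one decimal place
pct <- function(condition) round(100 * mean(condition), 1)

node_shares <- c(
  none          = pct(p_node == 0),
  fewer_than_5  = pct(p_node < 5),
  five_or_fewer = pct(p_node <= 5)
)
node_shares
```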