4TH MAY 2019
Analysis on Haberman Dataset
Authored by: Manju Yadav (BSc Statistics)
Guidance of : Mr Pritesh Tiwari (Sr. DataScientist)
Business Requirements Document
Document Approvals
Source
The data set is available for download from UCI machine learning repository.
https://archive.ics.uci.edu/ml/datasets/Haberman's+Survival
Donor : Tjen-Sien Lim (limt '@' stat.wisc.edu)
Dated : March 4, 1999
References
Haberman, S. J. (1976). “Generalized residuals for log-linear models”, Proceedings of the 9th International
Biometrics Conference, Boston, 104-122.
Lichman, M. (2013). “UCI Machine Learning Repository”,
Irvine, CA: University of California, School of Information and Computer Science.
http://archive.ics.uci.edu/ml
Landwehr, J. M., Pregibon, D., and Shoemaker, A. C. (1984), Graphical Models for Assessing Logistic
Regression Models (with discussion), Journal of the American Statistical Association 79: 61-83.
Lo, W.-D. (1993). Logistic Regression Trees, PhD thesis, Department of Statistics, University of
Wisconsin, Madison, WI.
https://www.breastcancer.org/symptoms/diagnosis/lymph_nodes
https://www.r-studio.com/Unformat_Help/systemrequirements.html
Date          Version Number    Document Changes
03/05/2019    0.1               Initial Draft
Content
Sr. No Title
1. Introduction
2. EDA
3. Technical installation
4. Programming
• Environment Configuration
• Data Preparation
• Data Subsetting
• Data Visualization (using ggplot)
o Histogram
o Density
o Barplot
o Boxplot
o Pairs
o Stack Bar
o Scatter Plot
o Scatter Plot1
5. Conclusion
Introduction
This is a dataset on the survival of women who underwent surgery for breast cancer. It contains cases from a
study conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the
survival of patients after the surgery.
The aim is to analyse patient survival through exploratory data analysis and to
give insights.
The data set has 306 rows and 4 columns: 3 attributes and the class label.
The columns are:
1. Age (from 30 to 83)
2. Year of operation (recorded as year minus 1900; 58 to 69 in the data, i.e. 1958 to 1969)
3. Number of Positive Axillary nodes (Lymph Nodes)
4. Survival Status
o 1 = the patient survived 5 years or longer after surgery
o 2 = the patient died within 5 years after surgery
There are no missing values, and the dataset as available from the UCI Repository is already clean.
Lymph Node: Lymph nodes are small, bean-shaped organs that act as filters along the lymph fluid
channels. As lymph fluid leaves the breast and eventually goes back into the bloodstream, the lymph nodes
try to catch and trap cancer cells before they reach other parts of the body. Having cancer cells in the lymph
nodes under your arm suggests an increased risk of the cancer spreading. In our data, the number of positive
axillary nodes detected ranges from 0 to 52.
The lymph nodes in your armpit (axillary lymph nodes) are the most common place that cancer cells
would lodge, causing those nodes to swell.
In context, the more positive nodes a patient has, the more likely it is that the cancer has spread. The
patients here are women, as it is rare for men to develop breast cancer.
If breast cancer spreads, the lymph nodes in the underarm (the axillary lymph nodes) are the first place it’s
likely to go.
Lymph node status is highly related to prognosis.
o Lymph node-negative means the axillary lymph nodes do not contain cancer.
o Lymph node-positive means the axillary lymph nodes contain cancer.
Prognosis is better when cancer has not spread to the lymph nodes (lymph node-negative)
Objectives of the project:
• To understand the dataset
• To apply exploratory data analysis.
Scope of the project:
• To answer questions about the data and learn the techniques of exploratory data
analysis.
• To identify the important values from the analysis.
• To identify the inter-relationships between the attributes.
Exploratory Data Analysis:
o Exploratory Data Analysis (EDA) is the process of asking questions of the data and applying statistics and
visualization techniques to answer them and to uncover the hidden
insights/information in the data.
o Setting the target is key in EDA, as it shapes the ideas behind the entire analysis. In the
Haberman dataset, the objective is to predict whether the patient will survive 5 years or longer after surgery,
based on the patient’s age, year of operation and the number of positive (lymph) nodes.
Multivariate Analysis:
o Multivariate analysis is performed to understand interactions between different fields in the
dataset.
o The Haberman dataset is multivariate, so we have to understand the relationships between attributes
and how one attribute depends on another.
o Dimensionality reduction helps to identify the fields in the data that account for the most
variance between observations and allows a reduced volume of data to be processed.
Univariate Analysis:
o The major purpose of univariate analysis is to describe, summarize and find patterns in a
single feature.
Hardware Requirements:
RAM: Minimum 6 GB (Windows), 8 GB (Mac); Recommended 8 GB
Storage: Minimum 7200 RPM SATA drive with 20 GB of available space; Recommended SSD with
40 GB of available space
Processor: Minimum Intel Core i3 2.5 GHz; Recommended Intel Core i5
Software Requirements:
R and R Studio
Excel
Word
1. Environment Configuration
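Set the working directory and load the data set into a variable.
getwd()
## [1] "C:/Users/hp/Documents"
setwd("D:/datasets/manju")
getwd()
## [1] "D:/datasets/manju"
# Read the CSV file; the first row holds the column names
hb <- read.csv("D:/manju/data/haberman.csv", header = TRUE, sep = ",")
# Keep a working copy; the plots below use either hb or hb_data
hb_data <- hb
# Show the first 10 rows
head(hb_data, 10)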
# age year_operated p_node survival_status
# 1 30 64 1 1
# 2 30 62 3 1
# 3 30 65 0 1
# 4 31 59 2 1
# 5 31 65 4 1
# 6 33 58 10 1
# 7 33 60 0 1
# 8 34 59 0 2
# 9 34 66 9 2
# 10 34 58 30 1
1. Data Preparation
Once the data is loaded, inspect the dataset.
# Check the class of the dataset.
class(hb)
## [1] "data.frame"
The dataset is a data frame.
2. Understanding the summary
summary(hb)
##       age        year_operated       p_node       survival_status
##  Min.   :30.00   Min.   :58.00   Min.   : 0.000   Min.   :1.000
##  1st Qu.:44.00   1st Qu.:60.00   1st Qu.: 0.000   1st Qu.:1.000
##  Median :52.00   Median :63.00   Median : 1.000   Median :1.000
##  Mean   :52.46   Mean   :62.85   Mean   : 4.026   Mean   :1.265
##  3rd Qu.:60.75   3rd Qu.:65.75   3rd Qu.: 4.000   3rd Qu.:2.000
##  Max.   :83.00   Max.   :69.00   Max.   :52.000   Max.   :2.000
o From the age summary, the youngest patient who underwent surgery is 30 and the oldest is 83.
o For year of operation, the first year in the dataset is 1958 and the last is 1969.
o Positive axillary nodes range from 0 to 52. The median is 1 while the mean is 4.026, so most patients
have few positive nodes and a small number of patients have very many.
o The summary also gives the quartiles, mean and median of each attribute.
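The fractional third quartile of age (60.75) in the summary can be traced by hand. The sketch below rebuilds the age column from the frequency table that table(hb$age) prints further down, so the CSV file is not needed. It assumes R's default quantile rule (type 7); the names ages, counts, age, q3_manual and q3_builtin are introduced here for illustration only.

```r
# Rebuild the age column from the frequency table of table(hb$age)
ages   <- c(30, 31, 33:78, 83)
counts <- c(3, 2, 2, 7, 2, 2, 6, 10, 6, 3, 10, 9, 11, 7, 9, 7, 11, 7, 10, 12,
            6, 14, 11, 13, 10, 7, 11, 7, 8, 6, 9, 7, 8, 5, 10, 5, 6, 2, 4, 7,
            1, 4, 2, 2, 1, 1, 1, 1, 1)
age <- rep(ages, times = counts)   # one value per patient, already sorted

# Default rule (type 7): position h = (n - 1) * p + 1, then linear interpolation
n <- length(age)
h <- (n - 1) * 0.75 + 1
lower <- age[floor(h)]             # value just below position h
upper <- age[ceiling(h)]           # value just above position h
q3_manual <- lower + (h - floor(h)) * (upper - lower)

# The built-in function should agree with the manual calculation
q3_builtin <- unname(quantile(age, 0.75))
print(c(manual = q3_manual, builtin = q3_builtin))
print(summary(age))
```

The third quartile falls between two neighbouring sorted values, which is why it is not a whole number even though every age is.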
# variance of all attributes.
var(hb)
##                         age year_operated     p_node survival_status
## age             116.7145827   3.142912247 -4.9070824     0.324397300
## year_operated     3.1429122  10.558630665 -0.0879460    -0.006846673
## p_node           -4.9070824  -0.087945998 51.6911175     0.911089682
## survival_status   0.3243973  -0.006846673  0.9110897     0.195274831
o Called on a data frame, var() returns the variance-covariance matrix. Each diagonal entry is the variance of one
attribute: the sum of squared distances between each value and the attribute's mean, divided by n - 1.
o Each off-diagonal entry is the covariance between two attributes. For example, the positive covariance between
p_node and survival_status (0.911) means that higher node counts tend to go with status 2 (died within 5 years).
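To make the reading of var() concrete, the sketch below recomputes one variance and one covariance by hand on the first 10 rows of the dataset printed earlier. The names hb10, manual_var_age and manual_cov are illustrative only, and the numbers differ from the full 306-row output because only 10 rows are used.

```r
# First 10 rows of the Haberman data, as printed after loading the file
hb10 <- data.frame(
  age             = c(30, 30, 30, 31, 31, 33, 33, 34, 34, 34),
  year_operated   = c(64, 62, 65, 59, 65, 58, 60, 59, 66, 58),
  p_node          = c(1, 3, 0, 2, 4, 10, 0, 0, 9, 30),
  survival_status = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 1)
)

# var() on a data frame returns the variance-covariance matrix
cov_mat <- var(hb10)
n <- nrow(hb10)

# Diagonal entry: sum of squared deviations from the mean, divided by n - 1
manual_var_age <- sum((hb10$age - mean(hb10$age))^2) / (n - 1)

# Off-diagonal entry: sum of products of deviations, divided by n - 1
manual_cov <- sum((hb10$age - mean(hb10$age)) *
                  (hb10$p_node - mean(hb10$p_node))) / (n - 1)

print(cov_mat)
```

cov_mat["age", "age"] matches manual_var_age and cov_mat["age", "p_node"] matches manual_cov; the matrix is symmetric, so each covariance appears twice.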
# Overall structure of the dataset.
str(hb)
## 'data.frame': 306 obs. of 4 variables:
## $ age : double 30 30 30 31 31 33 33 34 34 34 ...
## $ year_operated : double 64 62 65 59 65 58 60 59 66 58 ...
## $ p_node : double 1 3 0 2 4 10 0 0 9 30 ...
## $ survival_status : double 1 1 1 1 1 1 1 2 2 1 ...
o str() gives the structure of the Haberman dataset: its class, dimensions, attribute names,
data types and the first few values of each attribute. The dataset contains 306 observations of 4 variables (3 predictors
and the class label survival_status), and each attribute is reported with the “double” data type.
#Column names of Haberman Dataset
colnames(hb)
## [1] "age" "year_operated" "p_node" "survival_status"
colnames() returns the names of the 4 attributes.
# head function
head(hb)
## age year_operated p_node survival_status
## 1 30 64 1 1
## 2 30 62 3 1
## 3 30 65 0 1
## 4 31 59 2 1
## 5 31 65 4 1
## 6 33 58 10 1
head() returns the first 6 rows of the dataset, with all columns.
# tail function
tail(hb)
## age year_operated p_node survival_status
## 301 74 63 0 1
## 302 75 62 1 1
## 303 76 67 0 1
## 304 77 65 3 1
## 305 78 65 1 2
## 306 83 58 2 2
tail() returns the last 6 rows of the dataset, with all columns.
# range() gives the minimum and maximum of a particular attribute
range(hb$age)
## [1] 30 83
range(hb$year_operated)
## [1] 58 69
range(hb$p_node)
## [1] 0 52
These match the minimum and maximum values reported by summary().
# table() function
table(hb$age)
## 30 31 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
## 3 2 2 7 2 2 6 10 6 3 10 9 11 7 9 7 11 7 10 12 6 14 11 13 10
## 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 83
## 7 11 7 8 6 9 7 8 5 10 5 6 2 4 7 1 4 2 2 1 1 1 1 1
table(hb$year_operated)
## 58 59 60 61 62 63 64 65 66 67 68 69
## 36 27 28 26 23 30 31 28 28 25 13 11
table(hb$p_node)
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 136 41 20 20 13 6 7 7 7 6 3 4 2 5 4 3 1 1
## 18 19 20 21 22 23 24 25 28 30 35 46 52
## 1 3 2 1 3 3 1 1 1 1 1 1 1
table(hb$survival_status)
## 1 2
## 225 81
o table(hb$p_node) shows that 136 patients have 0 positive nodes, 41 patients have 1 node,
20 patients have 2 nodes and so on; only 1 patient has 52 nodes.
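The frequency table for p_node can be turned into proportions to see how the patients are distributed. The sketch below rebuilds the p_node column from the counts printed above, so the CSV file is not needed; node_values, node_counts and the share_* names are illustrative only.

```r
# Rebuild p_node from the frequency table of table(hb$p_node)
node_values <- c(0:25, 28, 30, 35, 46, 52)
node_counts <- c(136, 41, 20, 20, 13, 6, 7, 7, 7, 6, 3, 4, 2, 5, 4, 3, 1, 1,
                 1, 3, 2, 1, 3, 3, 1, 1, 1, 1, 1, 1, 1)
p_node <- rep(node_values, times = node_counts)

freq      <- table(p_node)        # counts per number of positive nodes
share     <- prop.table(freq)     # the same counts as proportions
cum_share <- cumsum(share)        # running total of the proportions

share_none <- mean(p_node == 0)   # patients with no positive nodes
share_lt5  <- mean(p_node < 5)    # patients with fewer than 5 positive nodes
share_le5  <- mean(p_node <= 5)   # patients with 5 or fewer positive nodes

print(round(100 * c(none = share_none, under_5 = share_lt5, up_to_5 = share_le5), 1))
```

With these counts, about 44% of patients have no positive nodes, about 75% have fewer than 5 and about 77% have 5 or fewer.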
Data Subsetting :
Subsetting is a way of extracting just the part of a large dataset in which an analyst is
interested.
# Patients who survived 5 years or longer
hb_sub1 <- subset(hb, survival_status == 1)
# Patients who died within 5 years
hb_sub2 <- subset(hb, survival_status == 2)
hb_sub1
hb_sub2
# First rows of hb_sub2:
# age year_operated p_node survival_status
# 1 34 59 0 2
# 2 34 66 9 2
# 3 38 69 21 2
# 4 39 66 0 2
# 5 41 60 23 2
# 6 41 64 0 2
# 7 41 67 0 2
# 8 42 69 1 2
# 9 42 59 0 2
# 10 43 58 52 2
# ... and so on
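As a small illustration of subset(), the sketch below applies the same filters to the first 10 rows of the dataset printed earlier. The names hb10, survived, died, died_idx and died_nodes are illustrative only.

```r
# First 10 rows of the Haberman data, as printed after loading the file
hb10 <- data.frame(
  age             = c(30, 30, 30, 31, 31, 33, 33, 34, 34, 34),
  year_operated   = c(64, 62, 65, 59, 65, 58, 60, 59, 66, 58),
  p_node          = c(1, 3, 0, 2, 4, 10, 0, 0, 9, 30),
  survival_status = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 1)
)

# Rows for each survival status
survived <- subset(hb10, survival_status == 1)
died     <- subset(hb10, survival_status == 2)

# The same filter written with logical indexing
died_idx <- hb10[hb10$survival_status == 2, ]

# subset() can also keep only some columns
died_nodes <- subset(hb10, survival_status == 2, select = c(age, p_node))

print(died)
```

subset() keeps the original row names, so the two rows in died are labelled 8 and 9 rather than 1 and 2.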
Application of the Grammar of Graphics (ggplot2)
Visualizations let us explore the dataset more interactively and help a reader without a data
background understand what is going on in the data set, by revealing correlations and suggesting
ways of formulating the analysis.
Installation of the packages
install.packages("ggplot2")
library(ggplot2)
#Visualisation using Barplot
# Bar plot of survival_status; factor() makes the x axis discrete (1 and 2 only)
hb_bar <- ggplot(data = hb, aes(x = factor(survival_status))) + geom_bar()
# From here we see how many patients survived and how many did not
chart_hb <- hb_bar
# Add a title and axis labels
chart_hb1 <- chart_hb + labs(title = "Survived vs did not survive", x = "Survival Status", y = "Number of patients")
# Add the count above each bar (after_stat() needs ggplot2 3.3.0 or later)
chart_hb2 <- chart_hb1 + geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.25)
chart_hb2
Observations:
o The plot clearly shows that 225 patients survived and 81 patients died.
o The counts above the bars come from the count statistic, and the title and the x and y axis labels were set with labs().
o The plot gives a clear picture of survival among the 306 patients in the data.
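The bar plot depends on survival_status being treated as a category rather than a number. A minimal sketch of that step is given below, assuming the 225/81 split reported by table(hb$survival_status); the labels and the names status, hb_status, counts and p are illustrative, and the plot is only drawn if ggplot2 (3.3.0 or later, for after_stat()) is installed.

```r
# Rebuild the survival_status column from the counts in table(hb$survival_status)
status <- rep(c(1, 2), times = c(225, 81))

# Convert the numeric codes to a labelled factor so plots treat them as categories
hb_status <- data.frame(
  survival_status = factor(status, levels = c(1, 2),
                           labels = c("Survived 5+ years", "Died within 5 years"))
)
counts <- table(hb_status$survival_status)
print(counts)

# Draw the bar plot only when ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(hb_status, aes(x = survival_status)) +
    geom_bar() +
    geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.25) +
    labs(title = "Survival after surgery", x = "Survival Status",
         y = "Number of patients")
  print(p)
}
```

With a labelled factor the axis shows the two categories by name, and the same column can be mapped to color, shape or fill in the later plots.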
Visualisation using Scatter_plot
scatter_hb <- ggplot(data = hb_data, aes(x = year_operated, y = age, shape = factor(survival_status), color = factor(survival_status))) + geom_point()
o It shows the age of patients who did or did not survive, by year of operation.
o The color and shape of each point show the survival status.
o A triangle indicates death within 5 years.
o A circle indicates survival for 5 years or longer.
o geom_point() draws the points of the scatter plot.
o The plot is age v/s year_operated (year of operation minus 1900).
o The color mapping gives 2 colors, one for each survival status.
Observations:
o For 1958, around 24 points are visible, of which 5 mark patients who died within 5 years. (The frequency table lists
36 patients for 1958, so some points presumably overlap.)
o The number of deaths appears to decrease as the years pass.
o In 1965, 10 patients died within 5 years.
#Visualisation Using Geom_point
# Color and shape are mapped inside geom_point() so that geom_smooth() fits one line to all patients
hb_scatter1 <- ggplot(data = hb, aes(x = year_operated, y = age)) +
geom_point(aes(shape = factor(survival_status), color = factor(survival_status))) + geom_smooth()
hb_scatter1
Observation:
o The blue line is the smoothed trend of age against year of operation. Read together with the point colors, the plot
suggests that the chances of survival increase, and mortality decreases, as the year of operation increases.
o In context, this suggests that as doctors gained experience with breast cancer, they
were able to save more patients.
#Visualisation Using Stack Bar
hab_stack <- ggplot(data = hb, aes(x = age, y = year_operated, fill = factor(survival_status)))
# Stacked bars
hab_stack1 <- hab_stack + geom_bar(stat = "identity")
# Side-by-side (dodged) bars
hab_stack2 <- hab_stack + geom_bar(stat = "identity", position = position_dodge())
# Few of the oldest patients are present, and they rarely survived 5 years, e.g. the 83-year-old patient
# Few patients under 40 died within 5 years
# Most deaths within 5 years occurred among patients aged 41 to 67
o The plot is year_operated v/s age, with the fill indicating survival
status.
Stack bar (p_node v/s age with indication of survival_status)
hb_stack3 <- ggplot(data = hb, aes(x = age, y = p_node, fill = factor(survival_status)))
hb_stack4 <- hb_stack3 + geom_bar(stat = "identity")
hb_stack5 <- hb_stack3 + geom_bar(stat = "identity", position = position_dodge())
o The plot indicates that the more positive nodes a patient has, the lower the survival rate.
o With a poorer prognosis, the patient has less chance of survival.
#Visualisation Using Density plot
hb_den <- ggplot(data = hb_data, aes(x = year_operated, color = factor(survival_status))) + geom_density()
# The density of operations is lower from 59.5 to 64.5, rises until 1965 and gradually
# falls from 1966
o A density plot shows the distribution of a numeric variable.
#Visualization using Pairs
# Scatter plot matrix of the three attributes, colored by survival status; oma leaves room on the right for the legend
pairs(hb_data[, 1:3], col = hb_data[, 4], oma = c(4, 4, 6, 12))
# Allow the legend to be drawn outside the plot region
par(xpd = TRUE)
legend(0.85, 0.6, as.vector(unique(hb_data$survival_status)), fill = c(1, 2))
o In this plot the black circles show the patients who survived 5 years or longer, considering
age, year of surgery and positive nodes, and the red circles show the patients who died within 5
years.
o pairs() plots the selected attributes (columns 1 to 3) against each other and colors the points by the output attribute, survival status.
o The oma argument of pairs() sets the outer margins, leaving space beside the plots.
o par(xpd = TRUE) allows drawing outside the plot region, so the legend can be placed in that space.
o legend() takes the position first and then the values of the output attribute.
o The plot covers all the attributes except survival status and is drawn with base R graphics.
Visualisation Using Histogram
hb_hist <- ggplot(data = hb_data, aes(x = age, color = factor(survival_status))) + geom_histogram(binwidth = 1,
fill = "white")
hb_hist # print the plot
Observations:
o The histogram shows survival status by age.
o The blue histogram shows the smaller number of patients who died within 5 years.
o The red histogram shows the larger number of patients who survived.
o There is one 83-year-old patient, who did not survive.
Visualization using boxplot
boxplot(hb_data$p_node, main = "Boxplot of Positive Nodes", xlab = "P_Nodes", ylab =
"Counts", col = "yellow")
# hb_sub1 and hb_sub2 are the subsets created in the Data Subsetting step
boxplot(hb_sub1[, 1:3], main = "Patients Survived 5 years or longer", las = 2, col = rainbow(3))
boxplot(hb_sub2[, 1:3], main = "Patients have not Survived 5 years", las = 2, col = rainbow(3))
o The 1st boxplot is for positive axillary nodes.
o The 2nd and 3rd are for patients who survived 5 years or longer and patients who did not.
o In the first boxplot the mean is 4.026, and the outliers are the values above 10, up to 52.
Observations:
o Outliers are present in p_node: the data ranges from 0 to 52, but the median is
very low, so the large values stand out as outliers.
o The maximum, 52, is the most extreme value.
o Unlike the mean, the median is not affected by these extreme values.
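The point at which a boxplot starts marking values as outliers can be checked numerically. The sketch below rebuilds p_node from the frequency table printed by table(hb$p_node) and uses boxplot.stats(), which applies the usual rule of 1.5 times the box height beyond the upper hinge; node_values, node_counts, bp, upper_fence and outliers are illustrative names.

```r
# Rebuild p_node from the frequency table of table(hb$p_node)
node_values <- c(0:25, 28, 30, 35, 46, 52)
node_counts <- c(136, 41, 20, 20, 13, 6, 7, 7, 7, 6, 3, 4, 2, 5, 4, 3, 1, 1,
                 1, 3, 2, 1, 3, 3, 1, 1, 1, 1, 1, 1, 1)
p_node <- rep(node_values, times = node_counts)

# boxplot.stats() returns the five numbers drawn by boxplot() and the outliers
bp <- boxplot.stats(p_node)
five <- bp$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker

# Values above this fence are drawn as outlier points
upper_fence <- five[4] + 1.5 * (five[4] - five[2])
outliers <- bp$out

print(five)
print(range(outliers))
print(length(outliers))
```

The box spans 0 to 4 with the median at 1, and the upper whisker ends at 10. Everything above 10 is drawn as an outlier point, which is why the long tail up to 52 dominates the first boxplot.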
Visualisation using Decision tree:
library(rpart)
# The formula survival_status ~ . predicts survival_status from all other attributes
model <- rpart(survival_status ~ .,
data = hb_data, control = rpart.control(minsplit = 3))
plot(model, compress = TRUE)
text(model, cex = 0.5, use.n = TRUE, fancy = FALSE, all = TRUE)
Observations:
o Here we created a decision tree with a minimum split size of 3.
o It is fitted on the whole dataset, with survival status as the response.
o Because survival_status is numeric, each node shows the mean status of its patients. A value between 1 and 2, such as 1.808,
means that most patients in that node died within 5 years rather than surviving for 5 years or longer.
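A node value such as 1.808 can be read as a proportion. The sketch below uses a hypothetical node of 26 patients (5 survived, 21 died), chosen only because its mean rounds to 1.808; the actual composition of the node in the fitted tree is not known from the plot.

```r
# Hypothetical node: 5 patients with status 1 (survived), 21 with status 2 (died)
node_status <- c(rep(1, 5), rep(2, 21))

# A regression tree on a numeric response reports the mean of the response per node
node_value <- mean(node_status)

# With the 1/2 coding, the share of patients who died is the node value minus 1
share_died <- mean(node_status == 2)

print(round(node_value, 3))
print(round(share_died, 3))
```

If survival_status is converted with factor() before fitting, rpart fits a classification tree instead, and each node reports a predicted class rather than a mean.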
Conclusion
o Although the maximum number of positive lymph nodes observed is 52,
about 75% of the patients have fewer than 5 positive lymph nodes and about
44% of the patients have no positive lymph nodes.
o About 77% of the patients have 5 or fewer positive lymph nodes.
o Of the 306 patients, 225 (about 74%) survived 5 years or longer and 81 (about 26%)
did not.

Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 

Recently uploaded (20)

Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 

Analysis of Haberman Dataset for Breast Cancer Survival

  • 1. Analysis on Haberman Dataset (Business Requirements Document), 4th May 2019. Authored by: Manju Yadav (BSc Statistics). Guidance of: Mr Pritesh Tiwari (Sr. Data Scientist).
  • 2. Document Approvals
    Source: The data set is available for download from the UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets/Haberman's+Survival
    Donor: Tjen-Sien Lim (limt '@' stat.wisc.edu). Dated: March 4, 1999.
    References:
    Haberman, S. J. (1976). "Generalized residuals for log-linear models", Proceedings of the 9th International Biometrics Conference, Boston, 104-122.
    Lichman, M. (2013). "UCI Machine Learning Repository", Irvine, CA: University of California, School of Information and Computer Science. http://archive.ics.uci.edu/ml
    Landwehr, J. M., Pregibon, D., and Shoemaker, A. C. (1984). "Graphical Models for Assessing Logistic Regression Models" (with discussion), Journal of the American Statistical Association 79: 61-83.
    Lo, W.-D. (1993). Logistic Regression Trees, PhD thesis, Department of Statistics, University of Wisconsin, Madison, WI.
    https://www.breastcancer.org/symptoms/diagnosis/lymph_nodes
    https://www.r-studio.com/Unformat_Help/systemrequirements.html
    Document changes: 03/05/2019, version 0.1, Initial Draft.
  • 3. Content
    1. Introduction
    2. EDA
    3. Technical installation
    4. Programming
       • Environment Configuration
       • Data Preparation
       • Data Subsetting
       • Data Visualization (using ggplot)
         o Histogram
         o Density
         o Barplot
         o Boxplot
         o Pairs
         o Stack Bar
         o Scatter Plot
         o Scatter Plot1
    5. Conclusion
  • 4. Introduction
    This dataset records the survival of women who underwent surgery for breast cancer. The cases come from a study conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients after that surgery. The aim here is to study patient survival through exploratory data analysis and to draw insights from it.
    The dataset has 306 rows and 4 columns: three attributes and one class label. The columns are:
    1. Age (30 to 83)
    2. Year of operation (stored as the year minus 1900; the values in the data run from 58 to 69)
    3. Number of positive axillary (lymph) nodes
    4. Survival status
       o 1 = the patient survived 5 years or longer after surgery
       o 2 = the patient died within 5 years after surgery
    The copy available at the UCI repository has no missing values and needs no further cleaning.
    Lymph node: Lymph nodes are small, bean-shaped organs that act as filters along the lymph fluid channels. As lymph fluid leaves the breast and eventually goes back into the bloodstream, the lymph nodes try to catch and trap cancer cells before they reach other parts of the body. Having cancer cells in the lymph nodes under the arm suggests an increased risk of the cancer spreading. In our data, the number of positive axillary nodes detected ranges from 0 to 52. The lymph nodes in the armpit (axillary lymph nodes) are the most common place for cancer cells to lodge, causing those nodes to swell.
  • 5. In context, the more positive nodes a patient has, the greater the risk that the cancer has spread. All patients in this dataset are women; breast cancer in men is rare. If breast cancer spreads, the lymph nodes in the underarm (the axillary lymph nodes) are the first place it is likely to go. Lymph node status is highly related to prognosis.
       o Lymph node-negative means the axillary lymph nodes do not contain cancer.
       o Lymph node-positive means the axillary lymph nodes contain cancer.
    Prognosis is better when cancer has not spread to the lymph nodes (lymph node-negative).
    Objectives of the project:
       o To understand the dataset.
       o To apply exploratory data analysis.
    Scope of the project:
       o To answer questions about the data and learn the techniques of exploratory data analysis.
       o To identify the important values from the analysis.
       o To identify the inter-relationships between the attributes.
  • 6. Exploratory Data Analysis:
       o Exploratory Data Analysis (EDA) is the process of asking questions of the data and applying statistics and visualization techniques to answer them and to uncover hidden insights.
       o Setting the target is the key step in EDA, because it shapes the whole analysis. In the Haberman dataset, the objective is to predict whether a patient will survive 5 years or longer based on the patient's age, year of operation and number of positive (lymph) nodes.
    Multivariate analysis:
       o Multivariate analysis is performed to understand interactions between different fields in the dataset.
       o The Haberman dataset is multivariate, so we have to understand how the attributes relate to and depend on each other.
       o Dimensionality reduction helps to identify the fields that account for the most variance between observations and allows a reduced volume of data to be processed.
    Univariate analysis:
       o The major purpose of univariate analysis is to describe, summarize and find patterns in a single feature.
    Hardware Requirements:
       RAM: Minimum 6 GB (Windows), 8 GB (Mac); Recommended 8 GB
       Storage: Minimum 7200 RPM SATA with 20 GB of available space; Recommended SSD with 40 GB of available space
       Processor: Minimum Intel Core i3 2.5 GHz; Recommended Intel Core i5
    Software Requirements: R and RStudio, Excel, Word
    1. Environment Configuration
    Set the working directory and load the data set into a variable.
       getwd()
       ## [1] "C:/Users/hp/Documents"
       setwd("D:/datasets/manju")
  • 7.    getwd()
       ## [1] "D:/datasets/manju"
       hb <- read.csv("D:/manju/data/haberman.csv", header = TRUE, sep = ",")
       hb_data <- hb
       #    age year_operated p_node survival_status
       # 1   30            64      1               1
       # 2   30            62      3               1
       # 3   30            65      0               1
       # 4   31            59      2               1
       # 5   31            65      4               1
       # 6   33            58     10               1
       # 7   33            60      0               1
       # 8   34            59      0               2
       # 9   34            66      9               2
       # 10  34            58     30               1
    1. Data Preparation
       # Check the class of the dataset.
       class(hb)
       ## [1] "data.frame"
    The object is a data frame.
    2. Understanding the summary
       summary(hb)
       ##       age        year_operated       p_node       survival_status
       ##  Min.   :30.00   Min.   :58.00   Min.   : 0.000   Min.   :1.000
       ##  1st Qu.:44.00   1st Qu.:60.00   1st Qu.: 0.000   1st Qu.:1.000
       ##  Median :52.00   Median :63.00   Median : 1.000   Median :1.000
       ##  Mean   :52.46   Mean   :62.85   Mean   : 4.026   Mean   :1.265
       ##  3rd Qu.:60.75   3rd Qu.:65.75   3rd Qu.: 4.000   3rd Qu.:2.000
       ##  Max.   :83.00   Max.   :69.00   Max.   :52.000   Max.   :2.000
       o The age summary shows that the youngest patient who underwent surgery was 30 and the oldest was 83.
       o The years of operation in the dataset run from 1958 to 1969.
       o The number of positive axillary nodes ranges from 0 to a maximum of 52 in one patient; a count that high suggests the cancer had spread extensively in that patient.
  • 8.    o The summary also gives the quartiles, mean and median of each column.
       # Variance-covariance matrix of all attributes.
       var(hb)
       ##                        age year_operated     p_node survival_status
       ## age             116.7145827   3.142912247 -4.9070824     0.324397300
       ## year_operated     3.1429122  10.558630665 -0.0879460    -0.006846673
       ## p_node           -4.9070824  -0.087945998 51.6911175     0.911089682
       ## survival_status   0.3243973  -0.006846673  0.9110897     0.195274831
       o The diagonal entries are the variances: the variance of an attribute is the average squared distance of its values from their mean (with n - 1 in the denominator for a sample).
       o The off-diagonal entries are covariances, which show whether two attributes tend to move together (positive) or in opposite directions (negative).
       # Overall structure of the dataset.
       str(hb)
       ## 'data.frame': 306 obs. of 4 variables:
       ## $ age            : double 30 30 30 31 31 33 33 34 34 34 ...
       ## $ year_operated  : double 64 62 65 59 65 58 60 59 66 58 ...
       ## $ p_node         : double 1 3 0 2 4 10 0 0 9 30 ...
       ## $ survival_status: double 1 1 1 1 1 1 1 2 2 1 ...
       o This gives the structure of the Haberman dataset: its class, dimensions, attribute names, data types and the first few values of each attribute. The dataset contains 306 observations of 4 variables (three attributes and the survival status), each of type "double".
       # Column names of the Haberman dataset
       colnames(hb)
       ## [1] "age" "year_operated" "p_node" "survival_status"
    These are the names of the 4 columns.
       # head function
       head(hb)
       ##   age year_operated p_node survival_status
       ## 1  30            64      1               1
       ## 2  30            62      3               1
       ## 3  30            65      0               1
       ## 4  31            59      2               1
       ## 5  31            65      4               1
       ## 6  33            58     10               1
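The reading of var(hb) as a variance-covariance matrix can be checked by hand. The sketch below is illustrative only: it uses just the first ten rows shown in the listing above (so its numbers differ from the full 306-row output), and `hb10`, `manual_var` and `manual_cov` are hypothetical names introduced for the example.

```r
# First ten rows of the dataset, copied from the listing above.
# hb10 is a hypothetical name; the full data has 306 rows.
hb10 <- data.frame(
  age             = c(30, 30, 30, 31, 31, 33, 33, 34, 34, 34),
  year_operated   = c(64, 62, 65, 59, 65, 58, 60, 59, 66, 58),
  p_node          = c(1, 3, 0, 2, 4, 10, 0, 0, 9, 30),
  survival_status = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 1)
)

# Sample variance by hand: average squared distance from the mean,
# with n - 1 in the denominator (hypothetical helper).
manual_var <- function(x) sum((x - mean(x))^2) / (length(x) - 1)

cov_matrix <- var(hb10)              # full variance-covariance matrix
by_hand    <- sapply(hb10, manual_var)

# The diagonal of var() holds the per-column variances.
print(diag(cov_matrix))
print(by_hand)

# An off-diagonal entry is a covariance, here between age and p_node.
manual_cov <- sum((hb10$age - mean(hb10$age)) *
                  (hb10$p_node - mean(hb10$p_node))) / (nrow(hb10) - 1)
print(c(from_var = cov_matrix["age", "p_node"], by_hand = manual_cov))
```

If only the per-column variances are wanted, sapply(hb, var) returns them directly without the covariances.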
  • 9. head() returns the first 6 rows of the dataset.
       # tail function
       tail(hb)
       ##     age year_operated p_node survival_status
       ## 301  74            63      0               1
       ## 302  75            62      1               1
       ## 303  76            67      0               1
       ## 304  77            65      3               1
       ## 305  78            65      1               2
       ## 306  83            58      2               2
    tail() returns the last 6 rows of the dataset.
       # range() gives the smallest and largest value of an attribute
       range(hb$age)
       ## [1] 30 83
       range(hb$year_operated)
       ## [1] 58 69
       range(hb$p_node)
       ## [1] 0 52
    These match the minimum and maximum values reported by summary().
       # table() function
       table(hb$age)
       ## 30 31 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
       ##  3  2  2  7  2  2  6 10  6  3 10  9 11  7  9  7 11  7 10 12  6 14 11 13 10
       ## 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 83
       ##  7 11  7  8  6  9  7  8  5 10  5  6  2  4  7  1  4  2  2  1  1  1  1  1
       table(hb$year_operated)
       ## 58 59 60 61 62 63 64 65 66 67 68 69
       ## 36 27 28 26 23 30 31 28 28 25 13 11
       table(hb$p_node)
  • 10.   ##   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
       ## 136  41  20  20  13   6   7   7   7   6   3   4   2   5   4   3   1   1
       ##  18  19  20  21  22  23  24  25  28  30  35  46  52
       ##   1   3   2   1   3   3   1   1   1   1   1   1   1
       table(hb$survival_status)
       ##   1   2
       ## 225  81
       o table(hb$p_node) shows that 136 patients have 0 positive nodes, 41 have 1, 20 have 2, and so on up to a single patient with 52.
    Data Subsetting:
    Subsetting extracts just the part of a larger dataset that the analyst is interested in.
       hb_sub1 <- subset(hb, survival_status == 1)   # survived 5 years or longer
       hb_sub2 <- subset(hb, survival_status == 2)   # died within 5 years
       hb_sub1
       hb_sub2
       # First rows of hb_sub2 (row numbers as shown on the slide):
       #    age year_operated p_node survival_status
       # 1   34            59      0               2
       # 2   34            66      9               2
       # 3   38            69     21               2
       # 4   39            66      0               2
       # 5   41            60     23               2
       # 6   41            64      0               2
       # 7   41            67      0               2
       # 8   42            69      1               2
       # 9   42            59      0               2
       # 10  43            58     52               2
    And so on.
    Application of the Grammar of Graphics (ggplot)
    With ggplot we can create a variety of visualizations that make the dataset easier to understand, so that even a reader without a data background can see what is going on in it, spot relationships between attributes and find directions for further analysis.
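The subsetting step can be tried on a small sample. The sketch below is a minimal illustration that uses only the first ten rows listed on slide 7; `hb10`, `survived` and `died` are hypothetical names, and the counts apply to this sample, not to the full dataset.

```r
# First ten rows of the dataset, copied from the slide 7 listing.
hb10 <- data.frame(
  age             = c(30, 30, 30, 31, 31, 33, 33, 34, 34, 34),
  year_operated   = c(64, 62, 65, 59, 65, 58, 60, 59, 66, 58),
  p_node          = c(1, 3, 0, 2, 4, 10, 0, 0, 9, 30),
  survival_status = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 1)
)

# subset() keeps the rows for which the condition is TRUE.
survived <- subset(hb10, survival_status == 1)

# Logical indexing with [rows, columns] is an equivalent base R idiom.
died <- hb10[hb10$survival_status == 2, ]

# Every row lands in exactly one of the two subsets.
print(nrow(survived))
print(nrow(died))
print(nrow(survived) + nrow(died) == nrow(hb10))
```

Comparing against the number 1 rather than the string '1' avoids an implicit type conversion, although both give the same rows here.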
  • 11. Installation of the packages
       install.packages("ggplot2")
       library(ggplot2)
    Visualisation using Barplot
       # Bar plot of the survival status of the patients.
       # factor() makes ggplot treat the status codes 1 and 2 as categories.
       chart_hb <- ggplot(data = hb, aes(x = factor(survival_status))) + geom_bar()
       # The bars show how many patients survived and how many did not.
       chart_hb1 <- chart_hb + labs(title = "Number of patients who survived vs did not survive", x = "Survival Status", y = "Number of patients")
       # Add the count above each bar (after_stat() needs ggplot2 3.3.0 or later; older versions use ..count..).
       chart_hb2 <- chart_hb1 + geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.25)
       chart_hb2
  • 12. Observations:
       o The plot clearly shows that 225 patients survived and 81 patients died.
       o The counts above the bars come from the count statistic; the axis labels and title were set with labs().
       o Together this gives a clear picture of survival among the 306 patients in the data.
  • 13. Visualisation using Scatter_plot
       scatter_hb <- ggplot(data = hb_data, aes(x = year_operated, y = age, shape = factor(survival_status), color = factor(survival_status))) + geom_point()
       scatter_hb
       o The plot shows age against year of operation (year minus 1900), with each point marked by survival status.
       o Shape and colour both encode survival status, so the plot uses two colours.
       o A triangle indicates death within 5 years.
       o A circle indicates survival of five years or longer.
       o geom_point() draws the points; a smoothing line is added in the next plot.
    Observations:
       o For 1958 the plot shows about 24 distinct points, 5 of them for patients who died within 5 years. (The frequency table counts 36 operations in 1958, so some points overlap.)
       o As the years pass, the number of deaths appears to decrease.
       o In 1965, 10 patients died.
  • 14. Visualisation Using Geom_point
       hb_scatter1 <- ggplot(data = hb, aes(x = year_operated, y = age)) + geom_point(aes(shape = factor(survival_status), color = factor(survival_status))) + geom_smooth()
       hb_scatter1
    Observation:
       o The blue line is read here as indicating that the chances of survival increase, and the mortality rate decreases, as the year of operation increases. Note that the smoothed line itself tracks age against year of operation, so this reading rests on the pattern of the points rather than on the line alone.
       o In context, this would suggest that as doctors learned more about the disease and gained experience, they were able to save more breast cancer patients.
  • 15. Visualisation Using Stack Bar
       hb_stack <- ggplot(data = hb, aes(x = age, y = year_operated, fill = factor(survival_status)))
       hb_stack1 <- hb_stack + geom_bar(stat = "identity")
       hb_stack2 <- hb_stack + geom_bar(stat = "identity", position = position_dodge())
       # Few of the oldest patients survived 5 years, for example the 83-year-old patient.
       # A few patients younger than 40 did not survive.
       # Most of the deaths within 5 years occur among patients between 41 and 67 years of age.
       o This plot shows year_operated against age, with survival status indicated by the fill colour.
    Stack bar (p_node v/s age with indication of survival_status)
       hb_stack3 <- ggplot(data = hb, aes(x = age, y = p_node, fill = factor(survival_status)))
       hb_stack4 <- hb_stack3 + geom_bar(stat = "identity")
       hb_stack5 <- hb_stack3 + geom_bar(stat = "identity", position = position_dodge())
  • 16.   o The plot indicates that the more positive nodes a patient has, the lower the survival rate.
       o A poorer prognosis means a lower chance of survival for the patient.
  • 17. Visualisation Using Density plot
       hb_den <- ggplot(data = hb_data, aes(x = year_operated, color = factor(survival_status))) + geom_density()
       hb_den
       # The density of operations is lower from about 59.5 to 64.5, rises until 1965 and gradually falls from 1966.
       o In this plot, black marks the patients who survived 5 years or longer and red marks the patients who died within 5 years.
       o A density plot shows the distribution of a numeric variable.
  • 18. Visualization using Pairs
       pairs(hb_data[, 1:3], col = hb_data[, 4], oma = c(4, 4, 6, 12))
       par(xpd = TRUE)
       legend(0.85, 0.6, as.vector(unique(hb_data$survival_status)), fill = c(1, 2))
       o pairs() plots every pair of the first three attributes against each other and colours the points by the fourth column, survival status.
       o The oma argument sets the outer margins, leaving room on the right-hand side for the legend.
       o par(xpd = TRUE) allows drawing outside the plot region, so the legend can sit in that margin.
       o legend() takes the position first and then the labels, here the distinct values of survival status.
       o The result is a plot of all attributes except survival status, drawn with base R graphics rather than ggplot2.
  • 19. Visualisation Using Histogram
       hb_hist <- ggplot(data = hb_data, aes(x = age, color = factor(survival_status))) + geom_histogram(binwidth = 1, fill = "white")
       hb_hist   # typing the object name draws the plot
    Observations:
       o The histogram shows survival status by age.
       o The blue histogram shows the smaller group, the patients who died.
       o The red histogram shows the larger group, the patients who survived.
       o There is one 83-year-old patient, who did not survive.
  • 20. Visualization using boxplot
       boxplot(hb_data$p_node, main = "Boxplot of Positive Nodes", xlab = "P_Nodes", ylab = "Counts", col = "yellow")
       boxplot(hb_sub1[, 1:3], main = "Patients Survived 5 years or longer", las = 2, col = rainbow(3))
       boxplot(hb_sub2[, 1:3], main = "Patients have not Survived 5 years", las = 2, col = rainbow(3))
       o The 1st plot is for positive axillary nodes.
       o The 2nd and 3rd plots are for patients who survived 5 years or longer and patients who did not.
       o In the first boxplot the minimum is 0, the mean is 4.026, and the outliers lie above 10 and extend to 52.
    Observations:
       o Outliers are present in p_node: the data range from 0 to 52, but the median is only 1, so most values are small and the large ones stand out.
       o The value 52 is the most extreme.
       o The median is not affected by these extreme values, whereas the mean is pulled upwards by them.
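The outlier cut-off on the p_node boxplot can be reproduced from the frequency table on slide 10. The sketch below rebuilds the 306 p_node values from that table and applies the usual 1.5 x IQR rule for the upper fence; `node_values`, `node_counts` and `upper_fence` are hypothetical names introduced for the example.

```r
# Rebuild the p_node column from the table(hb$p_node) output on slide 10.
node_values <- c(0:25, 28, 30, 35, 46, 52)
node_counts <- c(136, 41, 20, 20, 13, 6, 7, 7, 7, 6, 3, 4, 2, 5, 4, 3, 1, 1,
                 1, 3, 2, 1, 3, 3, 1, 1, 1, 1, 1, 1, 1)
p_node <- rep(node_values, times = node_counts)

# Quartiles and interquartile range.
q1  <- unname(quantile(p_node, 0.25))
q3  <- unname(quantile(p_node, 0.75))
iqr <- q3 - q1

# Boxplot convention: points beyond Q3 + 1.5 * IQR are drawn as outliers.
upper_fence <- q3 + 1.5 * iqr

print(c(q1 = q1, median = median(p_node), q3 = q3, mean = mean(p_node)))
print(upper_fence)
print(sum(p_node > upper_fence))   # number of patients drawn as outliers
```

The mean sits well above the median here, which is the pull of the extreme values that the observations describe.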
  • 21. Visualisation using Decision tree:
       library(rpart)
       # The formula survival_status ~ . predicts survival_status from all other columns.
       model <- rpart(survival_status ~ ., data = hb_data, control = rpart.control(minsplit = 3))
       plot(model, compress = TRUE)
       text(model, cex = 0.5, use.n = TRUE, fancy = FALSE, all = TRUE)
    Observations:
       o The decision tree was built with a minimum split size of 3.
       o It partitions the whole dataset according to survival status.
       o Because survival_status is numeric, the nodes show averages between 1 and 2. A value such as 1.808 is close to 2, which suggests that patients in that node are more likely to die within 5 years than to survive 5 years or longer.
  • 22. Conclusion
       o Although the maximum number of positive lymph nodes observed is 52, about 75% of the patients (230 of 306) have fewer than 5 positive lymph nodes and about 44% (136 of 306) have none.
       o About 77% of the patients (236 of 306) have 5 or fewer positive lymph nodes.
       o Of the 306 patients, 225 (about 74%) survived 5 years or longer and 81 (about 26%) did not.
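The shares of patients by node count can be recomputed from the frequency tables on slides 9 and 10. This is a minimal sketch under that assumption: the p_node values are rebuilt from the table(hb$p_node) output, the survival counts are taken from table(hb$survival_status), and all variable names are hypothetical.

```r
# Rebuild the p_node column from the table(hb$p_node) output on slide 10.
node_values <- c(0:25, 28, 30, 35, 46, 52)
node_counts <- c(136, 41, 20, 20, 13, 6, 7, 7, 7, 6, 3, 4, 2, 5, 4, 3, 1, 1,
                 1, 3, 2, 1, 3, 3, 1, 1, 1, 1, 1, 1, 1)
p_node <- rep(node_values, times = node_counts)

# mean() of a logical vector is the proportion of TRUE values.
pct_none       <- 100 * mean(p_node == 0)
pct_under_five <- 100 * mean(p_node < 5)
pct_up_to_five <- 100 * mean(p_node <= 5)

# Survival counts from table(hb$survival_status).
status_counts <- c(survived = 225, died = 81)
pct_status    <- 100 * status_counts / sum(status_counts)

print(round(c(none = pct_none, under_five = pct_under_five,
              up_to_five = pct_up_to_five), 1))
print(round(pct_status, 1))
```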