Machine learning is the scientific study of algorithms and statistical models that machines use to perform a specific task by relying on patterns and inference rather than explicit instructions. This research and analysis aims to observe how precisely a machine can predict whether the breast mass of a patient suspected of having breast cancer is malignant or benign. In this paper, the classification of cancer type and the prediction of risk levels are done with various machine learning models and depicted pictorially with various visual analytics tools.
Simplified Knowledge Prediction: Application of Machine Learning in Real Life
Techno India University, WB
Mr. Sayan Adhikary, B.Sc (H) Data Science – 2nd Year
Miss Ankita Jash, B.Sc (H) Data Science – 2nd Year
Mr. Avishek Das, B.Sc (H) Data Science – 2nd Year
Acknowledgement:
Techno India University, West Bengal, for giving us the opportunity to do this project.
Mr. Shantanu P. Chakraborty, Assistant Professor, Techno India University
Mrs. Peea Bal, Assistant Professor, Techno India University
Contents:
1 Abstract
2 Introduction
3 Scrutinized Analysis of Dataset
4 Procedure
5 Scope
6 Challenges and Opportunities
7 Conclusion
8 References and Links
Abstract:
The world today is highly dependent on data. In this data-driven
era of fast-growing technologies, a huge amount of data is
generated, captured and maintained for a variety of purposes.
Machine learning models use existing data to derive meaningful
insights into how various factors affect the development of
different nations and industries, and to predict outcomes
accordingly. In prediction tasks, machine learning often works
together with data visualization techniques to make the
inferences easier for the user to understand.
This paper aims to analyze and predict the outcome of a
real-life case study using various tools of visual analytics and
machine learning. The dataset consists of cell samples from
patients suspected of having breast cancer. We apply machine
learning models to improve the accuracy of cancer susceptibility
prediction by classifying each case as benign or malignant. The
machine learning models used in this paper are the decision tree
algorithm and logistic regression. The integration of
multidimensional heterogeneous data, combined with the
application of different machine learning techniques, may open a
new path in the domain of cancer detection. The paper also
explores the challenges and limitations in order to outline the
scope for future research.
Keywords: Machine Learning, Algorithm, Cancer detection, Visual analytics
Introduction:
Machine learning is the scientific study of algorithms and
statistical models that machines use to perform a specific task
by relying on patterns and inference rather than explicit
instructions. A machine learning workflow typically divides the
data into two parts, known as the training data and the test
data. Algorithms based on mathematical models are fitted on the
training data so that they can make predictions or decisions
without being explicitly programmed to perform the task, and the
test data is then used to check how well those predictions hold
on samples the model has not seen.
Machine learning has two main types: supervised machine learning
(the data contains both the inputs and the desired outputs) and
unsupervised machine learning (the data contains only the inputs,
without the desired output labels). The prediction task in this
paper is a supervised one, because every sample in the dataset
carries a diagnosis label.
This research and analysis aims to observe how precisely a
machine can predict whether the breast mass of a patient
suspected of having breast cancer is malignant or benign. Cancer
has been characterized as a heterogeneous disease consisting of
various subtypes. Early detection and prognosis of a cancer type
has become a necessity in cancer research because it facilitates
the subsequent clinical management of patients. In this paper,
the classification of cancer type and the prediction of risk
levels are done with various machine learning models and
depicted pictorially with various visual analytics tools.
The secondary dataset we collected was created by Dr. William H.
Wolberg, a physician at the University of Wisconsin Hospital at
Madison, Wisconsin, USA. To create this dataset, Dr. Wolberg
used fluid samples taken from patients with solid breast masses
and an easy-to-use graphical computer program called Xcyt, which
can analyze cytological features based on a digital scan. The
program first uses a curve-fitting algorithm to compute ten
features for each cell in the sample. It then calculates the
mean value, the extreme value and the standard error of each of
the ten features for the image, returning a vector of 30 real
values.
A person who suffered from breast cancer
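The 30-value vector described above can be illustrated with a minimal sketch. It is not the Xcyt implementation: the function names, the made-up measurements and the reading of "extreme value" as the mean of the three largest values are all assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, stdev

def summarize_feature(values, n_largest=3):
    """Return (mean, standard error, extreme) for one feature over all cells."""
    m = mean(values)
    se = stdev(values) / sqrt(len(values))  # standard error of the mean
    # Assumption: "extreme" is taken as the mean of the n_largest largest values.
    extreme = mean(sorted(values, reverse=True)[:n_largest])
    return m, se, extreme

def image_vector(cells, feature_names):
    """cells: one dict per nucleus, mapping feature name -> measured value."""
    means, errors, extremes = [], [], []
    for name in feature_names:
        m, se, ex = summarize_feature([cell[name] for cell in cells])
        means.append(m)
        errors.append(se)
        extremes.append(ex)
    # Ten means, then ten standard errors, then ten extremes.
    return means + errors + extremes

feature_names = ["radius", "texture", "perimeter", "area", "smoothness",
                 "compactness", "concavity", "concave points",
                 "symmetry", "fractal dimension"]

# Hypothetical measurements for four nuclei (made-up numbers, not real data).
cells = [{name: base + i for i, name in enumerate(feature_names)}
         for base in (10.0, 12.0, 14.0, 16.0)]

vector = image_vector(cells, feature_names)
print(len(vector))
```

Each image therefore contributes one row of 30 numbers to the dataset, regardless of how many nuclei were measured in it.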
We performed classification and prediction on the dataset with
the decision tree and logistic regression models of machine
learning, using Python and its library packages pandas,
matplotlib, seaborn and scikit-learn. The data visualization was
done with pairplots.
The inferences and predictions from this paper may be helpful to
cancer research and may help improve the accuracy of cancer
susceptibility, recurrence and survival prediction.
Scrutinized Analysis of the Dataset:
In this paper, we used the breast cancer dataset from the UCI
Machine Learning Repository. The attributes of the dataset are
the ID number (attribute 1), the diagnosis (attribute 2; M =
malignant, B = benign) and thirty real-valued features
(attributes 3–32). Ten real-valued features are computed for
each cell nucleus: radius, texture, perimeter, area, smoothness,
compactness, concavity, concave points, symmetry and fractal
dimension. The radius is the mean of the distances from the
centre to points on the perimeter, the texture is the standard
deviation of the grey-scale values, smoothness is the local
variation in radius lengths, compactness is "perimeter² / area -
1.0", concavity is the severity of the concave portions of the
contour, concave points is the number of concave portions of the
contour and fractal dimension is the "coastline approximation"
- 1.
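The radius and compactness definitions above can be made concrete with a short sketch. It assumes the nucleus boundary is available as an ordered list of (x, y) points, takes the centre as the average of those points, and uses the shoelace formula for the area. These choices and the function name `contour_features` are illustrative assumptions, not the method used to build the dataset.

```python
import math

def contour_features(points):
    """points: ordered (x, y) vertices of a closed contour."""
    n = len(points)
    # Assumption: the centre is the average of the boundary points.
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # Radius: mean distance from the centre to the points on the perimeter.
    radius = sum(math.hypot(x - cx, y - cy) for x, y in points) / n

    perimeter = 0.0
    twice_area = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]  # wrap around to close the contour
        perimeter += math.hypot(x2 - x1, y2 - y1)
        twice_area += x1 * y2 - x2 * y1  # shoelace formula term
    area = abs(twice_area) / 2.0

    compactness = perimeter ** 2 / area - 1.0
    return {"radius": radius, "perimeter": perimeter,
            "area": area, "compactness": compactness}

# A circle of radius 5 approximated by 360 boundary points.
circle = [(5 * math.cos(math.radians(a)), 5 * math.sin(math.radians(a)))
          for a in range(360)]
features = contour_features(circle)
print(features)
```

For a perfect circle the compactness equals 4π - 1, which is the smallest value any closed shape can reach, so more irregular contours score higher.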
Procedure:
Phase 1: Data Exploration
We used Python from the Anaconda Prompt shell to work on this
dataset. In the data exploration phase we first imported the
necessary libraries and our dataset. The necessary libraries
include scikit-learn, matplotlib (with its pyplot module),
seaborn and pandas. We then selected the feature columns ("X")
and the target column ("y") by their column index numbers.
Fig 1: Dataset and X set after importing the dataset
After importing the dataset and the required libraries, we
checked whether the dataset contains any missing or null data
points using pandas functions.
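The import and null-check steps of this phase might look like the sketch below. Because the authors' data file is not available here, it assumes the copy of the Wisconsin diagnostic dataset bundled with scikit-learn, which has no ID column and stores the diagnosis in a `target` column (0 = malignant, 1 = benign). The variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer

# Assumption: use scikit-learn's bundled copy instead of the authors' file.
# as_frame=True returns a pandas DataFrame with 30 features plus "target".
df = load_breast_cancer(as_frame=True).frame

# Count missing values per column, then over the whole table.
missing_per_column = df.isnull().sum()
total_missing = int(missing_per_column.sum())
print("Missing values:", total_missing)

# Drop any rows with missing values (a no-op when there are none).
df_clean = df.dropna()

# Select features and target by column position, as described above.
X = df_clean.iloc[:, :-1].values   # every column except the last
y = df_clean.iloc[:, -1].values    # last column holds the diagnosis
print(X.shape, y.shape)
```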
After checking for and removing null values, we used the seaborn
library to visualize the distribution of the features. The
visualization below shows a pairplot of all ten features of the
dataset.
Fig 2: Visualization of dataset.
Phase 2: Categorical Data
In this phase we used the LabelEncoder class from scikit-learn
to convert the categorical diagnosis labels into numbers. We
then split the dataset into training data and test data with
the train_test_split function of the scikit-learn library in
Python.
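The encoding and splitting steps can be sketched as below. The 25% test fraction and the fixed `random_state` are assumptions, since the paper does not state the split it used. The "M"/"B" strings are rebuilt from scikit-learn's bundled copy of the dataset only so that the encoder has categorical input to work on.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

data = load_breast_cancer()
X = data.data
# Rebuild "M"/"B" labels; the bundled copy codes malignant as 0, benign as 1.
diagnosis = ["M" if t == 0 else "B" for t in data.target]

# LabelEncoder sorts the classes alphabetically, so B -> 0 and M -> 1.
encoder = LabelEncoder()
y = encoder.fit_transform(diagnosis)

# Assumption: hold out 25% of the samples for testing.
# stratify=y keeps the benign/malignant ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
print(X_train.shape, X_test.shape)
```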
Phase 3: Feature Scaling
In this phase of the analysis we scaled the data with the
StandardScaler class from the scikit-learn library of Python,
so that every feature has zero mean and unit variance.
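A minimal sketch of the scaling step follows, again assuming scikit-learn's bundled copy of the dataset and a 25% test split. Fitting the scaler on the training data only and reusing it on the test data is a common practice assumed here, not something the paper states.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scaler = StandardScaler()
# Learn each feature's mean and standard deviation from the training data...
X_train_scaled = scaler.fit_transform(X_train)
# ...and apply the same transformation to the test data.
X_test_scaled = scaler.transform(X_test)

# After scaling, each training feature has mean ~0 and standard deviation ~1.
print(np.round(X_train_scaled.mean(axis=0)[:3], 6))
print(np.round(X_train_scaled.std(axis=0)[:3], 6))
```

Scaling matters mostly for logistic regression, whose fit depends on feature magnitudes. Decision trees split on thresholds and are largely unaffected by it.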
Phase 4: Model Selection
This is one of the major phases of the analysis, in which we
applied machine learning algorithms to the dataset. This phase
is also known as algorithm selection, because here we choose the
algorithm that predicts the best results.
In this phase we used the sklearn library to import the
classification algorithms.
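Fitting the two classifiers named in this paper might look like the sketch below. The hyperparameters (`max_iter`, the entropy criterion, `random_state`) and the 25% split are assumptions, so the accuracies it prints need not match the figure reported in this paper.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Assumption: hyperparameters are illustrative, not the authors' settings.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
}

predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)               # learn from the training data
    predictions[name] = model.predict(X_test)  # predict the held-out samples
    accuracy = (predictions[name] == y_test).mean()
    print(f"{name}: {accuracy:.3f}")
```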
We then predicted the test set results and checked the accuracy
of each model. To check the accuracy, we imported the
confusion_matrix function from the sklearn.metrics module. We
used classification accuracy to evaluate our models (where
accuracy = number of correct predictions / total number of
predictions).
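The accuracy formula above can be read directly off a confusion matrix, as this small sketch shows. The eight labels are made-up values chosen for illustration, not results from this paper.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels (1 = malignant, 0 = benign).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Correct predictions sit on the diagonal of the matrix.
accuracy = cm.trace() / cm.sum()
print(accuracy)

# scikit-learn's own helper gives the same number.
print(accuracy_score(y_true, y_pred))
```

The off-diagonal cells separate the two kinds of error, benign cases predicted as malignant and malignant cases predicted as benign. A single accuracy figure hides that difference.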
After checking, we found that our logistic regression model and
our decision tree model both reach 95.8% accuracy. From this
phase we can therefore conclude that the two machine learning
models perform equally well on our dataset.
Scope:
This paper may be helpful to hospitals that have modern cancer
treatment facilities.
Oncologists could use our procedure to obtain a prediction
within seconds.
Challenges and Opportunities:
Challenges:
Any error made while implementing the algorithm may lead to
inaccurate results.
Making people aware of modern machine learning techniques and
earning their trust is one of the biggest challenges.
Opportunities:
If this model works well, machine learning techniques can be
used to analyze and predict other fatal diseases.
There is also scope for further research in this field.
Conclusion:
Results and Findings: From a dataset of patients suspected of
having cancer, we can predict whether the cancer is benign or
malignant. If it is malignant, the risk factor is high. If it
is benign, the risk factor is comparatively low.
We expect machine learning models to play a major role in the
future of cancer prediction.
We have identified a number of trends with respect to the
types of machine learning models being used, the types of
training data being integrated, the kinds of endpoint
predictions being made, the types of cancers being studied
and the overall performance of the models in predicting
cancer susceptibility or outcomes.
Based on the analysis of the results, the integration of
multidimensional heterogeneous data, combined with the
application of different machine learning techniques for
feature selection, classification and prediction, can provide
promising tools for inference in the cancer domain.