Introduction
Random forest is one of the most successful ensemble methods, showing performance comparable to boosting and support vector machines. It is fast, robust to noise, resistant to overfitting, and offers ways to interpret and visualize its output. We will study options that increase the strength of the individual trees in the forest or reduce their correlation. Using several attribute evaluation methods instead of a single one produces promising results, and in most comparable cases, using weighted margin voting instead of ordinary majority voting yields statistically significant improvements across multiple data sets.
Nowadays, machine learning (ML) is becoming more and more important, and with the rapid growth of medical data and improvements in data quality, it has become a key technology. However, because healthcare data are complex, incomplete, and high-dimensional, early and accurate detection of disease remains a challenge. Data preprocessing is an essential step in machine learning: its primary purpose is to supply the learner with cleaned data that improve prediction accuracy. This dissertation summarizes the available data preprocessing steps based on their usage, popularity, and the literature. The selected preprocessing methods are then applied to the raw data, and the classifier uses the result for prediction.
Data mining faces the challenge of discovering structured knowledge in large data streams to support management decision-making. Although research in operations research, direct marketing, and machine learning centers on the analysis and design of data mining algorithms, the connection between data mining and the preceding stage of data preprocessing has not been studied in detail. This work therefore considers the effects of different preprocessing techniques, namely attribute scaling, sampling, categorical attribute coding, and continuous attribute coding, on the performance of decision trees, neural networks, and support vector machines.
Problem statement
We use machine learning to predict breast cancer cases from patient treatment history and health data, working with the Wisconsin Breast Cancer dataset. Among women, breast cancer is a leading cause of death, and breast cancer risk prediction can inform screening and preventive measures.
Recent studies found that adding inputs to the widely used Gail model can improve its ability to predict breast cancer risk. However, these models use simple statistical designs, and the additional information comes from costly and invasive procedures.
In contrast, we want to build a machine learning model that uses personal health data to predict breast cancer risk over a horizon of more than five years. This requires one machine learning model that uses only the Gail model inputs, and another that combines the Gail model inputs with other personal health data related to breast cancer risk.
The essential objectives of cancer prediction differ from those of cancer detection and diagnosis. In cancer prediction/prognosis, one is concerned with three basic prediction tasks: 1) cancer susceptibility prediction (i.e., risk assessment), 2) cancer recurrence prediction, and 3) cancer survival rate prediction. In the first case, we try to predict the probability of developing a particular type of cancer before it occurs. In the second case, we try to predict the chance of the cancer recurring after the disease has apparently been eliminated.
In the third case, we try to predict the outcome (life expectancy, survival, progression, drug sensitivity of the tumour) after the disease is diagnosed. In the last two cases, the success of prognostic prediction depends in part on the success or quality of the diagnosis. However, disease prediction can only be made after a clinical diagnosis, and prognosis prediction must take into account more than a simple diagnosis.
Through a multifactor analysis of variance over various performance indicators and method parameters, it is possible to evaluate and provide empirical evidence that data preprocessing significantly affects prediction accuracy, and that specific combinations prove inferior to competing methods. It is also found that: (1) the selected methods are sensitive to different data representations, much as they are to method parameterization, which shows the potential for improving performance through effective preprocessing; (2) the influence of a preprocessing scheme depends on the method, so indicators that use method-specific "best practice" settings can improve the results of a particular method; and (3) the sensitivity of an algorithm to preprocessing is therefore a necessary criterion for method evaluation and selection. In predictive data mining, this criterion must complement the traditional ones of predictive ability and computational efficiency.
To maximize the prediction accuracy of data mining, machine learning research mainly focuses on enhancing competitive classifiers and effectively tuning algorithm parameters. This is usually tested in extensive benchmark experiments, using preprocessed data sets to evaluate the impact on prediction accuracy and computational efficiency.
In contrast, while feature selection, resampling, and the discretization of continuous attributes have been studied in detail, few publications survey how the coding of categorical attributes and attribute scaling affect predictions. More critically, in data mining, and especially in the medical field, there is no precise analysis of how these choices interact with prediction accuracy.
3.1. Preprocessing methods
This dissertation considers the three main standard preprocessing steps of NLP: stemming, punctuation removal, and stop word removal. In stemming, we obtain the stem form of each word in the data set, the part of the word to which affixes can be attached.
Stemming algorithms are language-specific and differ in performance and accuracy. A wide range of methods can be used, such as affix removal stemming, n-gram stemming, and table lookup stemming. A critical preprocessing step of NLP is removing punctuation, which, because it separates content into sentences, paragraphs, and phrases, affects the results of any text processing method, especially methods that rely on the frequency of words and phrases, since punctuation occurs so often in text.
Before any NLP processing, the most commonly used terms, the stop words, are removed. A group of frequently used words that carry little information on their own, such as articles, determiners, and prepositions, are called stop words. By eliminating these words from the text, we can focus on the important words.
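The three steps above can be sketched as follows. This is a deliberately naive, self-contained illustration: the hand-rolled stop word list and suffix stripper are stand-ins for real language-specific tools (such as NLTK's stemmers and stop word corpora), not the method actually used in this work.

```python
import string

# A small illustrative stop word list; real systems use much fuller,
# language-specific lists.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}

def simple_stem(word):
    """Naive affix-removal stemming: strip a few common English suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, remove punctuation, drop stop words, then stem."""
    # Punctuation removal: delete every punctuation character.
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.lower().split()
    return [simple_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The doctors, reviewing scans, predicted outcomes."))
# → ['doctor', 'review', 'scan', 'predict', 'outcom']
```

Note how the naive stemmer produces non-words such as "outcom"; this is typical of affix removal stemming, which trades linguistic accuracy for speed.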
Significance of using Random Forest
Whether you have a regression task or a classification task, random forest is a suitable model for the problem. It can handle binary, categorical, and numerical features, and hardly any preprocessing is required: the data need not be rescaled or transformed.
Random forests are parallelizable, meaning the work can be split across multiple cores or machines, which shortens training time. By contrast, boosted models are sequential and take longer to compute. In Python's scikit-learn, passing the parameter n_jobs=-1 uses every available processor core. Random forests also scale well to large, high-dimensional data.
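As a concrete illustration of the parallelism point, the sketch below trains a scikit-learn random forest with n_jobs=-1 so trees are grown on all available cores. The synthetic data is a stand-in for illustration only, not the dataset used in this work.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for a medical dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# n_jobs=-1 asks scikit-learn to fit the trees on all available cores.
clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
clf.fit(X_train, y_train)
print(round(clf.score(X_test, y_test), 3))
```

Because the trees are independent of one another, fitting them in parallel changes nothing about the resulting model, only the wall-clock training time.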
Training is fast because only a subset of the features is considered at each split, so we can easily use hundreds of features. Prediction is significantly faster than training because the fitted forest can be saved and reused. Random forest deals with outliers essentially by binning them, and it is also indifferent to nonlinear features.
It has methods for balancing errors in class-imbalanced data sets. Random forest tries to minimize the overall error rate, so when the data set is unbalanced, the larger class obtains a lower error rate while the smaller class obtains a higher one. Each individual decision tree has high variance and low bias; because the random forest averages over all the trees, it averages the variance as well, giving a model with low bias and moderate variance.
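One way to act on the error-balancing point above, assuming scikit-learn is the implementation in use, is its class_weight="balanced" option, which reweights classes inversely to their frequency so the minority class is not simply sacrificed to the overall error rate. A minimal sketch on synthetic imbalanced data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Imbalanced synthetic data: roughly 90% negatives, 10% positives.
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=0
)

# class_weight="balanced" weights each class inversely to its frequency,
# counteracting the forest's bias toward the majority class.
clf = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", n_jobs=-1, random_state=0
)
clf.fit(X, y)
print(clf.classes_)
```

Alternatives with the same goal include resampling the training set or using class_weight="balanced_subsample", which recomputes the weights per bootstrap sample.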
As with any algorithm, there are advantages and disadvantages to using it. Among the advantages of random forest for classification and regression: the algorithm does not depend on any single model, because there are many trees and each tree is trained on a subset of the data.
The random forest algorithm relies on the strength of the "crowd"; therefore, the overall variance of the algorithm is reduced. The algorithm is very stable: even if new data points are introduced into the data set, the overall model is not affected much, because new data may affect one tree but is unlikely to change all trees.
The random forest algorithm works well with both categorical and numerical features. It can also work well when data have missing values or are not scaled proportionally (although we have scaled the features in this work for demonstration purposes only).
Drawbacks
Interpretability: random forest models are not easy to interpret; they behave like black boxes. For large data sets, the trees can take up a lot of memory. The model can overfit, so the hyperparameters should be tuned; random forests have been observed to overfit on certain data sets with noisy classification/regression tasks. The algorithm is more complicated than a single decision tree and requires substantial computation; due to this complexity, it requires more training time than comparable algorithms.
Materials and methods
Data
The model was trained and evaluated on the PLCO dataset. This data set was generated as part of a randomized, controlled, prospective study of the effectiveness of different prostate, lung, colorectal, and ovarian cancer screenings. Participants enrolled in the study and filled out a baseline questionnaire detailing their previous and current health status. All processing of this data set was done in Python (version 3.6.7).
We initially downloaded the data of all women from the PLCO data set; it comprises 78,215 women aged 50-78. We excluded women who met any of the following conditions:
1. Missing data on whether they had been diagnosed with breast cancer and the time of diagnosis
2. Were diagnosed with breast cancer before the baseline questionnaire
3. Did not self-identify as white, black, or Hispanic
4. Identified as Hispanic but had no information about place of birth
5. Missing data for the 13 selected predictors
We excluded women who had been diagnosed with breast cancer before the baseline questionnaire because BCRAT is not valid for women with a personal history of breast cancer.
BCRAT is also not suitable for women who have received chest radiotherapy, carry BRCA1 or BRCA2 gene mutations, or have lobular carcinoma in situ, ductal carcinoma in situ, or other rare cancer-predisposing conditions, such as Li-Fraumeni syndrome. Since the PLCO data set contains no data on these conditions, we assume they do not apply to any women in the data set. Since only the PLCO white, black, and Hispanic race/ethnicity categories match the BCRAT implementation we used, we excluded subjects based on self-identified race/ethnicity.
We do not include subjects who identify as Hispanic but lack data on their place of birth, because the BCRAT implementation applies different breast cancer composite incidence rates for US-born and foreign-born Hispanic women. After removing subjects based on these conditions, the cohort was reduced to 64,739 women.
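The exclusion criteria above amount to a sequence of boolean filters. The sketch below applies them with pandas on a toy frame; every column name here is a hypothetical stand-in for the actual PLCO field names, and criterion 1 (missing diagnosis data) is folded into the predictor-completeness check for brevity.

```python
import pandas as pd

# Toy data frame; column names are illustrative, not the PLCO schema.
df = pd.DataFrame({
    "dx_before_baseline": [0, 1, 0, 0, 0],  # diagnosed before baseline?
    "race": ["white", "black", "asian", "hispanic", "hispanic"],
    "birthplace": [None, None, None, "US", None],  # needed for Hispanic women
    "age_menarche": [3.0, 2.0, 4.0, 1.0, 2.0],     # one selected predictor
})

keep = (
    df["dx_before_baseline"].eq(0)                            # criterion 2
    & df["race"].isin(["white", "black", "hispanic"])         # criterion 3
    & ~(df["race"].eq("hispanic") & df["birthplace"].isna())  # criterion 4
    & df["age_menarche"].notna()                              # criteria 1/5
)
cohort = df[keep]
print(len(cohort))  # → 2 (rows 0 and 3 survive all filters)
```

Combining the masks with `&` before indexing keeps the filtering in one pass and makes each criterion individually auditable.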
We trained a set of machine learning models on five of the usual seven BCRAT inputs. These five inputs, namely age, age at menarche, age at first live birth, number of first-degree relatives with breast cancer, and race/ethnicity, are the only traditional BCRAT inputs available in the PLCO data set. We compared the machine learning models against BCRAT given these same five inputs.
Our models with a broader set of predictors take as input the five BCRAT variables and eight additional factors. These other predictors were selected based on their availability in the PLCO data set and their correlation with breast cancer risk: age at menopause, an indicator of current hormone use, years of hormone use, BMI, pack-years of smoking, years of birth control use, number of live births, and a personal cancer history indicator.
To facilitate training and testing of the model, we made limited modifications to the predictor variables. First, we assigned appropriate numeric values to the categorical variables. The PLCO data set records age at menarche, age at first live birth, age at menopause, years of hormone use, and years of birth control use as categorical variables. For example, the age at menarche variable is coded as 1 = under 10 years old, 2 = 10-11, 3 = 12-13, 4 = 14-15, and 5 = 16 or older. For a categorical value that represents a maximum age/years or below (for example, under ten years old), we set the variable to that maximum value (for example, ten years old).
For values that represent a range strictly below a maximum (for example, less than ten years old), we set the variable equal to the upper limit of the range. Similarly, for values representing a minimum age/years or above (16 years old or above), we set the variable to that minimum value (for example, 16 years old). For values covering a closed range (for example, 12-13 years old), we set the variable to the midpoint of the range (for example, 12.5 years old).
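The recoding rules above can be sketched as a lookup table. The code values mirror the age-at-menarche example in the text, but the dictionary below is illustrative; it is not the actual PLCO codebook.

```python
# Hypothetical recoding of the age-at-menarche categories described above.
MENARCHE_AGE = {
    1: 10.0,  # "under 10": range below a maximum -> use that maximum
    2: 10.5,  # "10-11": closed range -> midpoint
    3: 12.5,  # "12-13": closed range -> midpoint
    4: 14.5,  # "14-15": closed range -> midpoint
    5: 16.0,  # "16 or older": range above a minimum -> use the minimum
}

def recode(code):
    """Map a categorical code to a numeric age in years."""
    return MENARCHE_AGE[code]

print([recode(c) for c in [1, 3, 5]])  # → [10.0, 12.5, 16.0]
```

The same pattern (one dictionary per variable) extends to the other categorical predictors, with the open-ended categories handled by the boundary rules stated in the text.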
After recoding the categorical variables, we made some adjustments to the age at first live birth and race/ethnicity variables fed into the machine learning models. For the BCRAT model, we set the age at first live birth of nulliparous women to 98 (as specified by the "BCRA" package (version 2.1) in R (version 3.4.3), which implements BCRAT) and provided different race/ethnicity category values for foreign-born and US-born Hispanic women. For the machine learning models, we set the age at first live birth of nulliparous women to their current age, and used two indicator variables to represent race/ethnicity: one indicator for white women and one for black women. Each woman is classified as exactly one race/ethnicity (white, black, or Hispanic); therefore, beyond the white and black indicators, no Hispanic indicator is needed, and a Hispanic woman's white and black indicators are both 0. For the machine learning models, we did not distinguish between Hispanic women born in the United States and those born abroad.
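The two-indicator race/ethnicity encoding described above can be sketched as follows; the function and key names are our own, chosen for illustration.

```python
def race_indicators(race):
    """Encode three mutually exclusive race/ethnicity categories with
    two indicators. Hispanic is the implicit reference category, so
    both indicators are 0 for Hispanic women."""
    return {
        "is_white": int(race == "white"),
        "is_black": int(race == "black"),
    }

print(race_indicators("white"))     # → {'is_white': 1, 'is_black': 0}
print(race_indicators("hispanic"))  # → {'is_white': 0, 'is_black': 0}
```

This is the standard dummy-coding scheme: k mutually exclusive categories need only k-1 indicators, since the reference category is identified by all indicators being zero.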

 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf
 

Introduction

Random forest is one of the most successful ensemble methods, with performance on a par with boosting and support vector machines. It is fast, robust to noise, does not overfit, and offers ways to interpret and visualize its output. We will study options for increasing the strength of the individual trees in the forest and for reducing the correlation between them. Using several attribute evaluation methods instead of a single one produces promising results, and in most comparable cases, replacing ordinary voting with weighted margin voting yields statistically significant improvements across multiple data sets.

Machine learning (ML) is becoming more and more important, and with the rapid growth of medical data in volume and quality it has become a key technology in healthcare. However, because healthcare data are complex, incomplete, and high-dimensional, early and accurate detection of disease remains a challenge. Data preprocessing is an essential step in machine learning; its primary purpose is to provide well-prepared data that improves prediction accuracy. This dissertation summarizes the available data preprocessing steps based on their usage, popularity, and the literature. The selected preprocessing methods are then applied to the raw data, and the processed data are passed to a classifier for prediction.

Data mining faces the challenge of discovering structured knowledge in large data streams in order to support management decision-making. Although research in operations research, direct marketing, and machine learning concentrates on the analysis and design of data mining algorithms, the relationship between data mining and the preceding stage of data preprocessing has not been studied in detail.
This paper considers the effects of different preprocessing techniques, namely attribute scaling, sampling, categorical coding, and continuous attribute coding, on the performance of decision trees, neural networks, and support vector machines.
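As a small illustration of two of these techniques, categorical coding and continuous attribute coding, the sketch below one-hot encodes a nominal attribute and applies equal-width discretization to a numeric one. The data and function names are invented for this example; they are not from the study itself.

```python
def one_hot(values, categories):
    """Categorical coding: one indicator column per category."""
    return [[1 if v == c else 0 for c in categories] for v in values]

def discretize(values, n_bins):
    """Continuous attribute coding: equal-width binning into n_bins intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # Clamp the maximum value into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

colors = ["red", "green", "red", "blue"]
ages = [21.0, 35.0, 50.0, 64.0]

print(one_hot(colors, ["red", "green", "blue"]))
# [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
print(discretize(ages, 3))  # [0, 0, 2, 2]
```

Which coding a method prefers varies: tree learners handle raw categories well, while distance- and margin-based methods tend to need the indicator form.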
Problem statement

We use machine learning to predict breast cancer cases from patient treatment history and health data, using the Wisconsin Breast Cancer dataset. Breast cancer is a leading cause of death among women, and breast cancer risk prediction can inform screening and preventive measures. Recent studies have found that adding inputs to the widely used Gail model can improve its ability to predict breast cancer risk. However, these models use simple statistical designs, and the additional inputs come from expensive and invasive procedures. In contrast, we want to develop a machine learning model that uses personal health data to predict breast cancer risk over more than five years. This requires one model that uses only the Gail model inputs and another that uses the Gail model inputs together with other personal health data related to breast cancer risk.

The primary objectives of cancer prediction differ from those of cancer detection and diagnosis. Cancer prediction/prognosis is concerned with three basic prediction tasks: 1) cancer susceptibility prediction (i.e., risk assessment), 2) cancer recurrence prediction, and 3) cancer survival prediction. In the first case, one tries to predict the likelihood of developing a particular type of cancer before it occurs. In the second, one tries to predict the chance of the cancer returning after the disease has apparently gone. In the third, one tries to predict the outcome (life expectancy, survival, progression, drug tumour sensitivity) after the disease is diagnosed. In the last two cases, the success of prognostic prediction depends in part on the success and quality of the diagnosis; nevertheless, prognosis can only be made after clinical diagnosis, and prognostic prediction must take into account more than a simple diagnosis.

Through a multifactor analysis of variance over different performance indicators and method parameters, it is possible to show empirically that data preprocessing significantly affects prediction accuracy, and that specific combinations prove inferior to competing ones. It is also found that: (1) the selected methods are sensitive to different data representations, just as they are to method parameterization, which shows the potential for improving performance through effective preprocessing; (2) the influence of a preprocessing scheme depends on the method, and indicators tuned with method-specific "best practice" settings can improve a particular method's results considerably; (3) the sensitivity of an algorithm to preprocessing is therefore a necessary criterion for method evaluation and selection, and in predictive data mining it must be weighed alongside the traditional indicators of predictive power and computational efficiency.

To maximize prediction accuracy in data mining, machine learning research focuses mainly on building competitive classifiers and tuning algorithm parameters effectively. This is usually tested in extensive benchmark experiments, using preprocessed data sets to evaluate the impact on prediction accuracy and computational efficiency. In contrast, while feature selection, resampling, and discretization of continuous attributes have been studied in detail, few published surveys examine how the coding of categorical attributes and scaling affect predictions. More critically, in data mining, and especially in the medical field, there is no precise analysis of how these choices interact with prediction accuracy.
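To make the sensitivity claim concrete, here is a toy sketch with synthetic data (not from the benchmark experiments; a 1-nearest-neighbour classifier is used for illustration rather than the paper's decision trees, neural networks, or support vector machines) in which rescaling the attributes flips the prediction:

```python
import math

def nearest(query, points):
    """Return the label of the training point closest to `query` (1-NN)."""
    return min(points, key=lambda p: math.dist(query, p[0]))[1]

def scale_cols(rows):
    """Min-max scale each column to the [0, 1] range."""
    cols = list(zip(*rows))
    los = [min(c) for c in cols]
    his = [max(c) for c in cols]
    return [tuple((v - lo) / (hi - lo) for v, lo, hi in zip(r, los, his))
            for r in rows]

# Two features on very different scales: the second dominates raw distances.
train = [((1.0, 1000.0), "A"), ((2.0, 2000.0), "B")]
query = (1.1, 1600.0)

print(nearest(query, train))  # "B": the large-range feature decides

# After scaling both features to [0, 1], the first feature matters again.
rows = [p for p, _ in train] + [query]
scaled = scale_cols(rows)
scaled_train = list(zip(scaled[:2], [l for _, l in train]))
print(nearest(scaled[2], scaled_train))  # "A": the prediction flips
```

The same mechanism is why distance- and margin-based methods are far more preprocessing-sensitive than tree learners, whose splits are invariant to monotone rescaling.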
3.1. Preprocessing methods

This dissertation considers the three main standard preprocessing steps of NLP: stemming, punctuation removal, and stop word removal. In stemming, we obtain the stem form of each word in the data set, the part of the word to which affixes can be attached. Stemming algorithms are language-specific and differ in performance and accuracy. A wide range of methods can be used, such as affix-removal stemming, n-gram stemming, and table lookup stemming. Another critical NLP preprocessing step is punctuation removal: because punctuation separates content into sentences, paragraphs, and phrases, and occurs so frequently in text, it affects the results of any text processing method, especially one that relies on the frequency of words and phrases. Stop words are removed before any other NLP processing. Stop words are frequently used words that carry little information on their own, such as articles, determiners, and prepositions. By eliminating these words from the content, we can focus on the words that matter.

Significance of using Random Forest

Whether you have a regression task or a classification task, a random forest is a suitable model. It can handle binary, categorical, and numeric features, and hardly any preprocessing is required: the data need not be rescaled or transformed. Random forests are parallelizable, meaning training can be split across several machines or cores, which shortens computation time. Boosted models, by contrast, are sequential and take longer to train. In Python, training can be run on all available cores by setting the parameter n_jobs=-1.

Random forests also cope well with high-dimensional data. Training each tree is fast because each split considers only a subset of the features, so we can easily use hundreds of features. Prediction is significantly faster than training, because the trained forest can be saved and reused. Random forests deal with outliers by essentially binning them, and they are insensitive to nonlinear features. They also offer ways to balance errors on imbalanced classes: a random forest minimizes the overall error rate, so on an imbalanced data set the larger class receives a lower error rate and the smaller class a higher one. Each individual decision tree has high variance and low bias; because the forest averages over all of its trees, the variance is averaged away as well, leaving a model with low bias and moderate variance.

As with any algorithm, there are advantages and disadvantages to using it. The random forest algorithm does not depend on any single model, because there are many trees and each is trained on a subset of the data; the algorithm draws on the strength of the "ensemble", so its overall variance is reduced. The algorithm is also very stable: if new data points are introduced into the data set, the overall model is not affected much, because new data may change one tree but is unlikely to change all of them. Random forests work well with both categorical and numeric features, and they can also work well when the data have missing values or are not scaled proportionally.
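The points above can be sketched with scikit-learn, using the Wisconsin breast cancer data bundled with the library; this is an illustrative configuration, not the dissertation's exact pipeline. Note n_jobs=-1, which parallelizes tree construction across all available cores, and that the raw features are used without rescaling:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Raw features: no rescaling or transformation is needed for a forest.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# n_jobs=-1 builds the trees in parallel on all available CPU cores.
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))  # held-out accuracy, typically above 0.9 here
```

The fitted forest can then be persisted (e.g. with pickle), which is why prediction on new data is much cheaper than training.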
(Although we scaled the features in this article, that was only for demonstration purposes.)

Drawbacks

Interpretability: random forest models are not easy to interpret; they behave like black boxes. For large data sets, the trees can take up a lot of memory. Random forests can overfit, so the hyperparameters should be tuned; they have been observed to overfit on certain noisy classification and regression tasks. They are more complex than a single decision tree, require substantially more computation, and, because of that complexity, need longer training time than comparable algorithms.

Materials and methods

Data

The model was trained and evaluated on the PLCO dataset. This data set was generated as part of a randomized, controlled, prospective study of the effectiveness of different prostate, lung, colorectal, and ovarian cancer screenings. Participants enrolled in the study and filled out a baseline questionnaire detailing their previous and current health status. All processing of this data set was done in Python (version 3.6.7). We first extracted the data for all women in the PLCO data set, giving 78,215 women aged 50-78. We excluded women who met any of the following conditions:

1. Missing data on whether they had been diagnosed with breast cancer, or on the time of diagnosis
2. Diagnosed with breast cancer before the baseline questionnaire
3. Did not self-identify as white, black, or Hispanic
4. Identified as Hispanic but had no information about place of birth
5. Missing data for any of the 13 selected predictors

We excluded women diagnosed with breast cancer before the baseline questionnaire because BCRAT is not suited to women with a personal history of breast cancer. BCRAT is also unsuitable for women who have received chest radiotherapy, who carry BRCA1 or BRCA2 gene mutations, or who have lobular carcinoma in situ, ductal carcinoma in situ, or other rare cancer-predisposing syndromes such as Li-Fraumeni syndrome. Since the PLCO data set contains no data for these conditions, we assume they do not apply to any women in the data set. Because only the PLCO white, black, and Hispanic race/ethnicity categories match the BCRAT implementation we used, we excluded subjects based on self-identified race/ethnicity. We excluded subjects who identified as Hispanic but lacked place-of-birth data because the BCRAT implementation uses different breast cancer composite rates for US-born and foreign-born Hispanic women. After removing subjects under these conditions, 64,739 women remained.

We trained a set of machine learning models on five of BCRAT's usual seven inputs. These five inputs, age, age at menarche, age at first live birth, number of first-degree relatives with breast cancer, and race/ethnicity, are the only traditional BCRAT inputs available in the PLCO data set. We compared these machine learning models against BCRAT given the same five inputs.
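The exclusion criteria above can be sketched as a filter over patient records. The field names and sample records below are invented for illustration and do not match the actual PLCO schema:

```python
# Hypothetical records; field names are illustrative, not PLCO's actual columns.
women = [
    {"id": 1, "bc_diagnosed": False, "diag_days": None, "race": "white",
     "hispanic_birthplace_known": True, "missing_predictors": 0},
    {"id": 2, "bc_diagnosed": True, "diag_days": -100, "race": "black",
     "hispanic_birthplace_known": True, "missing_predictors": 0},
    {"id": 3, "bc_diagnosed": False, "diag_days": None, "race": "asian",
     "hispanic_birthplace_known": True, "missing_predictors": 0},
    {"id": 4, "bc_diagnosed": False, "diag_days": None, "race": "hispanic",
     "hispanic_birthplace_known": False, "missing_predictors": 2},
]

def eligible(w):
    """Apply the five exclusion criteria; True means the woman is kept."""
    if w["bc_diagnosed"] is None:                        # 1. diagnosis status unknown
        return False
    if w["bc_diagnosed"] and w["diag_days"] < 0:         # 2. diagnosed before baseline
        return False
    if w["race"] not in ("white", "black", "hispanic"):  # 3. other race/ethnicity
        return False
    if w["race"] == "hispanic" and not w["hispanic_birthplace_known"]:  # 4.
        return False
    if w["missing_predictors"] > 0:                      # 5. missing predictor data
        return False
    return True

cohort = [w for w in women if eligible(w)]
print([w["id"] for w in cohort])  # [1]
```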
Our model with a broader set of predictors takes the five BCRAT inputs plus eight additional factors. These additional predictors were selected for their availability in the PLCO data set and their association with breast cancer risk: age at menopause, an indicator of current hormone use, years of hormone use, BMI, pack-years of smoking, years of birth control use, number of live births, and an indicator of personal cancer history.

To facilitate training and testing of the model, we made limited modifications to the predictor variables. First, we assigned appropriate numeric values to the categorical variables. The PLCO data set records age at menarche, age at first live birth, age at menopause, years of hormone use, and years of birth control use as categorical variables. For example, the age-at-menarche variable is coded as 1 for under 10 years old, 2 for 10-11, 3 for 12-13, 4 for 14-15, and 5 for 16 or older. For a category bounded only from above (for example, under ten years old), we set the variable to that bound (ten years). For a category bounded only from below (for example, 16 years old or older), we set it to that bound (16 years). For a closed range (for example, 12-13 years old), we set the variable to the midpoint of the range (12.5 years). After modifying the categorical variables, we made some adjustments to the age-at-first-live-birth and race/ethnicity variables supplied to the machine learning models.
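The category-to-value recoding described above can be sketched as a lookup table. The variable name is hypothetical, but the codes and recoded values follow the age-at-menarche example in the text:

```python
# Hypothetical recoding table for the PLCO age-at-menarche categories.
# Codes: 1 = under 10, 2 = 10-11, 3 = 12-13, 4 = 14-15, 5 = 16 or older.
MENARCHE_AGE = {
    1: 10.0,   # "under 10": set to the upper bound of the open range
    2: 10.5,   # closed ranges: set to the midpoint of the interval
    3: 12.5,
    4: 14.5,
    5: 16.0,   # "16 or older": set to the lower bound of the open range
}

codes = [1, 3, 5, 2]
print([MENARCHE_AGE[c] for c in codes])  # [10.0, 12.5, 16.0, 10.5]
```

The same three rules (upper bound, midpoint, lower bound) generate the tables for the other categorical age and duration variables.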
For the BCRAT model, we set the age-at-first-live-birth variable of nulliparous women to 98 (as the implementation of BCRAT in the "BCRA" R package (version 2.1; R version 3.4.3) specifies) and used distinct race/ethnicity category values for foreign-born and US-born Hispanic women. For the machine learning models, we set the age-at-first-live-birth variable of nulliparous women to their current age, and represented race/ethnicity with two indicator variables, one for white women and one for black women. Each woman is classified under exactly one race/ethnicity (white, black, or Hispanic), so no separate Hispanic indicator is needed beyond the white and black indicators: a Hispanic woman has both the white and black indicators set to 0. For the machine learning models, we did not distinguish between Hispanic women born in the United States and those born abroad.
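The two-indicator race/ethnicity encoding can be sketched as follows (the function name is invented for illustration):

```python
def race_indicators(race):
    """Encode race/ethnicity as two indicators; Hispanic women get 0 for both."""
    return {"white": int(race == "white"), "black": int(race == "black")}

print(race_indicators("white"))     # {'white': 1, 'black': 0}
print(race_indicators("hispanic"))  # {'white': 0, 'black': 0}
```

Dropping the redundant third indicator is the standard way to avoid perfect collinearity among dummy variables when the categories are mutually exclusive.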