Machine Learning, Key to Your Formulation Challenges
Marc Borowczak, PRC Consulting LLC (http://www.prcconsulting.net)
February 17, 2016
Formulation Challenges are Everywhere…
Step 1: Retrieve Existing Data
Step 2: Normalize Data
Step 3: Train and Test a Model
Step 4: Evaluate Model Performance
Step 5: Improving Model Performance with 5 Hidden Neurons
Step 6: Improving Model using Random Forest Algorithm
Step 7: Testing Further with Resampling
Step 8: Actual Display of a Random Forest Tree Solution
Conclusions
References
Formulation Challenges are Everywhere…
If you develop pharmaceutical, cosmetic, food, industrial or civil-engineered products, you are often confronted
with the challenge of blending and formulating to meet process or performance targets. Traditional
Research and Development approaches the problem through experimentation, which generally involves designed
experiments run under time and resource constraints. It can be slow and expensive, and its results are often
redundant, quickly forgotten or obsolete.
Consider the alternative that Machine Learning tools offer today. We will show that this approach is not only quick
and efficient but, in our view, the way the Front End of Innovation should proceed, and that it is particularly
suited to formulation and classification problems.
Today, we will explain how Machine Learning can shed new light on this generic and very persistent formulation
challenge. We will discuss the other important aspect, the classification and clustering often associated with these
formulation challenges, in a forthcoming communication.
Step 1: Retrieve Existing Data
To illustrate the approach, we selected a formulation dataset hosted on the UCI Machine Learning Repository
(http://archive.ics.uci.edu/ml/datasets.html) and use it to predict how concrete compressive strength
(http://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength) depends on the formulation ingredients.
This is the well-known composition-property relationship that scientists, engineers and business professionals
must address daily, and any established R&D organization would certainly have similar, sometimes hidden, knowledge
in its archives…
We will use R to demonstrate the approach quickly on this dataset
(http://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/Concrete_Data.xls), and also to show
how reproducibility of the analysis is supported: the analysis tool and platform are documented, all libraries are
clearly listed, and the data is retrieved programmatically and date-stamped from the repository.
Machine Learning, Key to Your Formulation Challenges file:///P:/MachineLearningExamples/Machine_Learning_Formulation.html
1 of 11 2/18/2016 5:07 PM
Sys.info()[1:5]
## sysname release version nodename machine
## "Windows" "7 x64" "build 9200" "STALLION" "x86-64"
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 8 x64 (build 9200)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] reshape_0.8.5 scales_0.3.0 devtools_1.8.0
## [4] rpart.plot_1.5.3 rpart_4.1-10 randomForest_4.6-10
## [7] neuralnet_1.32 MASS_7.3-44 caret_6.0-52
## [10] ggplot2_1.0.1 lattice_0.20-33 stringr_1.0.0
## [13] xlsx_0.5.7 xlsxjars_0.6.1 rJava_0.9-7
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.0 git2r_0.11.0 formatR_1.2
## [4] nloptr_1.0.4 plyr_1.8.3 iterators_1.0.7
## [7] tools_3.2.2 digest_0.6.8 lme4_1.1-9
## [10] memoise_0.2.1 evaluate_0.7.2 gtable_0.1.2
## [13] nlme_3.1-121 mgcv_1.8-9 Matrix_1.2-2
## [16] foreach_1.4.2 curl_0.9.3 parallel_3.2.2
## [19] yaml_2.1.13 SparseM_1.7 brglm_0.5-9
## [22] proto_0.3-10 xml2_0.1.1 BradleyTerry2_1.0-6
## [25] knitr_1.11 rversions_1.0.2 MatrixModels_0.4-1
## [28] gtools_3.5.0 stats4_3.2.2 nnet_7.3-11
## [31] rmarkdown_0.9.2 minqa_1.2.4 reshape2_1.4.1
## [34] car_2.0-26 magrittr_1.5 codetools_0.2-14
## [37] htmltools_0.2.6 splines_3.2.2 pbkrtest_0.4-2
## [40] colorspace_1.2-6 quantreg_5.18 stringi_0.5-5
## [43] munsell_0.4.2
# Packages used throughout the analysis
library(xlsx)        # read.xlsx()
library(stringr)     # word()
library(caret)       # createDataPartition(), train()
library(neuralnet)   # neuralnet(), compute()
library(devtools)
library(rpart)       # rpart()
library(rpart.plot)  # prp()
userdir <- getwd()
datadir <- "./data"
# Create a local data directory if it does not exist yet
if (!file.exists("data")){dir.create("data")}
fileUrl <- "http://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/Concrete_Data.xls?accessType=DOWNLOAD"
# Download the workbook and record when it was retrieved
download.file(fileUrl,destfile="./data/Concrete_Data.xls",method="curl")
dateDownloaded <- date()
concrete <- read.xlsx("./data/Concrete_Data.xls",sheetName="Sheet1")
str(concrete)
## 'data.frame': 1030 obs. of 9 variables:
## $ Cement..component.1..kg.in.a.m.3.mixture. : num 540 540 332 332 199 ...
## $ Blast.Furnace.Slag..component.2..kg.in.a.m.3.mixture.: num 0 0 142 142 132 ...
## $ Fly.Ash..component.3..kg.in.a.m.3.mixture. : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Water...component.4..kg.in.a.m.3.mixture. : num 162 162 228 228 192 228 228 228 228 228 ...
## $ Superplasticizer..component.5..kg.in.a.m.3.mixture. : num 2.5 2.5 0 0 0 0 0 0 0 0 ...
## $ Coarse.Aggregate...component.6..kg.in.a.m.3.mixture. : num 1040 1055 932 932 978 ...
## $ Fine.Aggregate..component.7..kg.in.a.m.3.mixture. : num 676 676 594 594 826 ...
## $ Age..day. : num 28 28 270 365 360 90 365 28 28 28 ...
## $ Concrete.compressive.strength.MPa..megapascals.. : num 80 61.9 40.3 41.1 44.3 ...
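The date stamp recorded at download time can be complemented with a size and a checksum, so that a later run can tell whether the repository file has changed. The following is a minimal sketch using only base R; the helper name stamp_file() and the temporary demo file are assumptions made for illustration and are not part of the original analysis.

```r
# Hypothetical helper: record size, MD5 checksum and timestamp for a file.
stamp_file <- function(path) {
  stopifnot(file.exists(path))
  data.frame(
    file     = basename(path),
    bytes    = file.size(path),
    md5      = unname(tools::md5sum(path)),   # checksum of the file contents
    recorded = format(Sys.time(), "%Y-%m-%d %H:%M:%S %Z"),
    stringsAsFactors = FALSE
  )
}

# Demonstration on a throwaway file rather than the real workbook
demo <- tempfile(fileext = ".txt")
writeLines("example", demo)
stamp <- stamp_file(demo)
print(stamp)
```

In the analysis itself, the same call could be applied to "./data/Concrete_Data.xls" right after download.file().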
Step 2: Normalize Data
The dataset contains 1030 observations of 9 variables: 8 inputs, of which 7 are ingredients and 1 is a process
attribute (Age), and 1 output, the strength property. There are no missing values in this set. We’ll shorten the
variable names to their first word, min-max normalize every column to the [0, 1] range, and display a summary of the
normalized strength.
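# Min-max scaling: maps each column onto the [0, 1] range
normalize <- function(x) {return((x - min(x)) / (max(x) - min(x)))}
# Replace the literal dots in the imported names with spaces;
# fixed=TRUE is needed because "." as a regular expression matches every character
names(concrete)<-gsub("."," ",names(concrete),fixed=TRUE)
# Keep only the first word of each name (Cement, Blast, Fly, ...)
names(concrete)<-word(names(concrete),1)
names(concrete)[9]<-"Strength"
concrete_norm <- as.data.frame(lapply(concrete, normalize))
summary(concrete_norm$Strength)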
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2663 0.4000 0.4172 0.5457 1.0000
These transformations should be typical of a generic formulation where ingredients and process variables are
independent or input variables, and property is a dependent or output variable.
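Because the models are trained on normalized values, predictions come back on the [0, 1] scale. A small inverse transformation, sketched below under the assumption that the original (unscaled) column is still available, converts them back to physical units such as MPa. The function name denormalize() is introduced here for illustration, and the sample values are the first five strengths shown in the str() output above.

```r
# Min-max scaling as used in the analysis
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

# Hypothetical inverse: map normalized values back to the original units,
# using the minimum and maximum of the original column
denormalize <- function(x_norm, x_orig) {
  x_norm * (max(x_orig) - min(x_orig)) + min(x_orig)
}

strength_mpa  <- c(80, 61.9, 40.3, 41.1, 44.3)   # illustrative values, in MPa
strength_norm <- normalize(strength_mpa)
round_trip    <- denormalize(strength_norm, strength_mpa)
print(range(strength_norm))
print(round_trip)
```

The same scaling factor, max minus min of the original strength column, also converts an RMSE reported on the normalized scale back to MPa.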
Step 3: Train and Test a Model
The method we’ll follow now is a standard approach where we randomly split the data set into a training set and a
test set. The caret (https://cran.r-project.org/web/packages/caret/caret.pdf) package implements this task well. We’ll
use 75% of the data to train the model and the remainder to test it. To make the analysis reproducible, we’ll fix the
random seed with the set.seed() function.
set.seed(12121)
inTrain<-createDataPartition(y=concrete_norm$Strength,p=0.75,list=FALSE)
concrete_train <- concrete_norm[inTrain, ]
concrete_test <- concrete_norm[-inTrain, ]
concrete_model <- neuralnet(formula = Strength ~ Cement + Blast + Fly + Water + Superplasticizer + Coarse + Fine + Age, data = concrete_train)
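As we understand it, createDataPartition() does not draw a purely random sample for a numeric outcome: it groups the outcome by percentiles and samples within each group, so that the training and test sets cover a similar range of strengths. The base R sketch below illustrates that idea only; stratified_split() is a hypothetical helper, not caret's implementation, and the data are synthetic.

```r
# Hypothetical helper: sample a fraction p of the indices within each
# quantile-based bin of y, so every part of the range is represented.
stratified_split <- function(y, p = 0.75, groups = 5) {
  breaks <- unique(quantile(y, probs = seq(0, 1, length.out = groups + 1)))
  bins   <- cut(y, breaks = breaks, include.lowest = TRUE)
  idx <- unlist(lapply(split(seq_along(y), bins), function(i) {
    # index into i so that a bin holding a single row is handled correctly
    i[sample.int(length(i), size = ceiling(p * length(i)))]
  }))
  sort(unname(idx))
}

set.seed(1)
y        <- runif(200)            # synthetic stand-in for normalized strength
in_train <- stratified_split(y)
train_y  <- y[in_train]
test_y   <- y[-in_train]
print(c(train = length(train_y), test = length(test_y)))
```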
The network topology is then easily visualized; for details, see the excellent NeuralNetTool page
(https://beckmw.wordpress.com/tag/neural-network/). Suffice it to say that even this simple first attempt highlights
the main dependencies: higher impacts are highlighted with thicker links in the typical neural net
representation. Here the I’s represent inputs, O is the output, and H and B are the Hidden and Bias nodes as defined
in the theory. Note that a single bias node is added to the input layer and to each hidden layer; it supplies a
constant term, so a node can still produce a non-zero signal when the input features are equal to 0.
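To make the roles of the input, hidden, bias and output nodes concrete, here is a minimal sketch of the forward pass of such a network in base R. It assumes a logistic activation in the hidden layer and a linear output node, which to our knowledge are the neuralnet defaults for regression; all weights and inputs below are hypothetical values chosen for easy arithmetic, not weights from the fitted model.

```r
logistic <- function(z) 1 / (1 + exp(-z))

# x:        numeric vector of (normalized) inputs
# w_hidden: matrix of weights, one row per input, one column per hidden node
# b_hidden: bias added to each hidden node; b_out: bias added to the output
forward_pass <- function(x, w_hidden, b_hidden, w_out, b_out) {
  h <- logistic(as.vector(x %*% w_hidden) + b_hidden)  # hidden activations
  sum(h * w_out) + b_out                               # linear output node
}

w_hidden <- matrix(c(1.0, 2.0), nrow = 2)   # hypothetical: 2 inputs, 1 hidden node

# Weighted sum is 0.5*1 + 0.1*2 - 0.7 = 0, so the hidden node outputs 0.5
out <- forward_pass(c(Cement = 0.5, Age = 0.1), w_hidden,
                    b_hidden = -0.7, w_out = 1.5, b_out = 0.2)

# With all inputs at 0, only the bias terms drive the prediction
out_zero <- forward_pass(c(0, 0), w_hidden,
                         b_hidden = -0.7, w_out = 1.5, b_out = 0.2)
print(c(out = out, out_zero = out_zero))
```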
Step 4: Evaluate Model Performance
We will now compute predictions and compare them to actual strength and examine the correlation between
predicted and actual strength values.
model_results <- compute(concrete_model, concrete_test[1:8])
predicted_strength <- model_results$net.result
cor(predicted_strength, concrete_test$Strength)[1,1]
## [1] 0.8336352308
The default neural net, which has a single hidden neuron, exhibits a correlation of 0.8336352 between predicted and
actual strength on the test set. We can certainly try to improve it by including a few more hidden neurons…
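Correlation measures how well predictions track the actual values, but it does not penalize a systematic offset or a wrong scale. As a complement, error metrics such as RMSE and MAE can be computed on the same test-set vectors. The sketch below uses small made-up vectors to show the difference; the helper names rmse() and mae() are introduced here for illustration.

```r
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))

actual <- c(0.2, 0.4, 0.6, 0.8)
# Made-up predictions: perfectly correlated with the actual values,
# but compressed and shifted
biased <- actual / 2 + 0.3

metrics <- c(correlation = cor(actual, biased),
             rmse        = rmse(actual, biased),
             mae         = mae(actual, biased))
print(metrics)
```

In the analysis, the same functions could be applied to concrete_test$Strength and the predicted strengths of each model.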
Step 5: Improving Model Performance with 5 Hidden Neurons
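concrete_model2 <- neuralnet(formula = Strength ~ Cement + Blast + Fly + Water + Superplasticizer + Coarse + Fine + Age, data = concrete_train, hidden=5)
# plot.nnet() is not provided by the packages loaded above; it is assumed here to have
# been sourced beforehand from the NeuralNetTool page referenced in Step 3
plot.nnet(concrete_model2,cex.val=0.75)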
We observe that Age remains a key contributor, but the effects of Water, Superplasticizer, Cement and Blast are also
visibly ranked. Four of the five hidden nodes contribute about evenly…
model_results2 <- compute(concrete_model2, concrete_test[1:8])
predicted_strength2 <- model_results2$net.result
cor(predicted_strength2, concrete_test$Strength)[1,1]
## [1] 0.9138705528
# plot() draws the scatter plot as a side effect and returns NULL, so no assignment is needed
plot(concrete_test$Strength,predicted_strength2)
Step 6: Improving Model using Random Forest Algorithm
We will now rely on the Random Forest algorithm, called through the caret train() function, and attempt a model improvement.
model_result3 <- train(Strength ~ ., data = concrete_train,method='rf',prox=TRUE)
model_result3
## Random Forest
##
## 774 samples
## 8 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 774, 774, 774, 774, 774, 774, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared RMSE SD Rsquared SD
## 2 0.07833903729 0.8774837294 0.004515965257 0.01607292866
## 5 0.07159729020 0.8869237640 0.004510365293 0.01395064039
## 8 0.07365915696 0.8777421774 0.005197103383 0.01691711063
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.
predicted_strength3 <- predict(model_result3,concrete_test)
cor(predicted_strength3, concrete_test$Strength)
## [1] 0.943384811
The Random Forest algorithm, run with caret’s default settings (25 bootstrap resamples), improved our prediction and
exhibits a correlation of 0.9433848. Again, we can certainly try to improve further by changing the resampling
scheme… The caret package offers multiple methods to try out. We’ll just try one, 10-fold cross-validation, to give
an idea…
Step 7: Testing Further with Resampling
model_result4 <- train(Strength ~ ., method='rf',data = concrete_train,verbose=FALSE,trControl = trainControl(method="cv"))
model_result4
## Random Forest
##
## 774 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 696, 696, 695, 697, 698, 696, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared RMSE SD Rsquared SD
## 2 0.06910599 0.9068129 0.01101706 0.04013275
## 5 0.06227778 0.9121046 0.01232558 0.03945448
## 8 0.06287753 0.9089120 0.01208088 0.03835530
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.
predicted_strength4 <- predict(model_result4,concrete_test)
cor(predicted_strength4, concrete_test$Strength)
## [1] 0.9428245
We observe the prediction is practically unchanged in this case, with a correlation of 0.9428245. Still not bad for
a quick analysis performed on existing data. Regardless of our property target, we already derived key areas to
investigate deeper… and can clearly see some key ingredients (Cement, Blast, Fly, Superplasticizer, Water…)
and the Age process as determining factors to produce strength performance. So naturally, one may want to
display this model.
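To make the cross-validated resampling reported above more concrete, the sketch below runs a 10-fold cross-validation by hand in base R. It is an illustration under stated assumptions: a linear model (lm) stands in for the random forest to keep the example dependency-free, the data are synthetic, and kfold_rmse() is a hypothetical helper rather than what caret does internally.

```r
# Hypothetical helper: k-fold cross-validated RMSE for a model formula
kfold_rmse <- function(data, formula, k = 10) {
  # Assign every row to one of k folds at random
  folds    <- sample(rep(seq_len(k), length.out = nrow(data)))
  response <- all.vars(formula)[1]
  sapply(seq_len(k), function(i) {
    fit  <- lm(formula, data = data[folds != i, ])       # train on k-1 folds
    pred <- predict(fit, newdata = data[folds == i, ])   # predict held-out fold
    sqrt(mean((data[[response]][folds == i] - pred)^2))
  })
}

set.seed(42)
toy   <- data.frame(x = runif(100))
toy$y <- 2 * toy$x + rnorm(100, sd = 0.1)   # synthetic data

fold_rmse <- kfold_rmse(toy, y ~ x)
# Mean and spread across folds, analogous to the RMSE and RMSE SD columns above
print(c(mean = mean(fold_rmse), sd = sd(fold_rmse)))
```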
Step 8: Actual Display of a Random Forest Tree Solution
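It turns out that so-called black-box models are – well – meant to stay in their box! A random forest averages many
trees and cannot be drawn as a single diagram. However, the rpart
(https://cran.r-project.org/web/packages/rpart/rpart.pdf) and rpart.plot
(https://cran.r-project.org/web/packages/rpart.plot/rpart.plot.pdf) packages make it easy to fit a single regression
tree to the same training data and to visualize it, even when the tree is complex.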
strength.tree <- rpart(Strength ~ .,data=concrete_train, control=rpart.control(minsplit=20,cp=0.002))
prp(strength.tree,compress=TRUE)
In the tree, the normalized strength predicted for each leaf is shown in the ovals, and the leaves are ranked from
low to high strength from left to right.
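Each split in such a regression tree is a threshold on one variable, chosen so that the two resulting groups are as homogeneous as possible. The base R sketch below illustrates that search for a single variable by minimizing the within-group sum of squared errors. It is a simplified illustration on made-up data, with best_split() as a hypothetical helper, and not rpart's actual implementation.

```r
# Hypothetical helper: find the threshold on x that minimizes the total
# within-group sum of squared errors of y
best_split <- function(x, y) {
  xs <- sort(unique(x))
  if (length(xs) < 2) return(NULL)
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between distinct values
  sse <- sapply(cuts, function(cut) {
    left  <- y[x <  cut]
    right <- y[x >= cut]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(cut = cuts[which.min(sse)], sse = min(sse))
}

# Made-up example: strength jumps once age exceeds a threshold
age      <- c(1, 2, 3, 10, 11, 12)
strength <- c(1, 1, 1, 5, 5, 5)
split    <- best_split(age, strength)
print(split)
```

A full tree repeats this search over every input variable and then recursively within each resulting group, until stopping rules such as minsplit and cp are reached.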
Conclusions
We hope this typical example demonstrates that Machine Learning algorithms are well positioned to help
resolve formulation challenges, offering a fast, efficient and economical alternative to tedious experimentation. It
is easy to imagine how similar questions can be resolved in all types of R&D, in materials, cosmetics, food or
any scientific area.
Rubber formulations to minimize rolling resistance and emissions, modern composites to build renewable
energy sources or lightweight transportation vehicles and next-generation public transit, as well as innovative
UV-shield ointments and tasty snacks and drinks… all present similar challenges where only the nature of the
inputs and outputs varies. Therefore, the method can and should be applied broadly!
Next time, we’ll review how to address another common challenge: classification and clustering. Till then, we
hope this approach has triggered interest.
Why not try to implement Machine Learning in your own scientific or technical area of expertise? Remember, PRC
Consulting, LLC (http://www.prcconsulting.net) is dedicated to boosting innovation through improved Analytics, one
customer at a time!
References
The following sources are referenced as they provided significant help and information to develop this Machine
Learning analysis applied to formulations:
1. UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html)
2. caret (https://cran.r-project.org/web/packages/caret/caret.pdf)
3. NeuralNetTool (https://beckmw.wordpress.com/tag/neural-network/)
4. rpart (https://cran.r-project.org/web/packages/rpart/rpart.pdf)
5. rpart.plot (https://cran.r-project.org/web/packages/rpart.plot/rpart.plot.pdf)
6. RStudio (https://www.rstudio.com)

More Related Content

Viewers also liked

Deus de promessas
Deus de promessasDeus de promessas
Deus de promessas
Lucas Paula
 
Azizi fonctions des icones sous spss11.5
Azizi fonctions des icones sous spss11.5Azizi fonctions des icones sous spss11.5
Azizi fonctions des icones sous spss11.5
Souad Azizi
 
Historical Foundations of Management
Historical Foundations of ManagementHistorical Foundations of Management
Historical Foundations of Management
Leigh Canvas
 

Viewers also liked (13)

Shyrley n°15 5-c
Shyrley n°15 5-cShyrley n°15 5-c
Shyrley n°15 5-c
 
Inflation
InflationInflation
Inflation
 
Presentacion multimedia claudia hilares_5_c
Presentacion multimedia claudia hilares_5_cPresentacion multimedia claudia hilares_5_c
Presentacion multimedia claudia hilares_5_c
 
Deus de promessas
Deus de promessasDeus de promessas
Deus de promessas
 
Influencia del inetrnet en la sociedad silvia cardenas 5_c
Influencia del inetrnet en la sociedad silvia cardenas 5_cInfluencia del inetrnet en la sociedad silvia cardenas 5_c
Influencia del inetrnet en la sociedad silvia cardenas 5_c
 
Blog
BlogBlog
Blog
 
Azizi fonctions des icones sous spss11.5
Azizi fonctions des icones sous spss11.5Azizi fonctions des icones sous spss11.5
Azizi fonctions des icones sous spss11.5
 
Tax Incidence Webinar slides
Tax Incidence Webinar slidesTax Incidence Webinar slides
Tax Incidence Webinar slides
 
Historical Foundations of Management
Historical Foundations of ManagementHistorical Foundations of Management
Historical Foundations of Management
 
Qué es un dominio de Internet
Qué es un dominio de InternetQué es un dominio de Internet
Qué es un dominio de Internet
 
Incidence of tax
Incidence of taxIncidence of tax
Incidence of tax
 
Master degree Safety Resume
Master degree Safety ResumeMaster degree Safety Resume
Master degree Safety Resume
 
101 lecture 6
101 lecture 6101 lecture 6
101 lecture 6
 

Similar to Machine learning key to your formulation challenges

Computer Tools for Academic Research
Computer Tools for Academic ResearchComputer Tools for Academic Research
Computer Tools for Academic Research
Miklos Koren
 
B2 2005 introduction_load_testing_blackboard_primer_draft
B2 2005 introduction_load_testing_blackboard_primer_draftB2 2005 introduction_load_testing_blackboard_primer_draft
B2 2005 introduction_load_testing_blackboard_primer_draft
Steve Feldman
 
Den Datenschatz heben und Zeit- und Energieeffizienz steigern: Mathematik und...
Den Datenschatz heben und Zeit- und Energieeffizienz steigern: Mathematik und...Den Datenschatz heben und Zeit- und Energieeffizienz steigern: Mathematik und...
Den Datenschatz heben und Zeit- und Energieeffizienz steigern: Mathematik und...
Joachim Schlosser
 
Performance schema in_my_sql_5.6_pluk2013
Performance schema in_my_sql_5.6_pluk2013Performance schema in_my_sql_5.6_pluk2013
Performance schema in_my_sql_5.6_pluk2013
Valeriy Kravchuk
 
10 Ways To Improve Your Code( Neal Ford)
10  Ways To  Improve  Your  Code( Neal  Ford)10  Ways To  Improve  Your  Code( Neal  Ford)
10 Ways To Improve Your Code( Neal Ford)
guestebde
 

Similar to Machine learning key to your formulation challenges (20)

Computer Tools for Academic Research
Computer Tools for Academic ResearchComputer Tools for Academic Research
Computer Tools for Academic Research
 
B2 2005 introduction_load_testing_blackboard_primer_draft
B2 2005 introduction_load_testing_blackboard_primer_draftB2 2005 introduction_load_testing_blackboard_primer_draft
B2 2005 introduction_load_testing_blackboard_primer_draft
 
Open Source 101 2022 - MySQL Indexes and Histograms
Open Source 101 2022 - MySQL Indexes and HistogramsOpen Source 101 2022 - MySQL Indexes and Histograms
Open Source 101 2022 - MySQL Indexes and Histograms
 
10 Ways To Improve Your Code
10 Ways To Improve Your Code10 Ways To Improve Your Code
10 Ways To Improve Your Code
 
Machine Learning for .NET Developers - ADC21
Machine Learning for .NET Developers - ADC21Machine Learning for .NET Developers - ADC21
Machine Learning for .NET Developers - ADC21
 
IRJET- Deep Learning Model to Predict Hardware Performance
IRJET- Deep Learning Model to Predict Hardware PerformanceIRJET- Deep Learning Model to Predict Hardware Performance
IRJET- Deep Learning Model to Predict Hardware Performance
 
IRJET- Analysis of PV Fed Vector Controlled Induction Motor Drive
IRJET- Analysis of PV Fed Vector Controlled Induction Motor DriveIRJET- Analysis of PV Fed Vector Controlled Induction Motor Drive
IRJET- Analysis of PV Fed Vector Controlled Induction Motor Drive
 
RivieraJUG - MySQL Indexes and Histograms
RivieraJUG - MySQL Indexes and HistogramsRivieraJUG - MySQL Indexes and Histograms
RivieraJUG - MySQL Indexes and Histograms
 
Webinar: Performance Tuning + Optimization
Webinar: Performance Tuning + OptimizationWebinar: Performance Tuning + Optimization
Webinar: Performance Tuning + Optimization
 
Den Datenschatz heben und Zeit- und Energieeffizienz steigern: Mathematik und...
Den Datenschatz heben und Zeit- und Energieeffizienz steigern: Mathematik und...Den Datenschatz heben und Zeit- und Energieeffizienz steigern: Mathematik und...
Den Datenschatz heben und Zeit- und Energieeffizienz steigern: Mathematik und...
 
[DSC Europe 22] Smart approach in development and deployment process for vari...
[DSC Europe 22] Smart approach in development and deployment process for vari...[DSC Europe 22] Smart approach in development and deployment process for vari...
[DSC Europe 22] Smart approach in development and deployment process for vari...
 
Performance schema in_my_sql_5.6_pluk2013
Performance schema in_my_sql_5.6_pluk2013Performance schema in_my_sql_5.6_pluk2013
Performance schema in_my_sql_5.6_pluk2013
 
Start machine learning in 5 simple steps
Start machine learning in 5 simple stepsStart machine learning in 5 simple steps
Start machine learning in 5 simple steps
 
Ember
EmberEmber
Ember
 
Cloudera Data Science Challenge 3 Solution by Doug Needham
Cloudera Data Science Challenge 3 Solution by Doug NeedhamCloudera Data Science Challenge 3 Solution by Doug Needham
Cloudera Data Science Challenge 3 Solution by Doug Needham
 
Lecture-6-7.pptx
Lecture-6-7.pptxLecture-6-7.pptx
Lecture-6-7.pptx
 
Benchmarking_ML_Tools
Benchmarking_ML_ToolsBenchmarking_ML_Tools
Benchmarking_ML_Tools
 
10 Ways To Improve Your Code( Neal Ford)
10  Ways To  Improve  Your  Code( Neal  Ford)10  Ways To  Improve  Your  Code( Neal  Ford)
10 Ways To Improve Your Code( Neal Ford)
 
Testing Experience - Evolution of Test Automation Frameworks
Testing Experience - Evolution of Test Automation FrameworksTesting Experience - Evolution of Test Automation Frameworks
Testing Experience - Evolution of Test Automation Frameworks
 
Workshop: Your first machine learning project
Workshop: Your first machine learning projectWorkshop: Your first machine learning project
Workshop: Your first machine learning project
 

Recently uploaded

Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Valters Lauzums
 
一比一原版(Monash毕业证书)莫纳什大学毕业证原件一模一样
一比一原版(Monash毕业证书)莫纳什大学毕业证原件一模一样一比一原版(Monash毕业证书)莫纳什大学毕业证原件一模一样
一比一原版(Monash毕业证书)莫纳什大学毕业证原件一模一样
yhavx
 
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotecAbortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
mikehavy0
 
sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444
saurabvyas476
 
Simplify hybrid data integration at an enterprise scale. Integrate all your d...
Simplify hybrid data integration at an enterprise scale. Integrate all your d...Simplify hybrid data integration at an enterprise scale. Integrate all your d...
Simplify hybrid data integration at an enterprise scale. Integrate all your d...
varanasisatyanvesh
 
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di Bontang
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di  Bontangobat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di  Bontang
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di Bontang
siskavia95
 
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
Amil baba
 
Displacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second DerivativesDisplacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second Derivatives
23050636
 

Recently uploaded (20)

Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
 
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
 
Genuine love spell caster )! ,+27834335081) Ex lover back permanently in At...
Genuine love spell caster )! ,+27834335081)   Ex lover back permanently in At...Genuine love spell caster )! ,+27834335081)   Ex lover back permanently in At...
Genuine love spell caster )! ,+27834335081) Ex lover back permanently in At...
 
一比一原版(Monash毕业证书)莫纳什大学毕业证原件一模一样
一比一原版(Monash毕业证书)莫纳什大学毕业证原件一模一样一比一原版(Monash毕业证书)莫纳什大学毕业证原件一模一样
一比一原版(Monash毕业证书)莫纳什大学毕业证原件一模一样
 
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotecAbortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
 
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
 
sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444
 
Simplify hybrid data integration at an enterprise scale. Integrate all your d...
Simplify hybrid data integration at an enterprise scale. Integrate all your d...Simplify hybrid data integration at an enterprise scale. Integrate all your d...
Simplify hybrid data integration at an enterprise scale. Integrate all your d...
 
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di Bontang
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di  Bontangobat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di  Bontang
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di Bontang
 
Solution manual for managerial accounting 8th edition by john wild ken shaw b...
Solution manual for managerial accounting 8th edition by john wild ken shaw b...Solution manual for managerial accounting 8th edition by john wild ken shaw b...
Solution manual for managerial accounting 8th edition by john wild ken shaw b...
 
The Significance of Transliteration Enhancing
The Significance of Transliteration EnhancingThe Significance of Transliteration Enhancing
The Significance of Transliteration Enhancing
 
Bios of leading Astrologers & Researchers
Bios of leading Astrologers & ResearchersBios of leading Astrologers & Researchers
Bios of leading Astrologers & Researchers
 
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
 
Formulas dax para power bI de microsoft.pdf
Formulas dax para power bI de microsoft.pdfFormulas dax para power bI de microsoft.pdf
Formulas dax para power bI de microsoft.pdf
 
NOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam DunksNOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam Dunks
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 
Northern New England Tableau User Group (TUG) May 2024
Northern New England Tableau User Group (TUG) May 2024Northern New England Tableau User Group (TUG) May 2024
Northern New England Tableau User Group (TUG) May 2024
 
社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token Prediction社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token Prediction
 
Displacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second DerivativesDisplacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second Derivatives
 
Chapter 1 - Introduction to Data Mining Concepts and Techniques.pptx
Chapter 1 - Introduction to Data Mining Concepts and Techniques.pptxChapter 1 - Introduction to Data Mining Concepts and Techniques.pptx
Chapter 1 - Introduction to Data Mining Concepts and Techniques.pptx
 

Machine learning key to your formulation challenges

  • 1. Machine Learning, Key to Your Formulation Challenges Marc Borowczak, PRC Consulting LLC (http://www.prcconsulting.net) February 17, 2016 Formulation Challenges are Everywhere… Step1: Retrieve Existing Data Step 2: Normalize Data Step 3: Train and Test a Model Step 4: Evaluate Model Performance Step 5: Improving Model Performance with 5 Hidden Neurons Step 6: Improving Model using Random Forest Algorithm Step 7: Testing Further with Resampling Step 8: Actual Display of a Random Forest Tree Solution Conclusions References Formulation Challenges are Everywhere… You develop pharmaceutical, cosmetic, food, industrial or civil engineered products, and are often confronted with the challenge of blending and formulating to meet process or performance properties. While traditional Research and Development does approach the problem with experimentation, it generally involves designs, time and resource constraints, and can be considered slow, expensive and often times redundant, fast forgotten or perhaps obsolete. Consider the alternative Machine Learning tools offers today. We will show this is not only quick, efficient and ultimately the only way Front End of Innovation should proceed, and how it is particularly suited for formulation and classification. Today, we will explain how Machine Learning can shed new light on this generic and very persistent formulation challenge. We will discuss the other important aspect of classification and clustering often associated with these formulations challenges in a forthcoming communication. Step1: Retrieve Existing Data To illustrate the approach, we selected a formulation dataset hosted on UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html), to predict the compressive strength (http://archive.ics.uci.edu /ml/datasets/Concrete+Compressive+Strength) performance dependency on the formulation ingredients. 
This is the well-known formulation composition - property relationship scientists, engineers and business professionals must address daily and any established R&D would certainly have similar and sometimes hidden knowledge in its archives… We will use R to demonstrate quickly the approach on this dataset (http://archive.ics.uci.edu/ml/machine- learning-databases/concrete/compressive/Concrete_Data.xls), and also demonstrate how reproducibility of the analysis is enforced. The analysis tool and platform are documented, all libraries clearly listed, while data is retrieved programmatically and date stamped from the repository. Machine Learning, Key to Your Formulation Challenges file:///P:/MachineLearningExamples/Machine_Learning_Formulation.html 1 of 11 2/18/2016 5:07 PM
  • 2.
Sys.info()[1:5]
##      sysname      release      version     nodename      machine
##    "Windows"      "7 x64" "build 9200"   "STALLION"     "x86-64"

sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 8 x64 (build 9200)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods
## [8] base
##
## other attached packages:
##  [1] reshape_0.8.5       scales_0.3.0        devtools_1.8.0
##  [4] rpart.plot_1.5.3    rpart_4.1-10        randomForest_4.6-10
##  [7] neuralnet_1.32      MASS_7.3-44         caret_6.0-52
## [10] ggplot2_1.0.1       lattice_0.20-33     stringr_1.0.0
## [13] xlsx_0.5.7          xlsxjars_0.6.1      rJava_0.9-7
##
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.0         git2r_0.11.0        formatR_1.2
##  [4] nloptr_1.0.4        plyr_1.8.3          iterators_1.0.7
##  [7] tools_3.2.2         digest_0.6.8        lme4_1.1-9
## [10] memoise_0.2.1       evaluate_0.7.2      gtable_0.1.2
## [13] nlme_3.1-121        mgcv_1.8-9          Matrix_1.2-2
## [16] foreach_1.4.2       curl_0.9.3          parallel_3.2.2
## [19] yaml_2.1.13         SparseM_1.7         brglm_0.5-9
## [22] proto_0.3-10        xml2_0.1.1          BradleyTerry2_1.0-6
## [25] knitr_1.11          rversions_1.0.2     MatrixModels_0.4-1
## [28] gtools_3.5.0        stats4_3.2.2        nnet_7.3-11
## [31] rmarkdown_0.9.2     minqa_1.2.4         reshape2_1.4.1
## [34] car_2.0-26          magrittr_1.5        codetools_0.2-14
## [37] htmltools_0.2.6     splines_3.2.2       pbkrtest_0.4-2
## [40] colorspace_1.2-6    quantreg_5.18       stringi_0.5-5
## [43] munsell_0.4.2
  • 3.
library(xlsx)
library(stringr)
library(caret)
library(neuralnet)
library(devtools)
library(rpart)
library(rpart.plot)
userdir <- getwd()
datadir <- "./data"
# Create the local data folder on first run
if (!file.exists("data")){dir.create("data")}
fileUrl <- "http://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/Concrete_Data.xls?accessType=DOWNLOAD"
download.file(fileUrl,destfile="./data/Concrete_Data.xls",method="curl")
# Date stamp the download for reproducibility
dateDownloaded <- date()
concrete <- read.xlsx("./data/Concrete_Data.xls",sheetName="Sheet1")
str(concrete)
## 'data.frame': 1030 obs. of 9 variables:
##  $ Cement..component.1..kg.in.a.m.3.mixture.             : num 540 540 332 332 199 ...
##  $ Blast.Furnace.Slag..component.2..kg.in.a.m.3.mixture. : num 0 0 142 142 132 ...
##  $ Fly.Ash..component.3..kg.in.a.m.3.mixture.            : num 0 0 0 0 0 0 0 0 0 0 ...
##  $ Water...component.4..kg.in.a.m.3.mixture.             : num 162 162 228 228 192 228 228 228 228 228 ...
##  $ Superplasticizer..component.5..kg.in.a.m.3.mixture.   : num 2.5 2.5 0 0 0 0 0 0 0 0 ...
##  $ Coarse.Aggregate...component.6..kg.in.a.m.3.mixture.  : num 1040 1055 932 932 978 ...
##  $ Fine.Aggregate..component.7..kg.in.a.m.3.mixture.     : num 676 676 594 594 826 ...
##  $ Age..day.                                             : num 28 28 270 365 360 90 365 28 28 28 ...
##  $ Concrete.compressive.strength.MPa..megapascals..      : num 80 61.9 40.3 41.1 44.3 ...

Step 2: Normalize Data
The dataset information reveals 1030 observations of 9 variables: 8 inputs, of which 7 are ingredients and 1 is a process attribute (Age), and 1 output, the strength property. There are no missing values in this set. We'll truncate the variable names to their first word and normalize the data to the 0 to 1 range, then display a summary of the normalized strength.
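For reference, the scaling applied in this step appears to be ordinary min-max normalization. It can be written as follows, where the symbols $x_i$, $x_i'$ and $\hat{x}'$ are notation introduced here for illustration and do not come from the original report:

```latex
% Min-max normalization of one variable x over the n observations
x_i' = \frac{x_i - \min_j x_j}{\max_j x_j - \min_j x_j},
\qquad 0 \le x_i' \le 1 .

% Inverse transform, mapping a normalized prediction back to the original units
\hat{x} = \hat{x}' \,\bigl(\max_j x_j - \min_j x_j\bigr) + \min_j x_j .
```

Since each variable is scaled by its own minimum and maximum, a normalized strength of 0 corresponds to the weakest mixture in the dataset and 1 to the strongest. The inverse transform is what one would use to report a predicted strength in MPa.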
# Min-max scaling of a numeric vector to the range 0 to 1
normalize <- function(x) {return((x - min(x)) / (max(x) - min(x)))}
# fixed=TRUE treats "." literally; as a regular expression it would match every character
names(concrete)<-gsub("."," ",names(concrete),fixed=TRUE)
# Keep only the first word of each column name
names(concrete)<-word(names(concrete),1)
names(concrete)[9]<-"Strength"
concrete_norm <- as.data.frame(lapply(concrete, normalize))
  • 4.
# The call below is inferred from the output format; it is missing from the transcript
summary(concrete_norm$Strength)
##   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## 0.0000  0.2663  0.4000  0.4172  0.5457  1.0000

These transformations should be typical of a generic formulation, where ingredients and process variables are the independent (input) variables and the property is the dependent (output) variable.

Step 3: Train and Test a Model
We'll now follow a standard approach and randomly split the dataset into a training set and a test set. The caret (https://cran.r-project.org/web/packages/caret/caret.pdf) package implements this task well. We'll use 75% of the data to train the model and the remainder to test it. To make the analysis reproducible, we'll use the set.seed() function.

set.seed(12121)
inTrain<-createDataPartition(y=concrete_norm$Strength,p=0.75,list=FALSE)
concrete_train <- concrete_norm[inTrain, ]
concrete_test <- concrete_norm[-inTrain, ]
concrete_model <- neuralnet(formula = Strength ~ Cement + Blast + Fly + Water + Superplasticizer + Coarse + Fine + Age, data = concrete_train)

The network topology is then easily visualized. For details, see the excellent NeuralNetTool page (https://beckmw.wordpress.com/tag/neural-network/). Suffice it to say that even this simple first attempt highlights the main dependencies: higher impacts are highlighted with thicker links in the typical neural net representation. Here the I's represent inputs, O is the output, and H and B are the Hidden and Bias nodes as defined in the theory. Note that a single bias node is added for each input and hidden layer to accommodate input features equal to 0.
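To make the roles of the I, H, B and O nodes concrete, here is a minimal base R sketch of the forward pass through a network with one hidden node. It assumes a logistic activation in the hidden layer and a linear output node, which are the neuralnet defaults. The function name forward_pass and all weight values are hypothetical, chosen only for illustration; they are not the weights fitted in this analysis.

```r
# Logistic activation applied in the hidden layer
sigmoid <- function(z) 1 / (1 + exp(-z))

# Forward pass for one hidden layer and a linear output node.
# x        : numeric vector of inputs (I nodes)
# w_hidden : matrix, one row per input and one column per hidden node (I -> H links)
# b_hidden : bias weights feeding the hidden nodes (first B node)
# w_out    : weights from the hidden nodes to the output (H -> O links)
# b_out    : bias weight feeding the output (second B node)
forward_pass <- function(x, w_hidden, b_hidden, w_out, b_out) {
  hidden <- sigmoid(as.vector(x %*% w_hidden) + b_hidden)
  sum(hidden * w_out) + b_out
}

# Hypothetical weights for 8 inputs and 1 hidden node (illustrative only)
w_hidden <- matrix(c(0.9, 0.4, 0.1, -0.6, 0.3, 0.0, -0.1, 1.2), ncol = 1)
b_hidden <- -0.5
w_out    <- 0.8
b_out    <- 0.05

# With every input at 0, the link weights drop out and only the bias terms
# determine the prediction
zero_mix  <- rep(0, 8)
pred_zero <- forward_pass(zero_mix, w_hidden, b_hidden, w_out, b_out)
```

The all-zero input shows why the bias nodes matter: without them the hidden node would always receive 0 for such a mixture, whatever the link weights.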
  • 5. Step 4: Evaluate Model Performance
We will now compute predictions on the test set and examine the correlation between predicted and actual strength values.

# Columns 1 to 8 are the inputs; column 9 is Strength
model_results <- compute(concrete_model, concrete_test[1:8])
predicted_strength <- model_results$net.result
cor(predicted_strength, concrete_test$Strength)[1,1]
## [1] 0.8336352308

The default neural net, which has a single hidden neuron, exhibits a correlation of 0.8336352. We can certainly try to improve it by including a few more hidden neurons…

Step 5: Improving Model Performance with 5 Hidden Neurons

concrete_model2 <- neuralnet(formula = Strength ~ Cement + Blast + Fly + Water + Superplasticizer + Coarse + Fine + Age, data = concrete_train, hidden=5)
# plot.nnet() is not part of neuralnet; it is assumed to have been sourced from the NeuralNetTool page cited in Step 3
plot.nnet(concrete_model2,cex.val=0.75)
  • 6. We observe that Age remains a key contributor, but the effects of Water, Superplasticizer, Cement and Blast are also visibly ranked. Four of the five hidden nodes contribute about evenly…

model_results2 <- compute(concrete_model2, concrete_test[1:8])
predicted_strength2 <- model_results2$net.result
cor(predicted_strength2, concrete_test$Strength)[1,1]
## [1] 0.9138705528

# Scatter plot of actual versus predicted normalized strength
p <- plot(concrete_test$Strength,predicted_strength2)
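The correlations reported in Steps 4 and 5 show how closely predictions track actual values, but not how large the errors are in physical units. As a complement, here is a minimal base R sketch that converts normalized values back to MPa and computes two common error measures. The helper names (rmse, mae, denormalize), the four data points and the strength range are all hypothetical placeholders, not results from this analysis.

```r
# Root-mean-square error and mean absolute error between two numeric vectors
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))

# Invert min-max scaling: x = x_norm * (max - min) + min
denormalize <- function(x_norm, x_min, x_max) x_norm * (x_max - x_min) + x_min

# Hypothetical normalized values standing in for concrete_test$Strength
# and predicted_strength2 (illustrative numbers only)
actual_norm    <- c(0.20, 0.45, 0.60, 0.85)
predicted_norm <- c(0.25, 0.40, 0.65, 0.80)

# Hypothetical strength range in MPa; in practice take min() and max()
# of the raw Strength column before it is normalized
str_min <- 10
str_max <- 90

actual_mpa    <- denormalize(actual_norm, str_min, str_max)
predicted_mpa <- denormalize(predicted_norm, str_min, str_max)

rmse_mpa <- rmse(actual_mpa, predicted_mpa)
mae_mpa  <- mae(actual_mpa, predicted_mpa)
```

In this toy example every prediction is off by 0.05 on the normalized scale, so both measures come out at 4 MPa over the assumed 80 MPa range. The correlation is unaffected by the rescaling, which is why it is worth reporting an error measure alongside it.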
  • 7. Step 6: Improving Model using Random Forest Algorithm
We will now rely on the Random Forest algorithm, trained through caret, and attempt a model improvement.

model_result3 <- train(Strength ~ ., data = concrete_train,method='rf',prox=TRUE)
model_result3
  • 8.
## Random Forest
##
## 774 samples
##   8 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 774, 774, 774, 774, 774, 774, ...
## Resampling results across tuning parameters:
##
##   mtry  RMSE           Rsquared      RMSE SD         Rsquared SD
##   2     0.07833903729  0.8774837294  0.004515965257  0.01607292866
##   5     0.07159729020  0.8869237640  0.004510365293  0.01395064039
##   8     0.07365915696  0.8777421774  0.005197103383  0.01691711063
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.

predicted_strength3 <- predict(model_result3,concrete_test)
cor(predicted_strength3, concrete_test$Strength)
## [1] 0.943384811

The default Random Forest algorithm helped improve our prediction and exhibits a correlation of 0.9433848. Note that caret already resampled by default, using 25 bootstrap repetitions. We can try to improve further by changing the resampling scheme… The caret package offers multiple methods to try out. We'll just try one, 10-fold cross-validation, to give an idea…

Step 7: Testing Further with Resampling

model_result4 <- train(Strength ~ ., method='rf',data = concrete_train,verbose=FALSE,trControl = trainControl(method="cv"))
model_result4
  • 9.
## Random Forest
##
## 774 samples
##   8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 696, 696, 695, 697, 698, 696, ...
## Resampling results across tuning parameters:
##
##   mtry  RMSE        Rsquared   RMSE SD     Rsquared SD
##   2     0.06910599  0.9068129  0.01101706  0.04013275
##   5     0.06227778  0.9121046  0.01232558  0.03945448
##   8     0.06287753  0.9089120  0.01208088  0.03835530
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.

predicted_strength4 <- predict(model_result4,concrete_test)
cor(predicted_strength4, concrete_test$Strength)
## [1] 0.9428245

We observe that the prediction is practically unchanged in this case, with a correlation of 0.9428245. Still not bad for a quick analysis performed on existing data. Regardless of our property target, we have already derived key areas to investigate more deeply… and can clearly see some key ingredients (Cement, Blast, Fly, Superplasticizer, Water…) and the Age process variable as determining factors for strength performance. So naturally, one may want to display this model.

Step 8: Actual Display of a Random Forest Tree Solution
It turns out that so-called black-box models are, well, meant to stay in their box! However, the rpart (https://cran.r-project.org/web/packages/rpart/rpart.pdf) and rpart.plot (https://cran.r-project.org/web/packages/rpart.plot/rpart.plot.pdf) packages make it easy to visualize even complex trees. Here we fit a single regression tree on the same training data as an interpretable stand-in for the forest.

strength.tree <- rpart(Strength ~ .,data=concrete_train, control=rpart.control(minsplit=20,cp=0.002))
prp(strength.tree,compress=TRUE)
  • 10. In the tree, normalized strength is indicated in the oval leaves, which are ordered from low to high strength, left to right.

Conclusions
We hope this typical example demonstrates that Machine Learning algorithms are well positioned to help resolve formulation challenges, offering a fast, efficient and economical alternative to tedious experimentation. It is easy to imagine how similar questions can be resolved in all types of R&D, in materials, cosmetics, food or any scientific area. Rubber formulations to minimize rolling resistance and emissions, modern composites to build renewable energy sources, lightweight transportation vehicles and next-generation public transit, as well as innovative UV-shield ointments and tasty snacks and drinks all present similar challenges, where only the nature of the inputs and outputs varies. Therefore, the method can and should be applied broadly!
Next time, we'll review how to address another common challenge: classification and clustering. Till then, we hope this approach has triggered interest. Why not try to implement Machine Learning in your own scientific or technical area of expertise? Remember, PRC Consulting, LLC (http://www.prcconsulting.net) is dedicated to boosting innovation through improved Analytics, one customer at a time!

References
  • 11. The following sources are referenced as they provided significant help and information to develop this Machine Learning analysis applied to formulations:
1. UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html)
2. caret (https://cran.r-project.org/web/packages/caret/caret.pdf)
3. NeuralNetTool (https://beckmw.wordpress.com/tag/neural-network/)
4. rpart (https://cran.r-project.org/web/packages/rpart/rpart.pdf)
5. rpart.plot (https://cran.r-project.org/web/packages/rpart.plot/rpart.plot.pdf)
6. RStudio (https://www.rstudio.com)