Machine Learning, Key to Your Formulation Challenges
Marc Borowczak, PRC Consulting LLC (http://www.prcconsulting.net)
February 17, 2016
Formulation Challenges are Everywhere…
Step 1: Retrieve Existing Data
Step 2: Normalize Data
Step 3: Train and Test a Model
Step 4: Evaluate Model Performance
Step 5: Improving Model Performance with 5 Hidden Neurons
Step 6: Improving Model using Random Forest Algorithm
Step 7: Testing Further with Resampling
Step 8: Actual Display of a Random Forest Tree Solution
Conclusions
References
Formulation Challenges are Everywhere…
If you develop pharmaceutical, cosmetic, food, industrial or civil-engineered products, you are often confronted
with the challenge of blending and formulating to meet process or performance targets. Traditional
Research and Development approaches the problem through experimentation, but this generally involves
experimental designs and time and resource constraints, and it can be slow, expensive and often redundant,
with results quickly forgotten or made obsolete.
Consider the alternative that Machine Learning tools offer today. We will show that this approach is quick and
efficient, that it is particularly suited to formulation and classification, and that it is ultimately, in our view,
the way the Front End of Innovation should proceed.
Today, we will explain how Machine Learning can shed new light on this generic and very persistent formulation
challenge. We will discuss classification and clustering, the other important aspect often associated with these
formulation challenges, in a forthcoming communication.
Step 1: Retrieve Existing Data
To illustrate the approach, we selected a formulation dataset hosted on the UCI Machine Learning Repository
(http://archive.ics.uci.edu/ml/datasets.html) to predict how compressive strength
(http://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength) depends on the formulation ingredients.
This is the well-known composition-property relationship that scientists, engineers and business professionals
must address daily, and any established R&D organization would certainly have similar, and sometimes hidden,
knowledge in its archives…
We will use R to quickly demonstrate the approach on this dataset
(http://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/Concrete_Data.xls), and also
show how reproducibility of the analysis is enforced: the analysis tool and platform are documented, all
libraries are clearly listed, and the data is retrieved programmatically from the repository and date-stamped.

Sys.info()[1:5]
## sysname release version nodename machine
## "Windows" "7 x64" "build 9200" "STALLION" "x86-64"
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 8 x64 (build 9200)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] reshape_0.8.5 scales_0.3.0 devtools_1.8.0
## [4] rpart.plot_1.5.3 rpart_4.1-10 randomForest_4.6-10
## [7] neuralnet_1.32 MASS_7.3-44 caret_6.0-52
## [10] ggplot2_1.0.1 lattice_0.20-33 stringr_1.0.0
## [13] xlsx_0.5.7 xlsxjars_0.6.1 rJava_0.9-7
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.0 git2r_0.11.0 formatR_1.2
## [4] nloptr_1.0.4 plyr_1.8.3 iterators_1.0.7
## [7] tools_3.2.2 digest_0.6.8 lme4_1.1-9
## [10] memoise_0.2.1 evaluate_0.7.2 gtable_0.1.2
## [13] nlme_3.1-121 mgcv_1.8-9 Matrix_1.2-2
## [16] foreach_1.4.2 curl_0.9.3 parallel_3.2.2
## [19] yaml_2.1.13 SparseM_1.7 brglm_0.5-9
## [22] proto_0.3-10 xml2_0.1.1 BradleyTerry2_1.0-6
## [25] knitr_1.11 rversions_1.0.2 MatrixModels_0.4-1
## [28] gtools_3.5.0 stats4_3.2.2 nnet_7.3-11
## [31] rmarkdown_0.9.2 minqa_1.2.4 reshape2_1.4.1
## [34] car_2.0-26 magrittr_1.5 codetools_0.2-14
## [37] htmltools_0.2.6 splines_3.2.2 pbkrtest_0.4-2
## [40] colorspace_1.2-6 quantreg_5.18 stringi_0.5-5
## [43] munsell_0.4.2

library(xlsx)
library(stringr)
library(caret)
library(neuralnet)
library(devtools)
library(rpart)
library(rpart.plot)
userdir <- getwd()
datadir <- "./data"
if (!file.exists("data")){dir.create("data")}
fileUrl <- "http://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/Concrete_Data.xls?accessType=DOWNLOAD"
download.file(fileUrl,destfile="./data/Concrete_Data.xls",method="curl")
dateDownloaded <- date()
concrete <- read.xlsx("./data/Concrete_Data.xls",sheetName="Sheet1")
str(concrete)
## 'data.frame': 1030 obs. of 9 variables:
## $ Cement..component.1..kg.in.a.m.3.mixture. : num 540 540 332 332 199 ...
## $ Blast.Furnace.Slag..component.2..kg.in.a.m.3.mixture.: num 0 0 142 142 132 ...
## $ Fly.Ash..component.3..kg.in.a.m.3.mixture. : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Water...component.4..kg.in.a.m.3.mixture. : num 162 162 228 228 192 228 228 228 228 228 ...
## $ Superplasticizer..component.5..kg.in.a.m.3.mixture. : num 2.5 2.5 0 0 0 0 0 0 0 0 ...
## $ Coarse.Aggregate...component.6..kg.in.a.m.3.mixture. : num 1040 1055 932 932 978 ...
## $ Fine.Aggregate..component.7..kg.in.a.m.3.mixture. : num 676 676 594 594 826 ...
## $ Age..day. : num 28 28 270 365 360 90 365 28 28 28 ...
## $ Concrete.compressive.strength.MPa..megapascals.. : num 80 61.9 40.3 41.1 44.3 ...
Step 2: Normalize Data
The dataset information reveals 1030 observations of 9 variables: 8 inputs, of which 7 are ingredients and
1 is a process attribute (Age), and 1 output, the strength property. There are no missing values in this set. We’ll
truncate the variable names, normalize the data to the [0, 1] range, and display a summary of the normalized strength.
normalize <- function(x) {return((x - min(x)) / (max(x) - min(x)))}
# Escape the dot so gsub() replaces literal periods rather than every character
names(concrete)<-gsub("\\."," ",names(concrete))
names(concrete)<-word(names(concrete),1)
names(concrete)[9]<-"Strength"
concrete_norm <- as.data.frame(lapply(concrete, normalize))
summary(concrete_norm$Strength)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2663 0.4000 0.4172 0.5457 1.0000
These transformations should be typical of a generic formulation where ingredients and process variables are
independent or input variables, and property is a dependent or output variable.
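Because the model is trained on normalized values, its predictions are also on the [0, 1] scale. As a minimal sketch, assuming we keep the original minimum and maximum of the property, the scaling can be reversed to report strength in MPa again. The denormalize() helper is hypothetical (it is not part of the original analysis), and the five strength values are simply the rounded ones displayed by str(concrete) above.

```r
# Min-max scaling to [0, 1], same logic as the normalize() helper above
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Hypothetical inverse helper: maps scaled values back to original units,
# using the minimum and maximum of the original (unscaled) variable
denormalize <- function(x_norm, original) {
  x_norm * (max(original) - min(original)) + min(original)
}

# First five strength values (MPa) as displayed by str(concrete), rounded
strength_mpa <- c(80, 61.9, 40.3, 41.1, 44.3)

strength_norm <- normalize(strength_mpa)
strength_back <- denormalize(strength_norm, strength_mpa)

print(range(strength_norm))                    # lowest and highest scaled values
print(all.equal(strength_back, strength_mpa))  # round trip recovers the inputs
```

The same inverse would apply to model predictions, provided the minimum and maximum come from the data that was used for scaling.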
Step 3: Train and Test a Model
The method we’ll follow now is a standard approach: we randomly split the data set into a training set and a test set.
The caret (https://cran.r-project.org/web/packages/caret/caret.pdf) package implements this task well. We’ll use
75% of the data to train the model and the remainder to test it. To make the analysis reproducible, we’ll use the
set.seed() function.
set.seed(12121)
inTrain<-createDataPartition(y=concrete_norm$Strength,p=0.75,list=FALSE)
concrete_train <- concrete_norm[inTrain, ]
concrete_test <- concrete_norm[-inTrain, ]
concrete_model <- neuralnet(formula = Strength ~ Cement + Blast + Fly + Water + Superplasticizer + Coarse + Fine + Age, data = concrete_train)
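For readers without caret at hand, the 75/25 split can be approximated in base R. This is only a sketch on synthetic stand-in data (the toy data frame and its columns are assumptions, not the concrete set), and it draws a simple random sample, whereas createDataPartition() samples within groups of the outcome, so row counts may differ slightly from caret's.

```r
set.seed(12121)                      # same seed value as above, for repeatability

n <- 1030                            # number of observations reported by str(concrete)
toy <- data.frame(id = seq_len(n),   # synthetic stand-in data, not the concrete set
                  y  = runif(n))

# Draw 75% of the row indices at random, without replacement
train_idx <- sample(seq_len(n), size = floor(0.75 * n))
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

print(c(train = nrow(toy_train), test = nrow(toy_test)))
```

Every row lands in exactly one of the two sets, which is the property that matters when the test set is later used to judge the model.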
The network topology is then easily visualized. For details, see the excellent NeuralNetTool page
(https://beckmw.wordpress.com/tag/neural-network/). Suffice it to say that even this simple first attempt highlights
the main dependencies: higher impacts are highlighted with thicker links in the typical neural net
representation. Here the I’s represent inputs, O is the output, and H and B are the hidden and bias nodes as defined in
the theory. Note that a single bias node is added for the input layer and for each hidden layer to accommodate input
features equal to 0.

Step 4: Evaluate Model Performance
We will now compute predictions and compare them to actual strength and examine the correlation between
predicted and actual strength values.
model_results <- compute(concrete_model, concrete_test[1:8])
predicted_strength <- model_results$net.result
cor(predicted_strength, concrete_test$Strength)[1,1]
## [1] 0.8336352308
The default neural net, which uses a single hidden neuron, exhibits a correlation of 0.8336352. We can certainly
try to improve it by adding more hidden neurons…
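Correlation measures how well predictions track the actual values, but not how far off they are. As a hedged complement, error metrics such as RMSE and MAE can be computed alongside it. The rmse() and mae() helpers and the numbers below are illustrative assumptions only, not output from the models in this analysis.

```r
# Root mean squared error and mean absolute error between two numeric vectors
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))

# Illustrative values only (not model output): predictions shifted by a constant
actual    <- c(0.10, 0.30, 0.50, 0.70, 0.90)
predicted <- actual + 0.20

print(cor(predicted, actual))   # correlation is blind to the constant offset
print(rmse(actual, predicted))  # the error metrics expose it
print(mae(actual, predicted))
```

In practice the same two helpers could be applied to the predicted and actual test-set strengths to report error on the normalized scale.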
Step 5: Improving Model Performance with 5 Hidden Neurons
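concrete_model2 <- neuralnet(formula = Strength ~ Cement + Blast + Fly + Water + Superplasticizer + Coarse + Fine + Age, data = concrete_train, hidden=5)
# plot.nnet() is not part of the neuralnet package; it must be sourced beforehand (see the NeuralNetTool page cited above)
plot.nnet(concrete_model2,cex.val=0.75)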

We observe that Age remains a key contributor, but the effects of Water, Superplasticizer, Cement and Blast are also
visibly ranked. Four of the five hidden nodes contribute about evenly…
model_results2 <- compute(concrete_model2, concrete_test[1:8])
predicted_strength2 <- model_results2$net.result
cor(predicted_strength2, concrete_test$Strength)[1,1]
## [1] 0.9138705528
plot(concrete_test$Strength, predicted_strength2, xlab="Actual normalized strength", ylab="Predicted normalized strength")

Step 6: Improving Model using Random Forest Algorithm
We will now turn to the Random Forest algorithm and attempt to improve the model.
model_result3 <- train(Strength ~ ., data = concrete_train,method='rf',prox=TRUE)
model_result3

## Random Forest
##
## 774 samples
## 8 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 774, 774, 774, 774, 774, 774, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared RMSE SD Rsquared SD
## 2 0.07833903729 0.8774837294 0.004515965257 0.01607292866
## 5 0.07159729020 0.8869237640 0.004510365293 0.01395064039
## 8 0.07365915696 0.8777421774 0.005197103383 0.01691711063
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.
predicted_strength3 <- predict(model_result3,concrete_test)
cor(predicted_strength3, concrete_test$Strength)
## [1] 0.943384811
The default Random Forest algorithm improved our prediction, with a correlation of 0.9433848. By default, caret
tuned the model on 25 bootstrap resamples; we can also try a different resampling scheme. The caret package offers
multiple methods to try out. We’ll try just one, 10-fold cross-validation, to give an idea…
Step 7: Testing Further with Resampling
model_result4 <- train(Strength ~ ., method='rf',data = concrete_train,verbose=FALSE,trControl = trainControl(method="cv"))
model_result4

## Random Forest
##
## 774 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 696, 696, 695, 697, 698, 696, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared RMSE SD Rsquared SD
## 2 0.06910599 0.9068129 0.01101706 0.04013275
## 5 0.06227778 0.9121046 0.01232558 0.03945448
## 8 0.06287753 0.9089120 0.01208088 0.03835530
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.
predicted_strength4 <- predict(model_result4,concrete_test)
cor(predicted_strength4, concrete_test$Strength)
## [1] 0.9428245
We observe the prediction is practically unchanged in this case, with a correlation of 0.9428245. Still, not bad for
a quick analysis performed on existing data. Regardless of our property target, we have already identified key areas
to investigate more deeply, and can clearly see some key ingredients (Cement, Blast, Fly, Superplasticizer, Water…)
and the Age process variable as determining factors for strength performance. So naturally, one may want to
display this model.
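To make the cross-validation used in this step more concrete, here is a minimal base-R sketch of 10-fold cross-validation done by hand. It runs on synthetic stand-in data and uses lm() in place of the random forest purely to stay dependency-free; the toy data frame, its coefficients and the noise level are all assumptions for illustration, and caret's own fold assignment differs in detail.

```r
set.seed(1)

# Synthetic stand-in data (not the concrete set): y depends linearly on x plus noise
n   <- 774                                  # training-set size reported by caret above
toy <- data.frame(x = runif(n))
toy$y <- 0.2 + 0.5 * toy$x + rnorm(n, sd = 0.05)

k     <- 10
folds <- sample(rep(seq_len(k), length.out = n))   # random fold label for each row

fold_rmse <- sapply(seq_len(k), function(i) {
  fit  <- lm(y ~ x, data = toy[folds != i, ])        # train on the other k - 1 folds
  pred <- predict(fit, newdata = toy[folds == i, ])  # predict the held-out fold
  sqrt(mean((toy$y[folds == i] - pred)^2))           # RMSE on the held-out fold
})

print(table(folds))      # fold sizes
print(mean(fold_rmse))   # cross-validated RMSE
print(sd(fold_rmse))     # spread of RMSE across folds
```

The mean and standard deviation across folds correspond in spirit to the RMSE and RMSE SD columns that caret reports for each tuning value.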
Step 8: Actual Display of a Random Forest Tree Solution
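It turns out that so-called black-box models are, well, meant to stay in their box! However, the rpart
(https://cran.r-project.org/web/packages/rpart/rpart.pdf) and rpart.plot
(https://cran.r-project.org/web/packages/rpart.plot/rpart.plot.pdf) packages make it easy to visualize even complex
trees. Note that rpart fits a single regression tree to the training data rather than extracting a tree from the
random forest, so the display is an interpretable companion to the forest, not the forest itself.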
strength.tree <- rpart(Strength ~ .,data=concrete_train, control=rpart.control(minsplit=20,cp=0.002))
prp(strength.tree,compress=TRUE)

In the tree, the normalized strength is indicated in the oval leaves, which are ranked from low to high from the
left to the right branches.
Conclusions
We hope this typical example demonstrates that Machine Learning algorithms are well positioned to help
resolve formulation challenges, offering a fast, efficient and economical alternative to tedious experimentation. It
is easy to imagine how similar questions can be resolved in all types of R&D, in materials, cosmetics, food or
any scientific area.
Rubber formulations to minimize rolling resistance and emissions, modern composites to build renewable
energy sources or lightweight transportation vehicles and next-generation public transit, as well as innovative
UV-shield ointments and tasty snacks and drinks all present similar challenges where only the nature of
the inputs and outputs varies. Therefore, the method can and should be applied broadly!
Next time, we’ll review how to address another common challenge: classification and clustering. Until then, we
hope this approach has triggered interest.
Why not try and implement Machine Learning in your scientific or technical area of expertise? Remember, PRC
Consulting, LLC (http://www.prcconsulting.net) is dedicated to boosting innovation through improved Analytics, one
customer at a time!
References

The following sources are referenced as they provided significant help and information to develop this Machine
Learning analysis applied to formulations:
1. UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html)
2. caret (https://cran.r-project.org/web/packages/caret/caret.pdf)
3. NeuralNetTool (https://beckmw.wordpress.com/tag/neural-network/)
4. rpart (https://cran.r-project.org/web/packages/rpart/rpart.pdf)
5. rpart.plot (https://cran.r-project.org/web/packages/rpart.plot/rpart.plot.pdf)
6. RStudio (https://www.rstudio.com)