Building efficient machine learning applications is not a simple task. The typical engineering process iterates over data wrangling, feature generation, model selection, hyperparameter tuning, and evaluation. The number of possible combinations of input features, algorithms, and parameters makes this process too complex for even experts to perform efficiently. Automating it is especially important when building machine learning applications for thousands of customers. In this talk I demonstrate how we build effective ML models using the AutoML capabilities we develop at Salesforce, which include techniques for automatic data processing, feature generation, model selection, hyperparameter tuning, and evaluation. I present several of the implemented solutions in Scala and Spark.
2. “This lonely scene, the galaxies like dust, is what most of space looks like. This emptiness is normal. The richness of our own neighborhood is the exception.”
– Powers of Ten (1977), by Charles and Ray Eames
3. Powers of Ten (1977)
A journey between a quark and the observable universe, spanning scales from 10^-17 to 10^24 meters.
4. Powers of Ten for Machine Learning
• Data collection
• Data preparation
• Feature engineering
• Feature selection
• Sampling
• Algorithm implementation
• Hyperparameter tuning
• Model selection
• Model serving (scoring)
• Prediction insights
• Metrics
5. How long does it take to build a machine learning application?
a) Hours
b) Days
c) Weeks
d) Months
e) More
6. How to cope with this complexity?
E = mc²
Free[F[_], A]
M[A]
Functor[F[_]]
Cofree[S[_], A]
Months -> Hours
7. “The task of the software development team is to engineer the illusion of simplicity.”
– Grady Booch
9. Appropriate Level of Abstraction
A language’s syntax & semantics define its degrees of freedom.
• Lower abstraction: difficult to use, more complex, error prone
• Higher abstraction: less flexible, simpler syntax, encourages reuse, suitable for complex problems
So where is the appropriate level?
10. “FP removes one important dimension of complexity: to understand a program part (a function) you need no longer account for the possible histories of executions that can lead to that program part.”
– Martin Odersky
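Odersky’s point can be made concrete with a tiny illustrative Scala example (not from the talk): the pure function’s result is fully determined by its arguments, while the impure one depends on the history of previous calls.

```scala
object Purity {
  // Pure: the same input always yields the same output,
  // so no execution history needs to be considered.
  def addPure(x: Int, y: Int): Int = x + y

  // Impure: the result depends on how many times it was called before.
  private var calls = 0
  def addImpure(x: Int, y: Int): Int = { calls += 1; x + y + calls }
}
```

Reasoning about `addPure` needs only its arguments; reasoning about `addImpure` requires knowing every call that came before it.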
14. Optimus Prime
An AutoML library for building modular,
reusable, strongly typed ML workflows on Spark
• Declarative & intuitive syntax
• Proper level of abstraction
• Aimed at simplicity & reuse
• >90% accuracy with 100X reduction in time
16. Type Safety Everywhere
• Value Operations
• Feature Operations
• Transformation Pipelines (aka Workflow)
// Typed value operations
def tokenize(t: Text): TextList = t.map(_.split(" ")).toTextList

// Typed feature operations
val title: Feature[Text] =
  FeatureBuilder.Text[Book].extract(_.title).asPredictor
val tokens: Feature[TextList] = title.map(tokenize)

// Transformation pipelines
new OpWorkflow().setInput(books).setResultFeatures(tokens.vectorize())
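For intuition, here is a minimal, Spark-free sketch of the same idea; the `Text`/`TextList` case classes and the `Ops` object are hypothetical stand-ins, not the actual Optimus Prime types:

```scala
// Hypothetical typed value wrappers: a nullable text field
// and a list of tokens.
final case class Text(value: Option[String])
final case class TextList(value: Seq[String])

object Ops {
  // Typed value operation: Text in, TextList out. Passing any
  // other type here is a compile-time error, not a runtime one.
  def tokenize(t: Text): TextList =
    TextList(t.value.map(_.split(" ").toSeq).getOrElse(Seq.empty))
}
```

The payoff is that type mismatches in a pipeline are rejected by the compiler before any data is touched.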
17. Book Price Prediction
// Raw feature definitions
val authr = FeatureBuilder.PickList[Book].extract(_.author).asPredictor
val title = FeatureBuilder.Text[Book].extract(_.title).asPredictor
val descr = FeatureBuilder.Text[Book].extract(_.description).asPredictor
val price = FeatureBuilder.RealNN[Book].extract(_.price).asResponse

// Feature engineering: tokenize, tf-idf etc.
val tokns = (title + descr).tokenize(removePunctuation = true)
val tfidf = tokns.tf(numTerms = 1024).idf(minFreq = 0)
val feats = Seq(tfidf, authr).vectorize()

// Model training
implicit val spark = SparkSession.builder.config(new SparkConf).getOrCreate
val books = spark.read.csv("books.csv").as[Book]
val preds = RegressionModelSelector().setInput(price, feats).getOutput
new OpWorkflow().setInput(books).setResultFeatures(feats, preds).train()
18. Magic Behind “vectorize()”
// Raw feature definitions
val authr = FeatureBuilder.PickList[Book].extract(_.author).asPredictor
val title = FeatureBuilder.Text[Book].extract(_.title).asPredictor
val descr = FeatureBuilder.Text[Book].extract(_.description).asPredictor
val price = FeatureBuilder.RealNN[Book].extract(_.price).asResponse

// Feature engineering: tokenize, tf-idf etc.
val tokns = (title + descr).tokenize(removePunctuation = true)
val tfidf = tokns.tf(numTerms = 1024).idf(minFreq = 0)
val feats = Seq(tfidf, authr).vectorize() // <- magic here

// Model training
implicit val spark = SparkSession.builder.config(new SparkConf).getOrCreate
val books = spark.read.csv("books.csv").as[Book]
val preds = RegressionModelSelector().setInput(price, feats).getOutput
new OpWorkflow().setInput(books).setResultFeatures(feats, preds).train()
20. Automatic Feature Engineering
• Numeric: Imputation, Track null value, Log transformation for large range, Scaling (zNormalize), Smart Binning
• Categorical: Imputation, Track null value, One Hot Encoding, Dynamic Top K pivot, Smart Binning, LabelCount Encoding, Category Embedding
• Text: Tokenization, Hash Encoding, TF-IDF, Word2Vec, Sentiment Analysis, Language Detection
• Temporal: Time difference, Time Binning, Time extraction (day, week, month, year), Closeness to major events
• Spatial: Geo-encoding, Augment with external data (e.g. avg income), Spatial fraudulent behavior (e.g. impossible travel speed)
• More…
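As one concrete illustration of the text path above (tokenization followed by hash encoding with term frequencies), here is a plain-Scala sketch; the `TextVectorizer` object is hypothetical, not the library’s implementation:

```scala
object TextVectorizer {
  // Lowercase and split on non-word characters.
  def tokenize(s: String): Seq[String] =
    s.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq

  // Hash each token into one of `numBins` buckets and count
  // occurrences, producing a fixed-width term-frequency vector.
  def hashTF(tokens: Seq[String], numBins: Int): Array[Double] = {
    val vec = Array.fill(numBins)(0.0)
    tokens.foreach { t =>
      val bin = ((t.hashCode % numBins) + numBins) % numBins
      vec(bin) += 1.0
    }
    vec
  }
}
```

Hashing keeps the vector width fixed regardless of vocabulary size, at the cost of occasional bucket collisions.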
21. Automatic Feature Selection
• Analyze features & calculate statistics
• Ensure features have acceptable ranges
• Is this feature a leaker?
• Does this feature help our model? Is it
predictive?
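The leakage and low-signal checks can be sketched with plain statistics. The following illustrative `SanityCheck` object (not the library’s code) flags a feature whose Pearson correlation with the label is suspiciously high, or whose variance is too low to carry signal; the thresholds here are illustrative defaults:

```scala
object SanityCheck {
  // Pearson correlation coefficient between two equal-length sequences.
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    val n = xs.size
    val mx = xs.sum / n
    val my = ys.sum / n
    val cov = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
    val sx = math.sqrt(xs.map(x => (x - mx) * (x - mx)).sum)
    val sy = math.sqrt(ys.map(y => (y - my) * (y - my)).sum)
    cov / (sx * sy)
  }

  // A feature is "bad" if it is nearly constant (no signal) or
  // correlates with the label almost perfectly (likely a leaker).
  def isBadFeature(feature: Seq[Double], label: Seq[Double],
                   maxCorrelation: Double = 0.95,
                   minVariance: Double = 1e-5): Boolean = {
    val mean = feature.sum / feature.size
    val variance =
      feature.map(x => (x - mean) * (x - mean)).sum / feature.size
    variance < minVariance || math.abs(pearson(feature, label)) > maxCorrelation
  }
}
```

A feature that perfectly predicts the label usually means label information leaked into the inputs, so it is removed rather than celebrated.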
22. Automatic Feature Selection
// Sanity check your features against the label
val checked = price.check(
  featureVector = feats,
  checkSample = 0.3,
  sampleSeed = 1L,
  sampleLimit = 100000L,
  maxCorrelation = 0.95,
  minCorrelation = 0.0,
  correlationType = Pearson,
  minVariance = 0.00001,
  removeBadFeatures = true
)

new OpWorkflow().setInput(books).setResultFeatures(checked, preds).train()
23. Automatic Model Selection
• Multiple algorithms to pick from
• Many hyperparameters for each algorithm
• Automated hyperparameter tuning
– Faster model creation with improved metrics
– Search algorithms to find optimal hyperparameters, e.g. grid search, random search, bandit methods
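To make the tuning step concrete, here is an illustrative grid search in plain Scala over a toy one-parameter model; it sketches the idea, not the selector’s actual implementation:

```scala
object GridSearch {
  // Root-mean-squared error between predictions and labels.
  def rmse(preds: Seq[Double], labels: Seq[Double]): Double =
    math.sqrt(
      preds.zip(labels).map { case (p, y) => (p - y) * (p - y) }.sum
        / preds.size)

  // Toy "model": predict slope * x. Grid search tries every
  // candidate slope and keeps the one minimizing validation RMSE.
  def bestSlope(xs: Seq[Double], ys: Seq[Double],
                grid: Seq[Double]): Double =
    grid.minBy(s => rmse(xs.map(_ * s), ys))
}
```

Random search and bandit methods replace the exhaustive `grid` with smarter sampling, but the evaluate-and-keep-the-best loop is the same.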
24. Automatic Model Selection
// Model selection and hyperparameter tuning
val preds =
  RegressionModelSelector
    .withCrossValidation(
      dataSplitter = DataSplitter(reserveTestFraction = 0.1),
      numFolds = 3,
      validationMetric = Evaluators.Regression.rmse(),
      trainTestEvaluators = Seq.empty,
      seed = 1L)
    .setModelsToTry(LinearRegression, RandomForestRegression)
    .setLinearRegressionElasticNetParam(0, 0.5, 1)
    .setLinearRegressionMaxIter(10, 100)
    .setLinearRegressionSolver(Solver.LBFGS)
    .setRandomForestMaxDepth(2, 10)
    .setRandomForestNumTrees(10)
    .setInput(price, checked).getOutput

new OpWorkflow().setInput(books).setResultFeatures(checked, preds).train()
27. How well does it work?
• Most of our models deployed in production
are completely hands free
• We serve 475,000,000+ predictions per day
28. Fantastic ML apps HOWTO
• Define appropriate level of abstraction
• Use types to express it
• Automate everything:
– feature engineering & selection
– model selection
– hyperparameter tuning
– etc.
Months -> Hours
29. Further exploration
Talks @ Scale By The Bay 2017:
• “Real Time ML Pipelines in Multi-Tenant Environments” by
Karl Skucha and Yan Yang
• “Fireworks - lighting up the sky with millions of Sparks” by Thomas Gerber
• “Functional Linear Algebra in Scala” by Vlad Patryshev
• “Complex Machine Learning Pipelines Made Easy” by Chris
Rupley and Till Bergmann
• “Just enough DevOps for data scientists” by Anya Bida