Building efficient machine learning applications is not a simple task. The typical engineering process iterates over data wrangling, feature generation, model selection, hyperparameter tuning, and evaluation. The number of possible combinations of input features, algorithms, and parameters makes this process too complex for even experts to perform efficiently. Automating it is especially important when building machine learning applications for thousands of customers. In this talk I demonstrate how we build effective ML models using the AutoML capabilities we develop at Salesforce, which include techniques for automatic data processing, feature generation, model selection, hyperparameter tuning, and evaluation. I present several of the implemented solutions in Scala and Spark.
2. “This lonely scene, the galaxies like dust, is what most of space looks like. This emptiness is normal. The richness of our own neighborhood is the exception.”
– Powers of Ten (1977), by Charles and Ray Eames
3. Powers of Ten (1977)
A journey between a quark and the observable universe, spanning scales from 10^-17 to 10^24 meters.
4. Powers of Ten for Machine Learning
• Data collection
• Data preparation
• Feature engineering
• Feature selection
• Sampling
• Algorithm implementation
• Hyperparameter tuning
• Model selection
• Model serving (scoring)
• Prediction insights
• Metrics
5. How long does it take to build a machine learning application?
a) Hours
b) Days
c) Weeks
d) Months
e) More
6. How to cope with this complexity?
E = mc²
Free[F[_], A]
M[A]
Functor[F[_]]
Cofree[S[_], A]
Months -> Hours
7. “The task of the software development team is to engineer the illusion of simplicity.”
– Grady Booch
9. Appropriate Level of Abstraction
A language’s syntax & semantics define its degrees of freedom.
• Lower abstraction: difficult to use, more complex, error prone
• Higher abstraction: less flexible, simpler syntax, encourages reuse, suitable for complex problems
So where is the appropriate level?
10. “FP removes one important dimension of complexity: to understand a program part (a function) you need no longer account for the possible histories of executions that can lead to that program part.”
– Martin Odersky
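Odersky’s point can be made concrete with a tiny illustrative Scala example (not from the talk): the pure function’s result is fully determined by its arguments, while the impure one depends on the history of previous calls.

```scala
object Purity {
  // Pure: the same input always yields the same output,
  // so no execution history needs to be considered.
  def addPure(x: Int, y: Int): Int = x + y

  // Impure: the result depends on how many times it was called before.
  private var calls = 0
  def addImpure(x: Int, y: Int): Int = { calls += 1; x + y + calls }
}
```

Reasoning about `addPure` needs only its arguments; reasoning about `addImpure` requires knowing every call that came before it.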
14. Optimus Prime
An AutoML library for building modular,
reusable, strongly typed ML workflows on Spark
• Declarative & intuitive syntax
• Proper level of abstraction
• Aimed at simplicity & reuse
• >90% accuracy with 100X reduction in time
16. Type Safety Everywhere
• Value Operations
• Feature Operations
• Transformation Pipelines (aka Workflow)
// Typed value operations
def tokenize(t: Text): TextList = t.map(_.split(" ")).toTextList

// Typed feature operations
val title: Feature[Text] =
  FeatureBuilder.Text[Book].extract(_.title).asPredictor
val tokens: Feature[TextList] = title.map(tokenize)

// Transformation pipelines
new OpWorkflow().setInput(books).setResultFeatures(tokens.vectorize())
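For intuition, here is a minimal, Spark-free sketch of the same idea; the `Text`/`TextList` case classes and the `Ops` object are hypothetical stand-ins, not the actual Optimus Prime types:

```scala
// Hypothetical typed value wrappers: a nullable text field
// and a list of tokens.
final case class Text(value: Option[String])
final case class TextList(value: Seq[String])

object Ops {
  // Typed value operation: Text in, TextList out. Passing any
  // other type here is a compile-time error, not a runtime one.
  def tokenize(t: Text): TextList =
    TextList(t.value.map(_.split(" ").toSeq).getOrElse(Seq.empty))
}
```

The payoff is that type mismatches in a pipeline are rejected by the compiler before any data is touched.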
17. Book Price Prediction
// Raw feature definitions
val authr = FeatureBuilder.PickList[Book].extract(_.author).asPredictor
val title = FeatureBuilder.Text[Book].extract(_.title).asPredictor
val descr = FeatureBuilder.Text[Book].extract(_.description).asPredictor
val price = FeatureBuilder.RealNN[Book].extract(_.price).asResponse

// Feature engineering: tokenize, tf-idf etc.
val tokns = (title + descr).tokenize(removePunctuation = true)
val tfidf = tokns.tf(numTerms = 1024).idf(minFreq = 0)
val feats = Seq(tfidf, authr).vectorize()

// Model training
implicit val spark = SparkSession.builder.config(new SparkConf).getOrCreate
val books = spark.read.csv("books.csv").as[Book]
val preds = RegressionModelSelector().setInput(price, feats).getOutput
new OpWorkflow().setInput(books).setResultFeatures(feats, preds).train()
18. Magic Behind “vectorize()”
// Raw feature definitions
val authr = FeatureBuilder.PickList[Book].extract(_.author).asPredictor
val title = FeatureBuilder.Text[Book].extract(_.title).asPredictor
val descr = FeatureBuilder.Text[Book].extract(_.description).asPredictor
val price = FeatureBuilder.RealNN[Book].extract(_.price).asResponse

// Feature engineering: tokenize, tf-idf etc.
val tokns = (title + descr).tokenize(removePunctuation = true)
val tfidf = tokns.tf(numTerms = 1024).idf(minFreq = 0)
val feats = Seq(tfidf, authr).vectorize() // <- magic here

// Model training
implicit val spark = SparkSession.builder.config(new SparkConf).getOrCreate
val books = spark.read.csv("books.csv").as[Book]
val preds = RegressionModelSelector().setInput(price, feats).getOutput
new OpWorkflow().setInput(books).setResultFeatures(feats, preds).train()
20. Automatic Feature Engineering
• Numeric: Imputation, Track null value, Log transformation for large range, Scaling (zNormalize), Smart Binning
• Categorical: Imputation, Track null value, One Hot Encoding, Dynamic Top K pivot, Smart Binning, LabelCount Encoding, Category Embedding
• Text: Tokenization, Hash Encoding, TF-IDF, Word2Vec, Sentiment Analysis, Language Detection
• Temporal: Time difference, Time Binning, Time extraction (day, week, month, year), Closeness to major events
• Spatial: Geo-encoding, Augment with external data (e.g. avg income), Spatial fraudulent behavior (e.g. impossible travel speed)
• More…
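As one concrete illustration of the text path above (tokenization followed by hash encoding with term frequencies), here is a plain-Scala sketch; the `TextVectorizer` object is hypothetical, not the library’s implementation:

```scala
object TextVectorizer {
  // Lowercase and split on non-word characters.
  def tokenize(s: String): Seq[String] =
    s.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq

  // Hash each token into one of `numBins` buckets and count
  // occurrences, producing a fixed-width term-frequency vector.
  def hashTF(tokens: Seq[String], numBins: Int): Array[Double] = {
    val vec = Array.fill(numBins)(0.0)
    tokens.foreach { t =>
      val bin = ((t.hashCode % numBins) + numBins) % numBins
      vec(bin) += 1.0
    }
    vec
  }
}
```

Hashing keeps the vector width fixed regardless of vocabulary size, at the cost of occasional bucket collisions.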
21. Automatic Feature Selection
• Analyze features & calculate statistics
• Ensure features have acceptable ranges
• Is this feature a leaker?
• Does this feature help our model? Is it
predictive?
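The leakage and low-signal checks can be sketched with plain statistics. The following illustrative `SanityCheck` object (not the library’s code) flags a feature whose Pearson correlation with the label is suspiciously high, or whose variance is too low to carry signal; the thresholds here are illustrative defaults:

```scala
object SanityCheck {
  // Pearson correlation coefficient between two equal-length sequences.
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    val n = xs.size
    val mx = xs.sum / n
    val my = ys.sum / n
    val cov = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
    val sx = math.sqrt(xs.map(x => (x - mx) * (x - mx)).sum)
    val sy = math.sqrt(ys.map(y => (y - my) * (y - my)).sum)
    cov / (sx * sy)
  }

  // A feature is "bad" if it is nearly constant (no signal) or
  // correlates with the label almost perfectly (likely a leaker).
  def isBadFeature(feature: Seq[Double], label: Seq[Double],
                   maxCorrelation: Double = 0.95,
                   minVariance: Double = 1e-5): Boolean = {
    val mean = feature.sum / feature.size
    val variance =
      feature.map(x => (x - mean) * (x - mean)).sum / feature.size
    variance < minVariance || math.abs(pearson(feature, label)) > maxCorrelation
  }
}
```

A feature that perfectly predicts the label usually means label information leaked into the inputs, so it is removed rather than celebrated.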
22. Automatic Feature Selection
// Sanity check your features against the label
val checked = price.check(
  featureVector = feats,
  checkSample = 0.3,
  sampleSeed = 1L,
  sampleLimit = 100000L,
  maxCorrelation = 0.95,
  minCorrelation = 0.0,
  correlationType = Pearson,
  minVariance = 0.00001,
  removeBadFeatures = true
)

new OpWorkflow().setInput(books).setResultFeatures(checked, preds).train()
23. Automatic Model Selection
• Multiple algorithms to pick from
• Many hyperparameters for each algorithm
• Automated hyperparameter tuning
– Faster model creation with improved metrics
– Search algorithms to find optimal hyperparameters, e.g. grid search, random search, bandit methods
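To make the tuning step concrete, here is an illustrative grid search in plain Scala over a toy one-parameter model; it sketches the idea, not the selector’s actual implementation:

```scala
object GridSearch {
  // Root-mean-squared error between predictions and labels.
  def rmse(preds: Seq[Double], labels: Seq[Double]): Double =
    math.sqrt(
      preds.zip(labels).map { case (p, y) => (p - y) * (p - y) }.sum
        / preds.size)

  // Toy "model": predict slope * x. Grid search tries every
  // candidate slope and keeps the one minimizing validation RMSE.
  def bestSlope(xs: Seq[Double], ys: Seq[Double],
                grid: Seq[Double]): Double =
    grid.minBy(s => rmse(xs.map(_ * s), ys))
}
```

Random search and bandit methods replace the exhaustive `grid` with smarter sampling, but the evaluate-and-keep-the-best loop is the same.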
24. Automatic Model Selection
// Model selection and hyperparameter tuning
val preds =
  RegressionModelSelector
    .withCrossValidation(
      dataSplitter = DataSplitter(reserveTestFraction = 0.1),
      numFolds = 3,
      validationMetric = Evaluators.Regression.rmse(),
      trainTestEvaluators = Seq.empty,
      seed = 1L)
    .setModelsToTry(LinearRegression, RandomForestRegression)
    .setLinearRegressionElasticNetParam(0, 0.5, 1)
    .setLinearRegressionMaxIter(10, 100)
    .setLinearRegressionSolver(Solver.LBFGS)
    .setRandomForestMaxDepth(2, 10)
    .setRandomForestNumTrees(10)
    .setInput(price, checked).getOutput

new OpWorkflow().setInput(books).setResultFeatures(checked, preds).train()
27. How well does it work?
• Most of our models deployed in production
are completely hands free
• We serve 475,000,000+ predictions per day
28. Fantastic ML apps HOWTO
• Define appropriate level of abstraction
• Use types to express it
• Automate everything:
– feature engineering & selection
– model selection
– hyperparameter tuning
– etc.
Months -> Hours
29. Further exploration
Talks @ Scale By The Bay 2017:
• “Real Time ML Pipelines in Multi-Tenant Environments” by
Karl Skucha and Yan Yang
• “Fireworks - lighting up the sky with millions of Sparks” by Thomas Gerber
• “Functional Linear Algebra in Scala” by Vlad Patryshev
• “Complex Machine Learning Pipelines Made Easy” by Chris
Rupley and Till Bergmann
• “Just enough DevOps for data scientists” by Anya Bida