Matthew Tovbin
Principal Engineer, Salesforce Einstein
mtovbin@salesforce.com
@tovbinm
Fantastic ML apps and how to
build them
“This lonely scene – the galaxies like
dust, is what most of space looks like.
This emptiness is normal. The richness
of our own neighborhood is the
exception.”
– Powers of Ten (1977), by Charles and Ray Eames
Powers of Ten (1977)
A journey between a quark and the observable universe
[10^-17, 10^24]
Powers of Ten for Machine Learning
•  Data collection
•  Data preparation
•  Feature engineering
•  Feature selection
•  Sampling
•  Algorithm implementation
•  Hyperparameter tuning
•  Model selection
•  Model serving (scoring)
•  Prediction insights
•  Metrics
How long does it take to build a machine learning application?
a)  Hours
b)  Days
c)  Weeks
d)  Months
e)  More
How to cope with this complexity?
E = mc^2


Free[F[_], A]

M[A]

Functor[F[_]]

Cofree[S[_], A]

Months -> Hours
“The task of the software development
team is to engineer the illusion of
simplicity.”
– Grady Booch
Complexity vs. Abstraction
Appropriate Level of Abstraction
Language
Syntax & Semantics
Degrees of Freedom
Lower Abstraction
Higher Abstraction
define
•  Less flexible
•  Simpler syntax
•  Reuse
•  Suitable for complex problems
•  Difficult to use
•  More complex
•  Error prone
???
“FP removes one important dimension
of complexity:
To understand a program part (a
function) you need no longer account
for the possible histories of executions
that can lead to that program part.”
– Martin Odersky
Functional Approach
•  Type-safe
•  No side effects
•  Composability
•  Concise
•  Fine-grained control
// Extracting URL features
def urlFeatures(s: String): (Text, Text) = {
  val url = Url(s)
  url.protocol -> url.domain
}
Seq("http://einstein.com", "").map(urlFeatures)

> Seq((Text("http"), Text("einstein.com")),
      (Text(), Text()))
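The snippet above depends on the library's own `Url` and `Text` types. As a rough, self-contained sketch of the same idea, the version below assumes a hypothetical `Text` wrapper around `Option[String]` and uses `java.net.URI` for parsing; neither is claimed to be how the library implements it.

```scala
import java.net.URI
import scala.util.Try

object UrlFeaturesSketch {
  // Hypothetical stand-in for the library's Text type: a possibly empty string value.
  final case class Text(value: Option[String])

  // Pure function: no side effects, and an unparsable or empty input maps to empty values.
  def urlFeatures(s: String): (Text, Text) = {
    val uri = Try(new URI(s)).toOption
    val protocol = uri.flatMap(u => Option(u.getScheme)) // getScheme may return null
    val domain = uri.flatMap(u => Option(u.getHost))     // getHost may return null
    (Text(protocol), Text(domain))
  }
}
```

Because the function is total and free of side effects, it can be mapped over a collection, as the slide does, without any per-element error handling.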
Object-oriented Approach
•  Modularity
•  Code reuse
•  Polymorphism
// Extracting text features
val txt = Seq(
  Url("http://einstein.com"),
  Base64("b25lIHR3byB0aHJlZQ=="),
  Text("Hello world!"),
  Phone("650-123-4567"),
  Text.empty
)
txt.map(_.tokenize)

> Seq(
    TextList("http", "einstein.com"),
    TextList("one", "two", "three"),
    TextList("Hello", "world"),
    TextList("+1", "650", "1234567"),
    TextList()
  )
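The polymorphism in the example above can be approximated with a small class hierarchy in which each subtype overrides `tokenize`. The classes below are hypothetical simplifications written for illustration (the `Phone` case is left out), not the library's actual types.

```scala
import java.nio.charset.StandardCharsets
import java.util.{Base64 => JBase64}

object TokenizeSketch {
  // Hypothetical simplified base type holding a possibly empty string.
  class Text(val value: Option[String]) {
    // Default behaviour: split on anything that is not a letter or a digit.
    def tokenize: List[String] =
      value.toList.flatMap(_.split("[^\\p{L}\\p{N}]+").toList.filter(_.nonEmpty))
  }

  class Base64(v: Option[String]) extends Text(v) {
    // Decode the payload first, then reuse the default tokenizer.
    override def tokenize: List[String] = {
      val decoded = value.map(s => new String(JBase64.getDecoder.decode(s), StandardCharsets.UTF_8))
      new Text(decoded).tokenize
    }
  }

  class Url(v: Option[String]) extends Text(v) {
    // Keep the protocol and the remainder whole instead of splitting on dots.
    override def tokenize: List[String] =
      value.toList.flatMap(_.split("://", 2).toList.filter(_.nonEmpty))
  }
}
```

A caller holding a `Seq[Text]` can then call `tokenize` on every element without knowing the concrete subtype.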
Why Scala?
•  Combines FP & OOP
•  Strongly-typed
•  Expressive
•  Concise
•  Fun (mostly)
•  Default for Spark
Optimus Prime
An AutoML library for building modular,
reusable, strongly typed ML workflows on Spark
•  Declarative & intuitive syntax
•  Proper level of abstraction
•  Aimed at simplicity & reuse
•  >90% accuracy with 100X reduction in time
Types Hide the Complexity
[Figure: FeatureType hierarchy diagram. Types shown: FeatureType, OPNumeric, OPCollection, OPSet, OPList, NonNullable, Text, Email, Base64, Phone, ID, URL, ComboBox, PickList, TextArea, OPVector, OPMap, BinaryMap, IntegralMap, RealMap, DateList, DateTimeList, Integral, Real, Binary, Percent, Currency, Date, DateTime, MultiPickList, TextMap, …, TextList, City, Street, Country, PostalCode, Location, State, Geolocation, StateMap, SingleResponse, RealNN, Categorical, MultiResponse]
Legend: bold - abstract type, normal - concrete type, italic - trait, solid line - inheritance, dashed line - trait mixin
Type Safety Everywhere
•  Value Operations
•  Feature Operations
•  Transformation Pipelines (aka Workflow)
// Typed value operations
def tokenize(t: Text): TextList = t.map(_.split(" ")).toTextList

// Typed feature operations
val title: Feature[Text] =
  FeatureBuilder.Text[Book].extract(_.title).asPredictor
val tokens: Feature[TextList] = title.map(tokenize)

// Transformation pipelines
new OpWorkflow().setInput(books).setResultFeatures(tokens.vectorize())
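To illustrate why typed feature operations catch mistakes at compile time, here is a minimal sketch that models a feature as a named, typed extractor. The `Feature[R, A]` class, the `Book` record and the `tokenize` helper are assumptions made for this example and are much simpler than the library's `Feature` and `FeatureBuilder`.

```scala
object TypedFeatureSketch {
  // Hypothetical minimal model: a feature is a named, typed extractor over records of type R.
  final case class Feature[R, A](name: String, extract: R => A) {
    // Deriving a feature composes functions, so input and output types must line up.
    def map[B](f: A => B): Feature[R, B] = Feature[R, B](name, extract.andThen(f))
  }

  // Hypothetical record type for the example.
  final case class Book(title: String, price: Double)

  def tokenize(t: String): List[String] = t.split(" ").toList.filter(_.nonEmpty)

  val title: Feature[Book, String] = Feature[Book, String]("title", _.title)
  val tokens: Feature[Book, List[String]] = title.map(tokenize)
  // title.map((d: Double) => d * 2) would be rejected by the compiler: title yields String.
}
```

Under these assumptions a transformation applied to the wrong feature type fails during compilation rather than at training time.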
Book Price Prediction
// Raw feature definitions
val authr = FeatureBuilder.PickList[Book].extract(_.author).asPredictor
val title = FeatureBuilder.Text[Book].extract(_.title).asPredictor
val descr = FeatureBuilder.Text[Book].extract(_.description).asPredictor
val price = FeatureBuilder.RealNN[Book].extract(_.price).asResponse

// Feature engineering: tokenize, tfidf etc.
val tokns = (title + descr).tokenize(removePunctuation = true)
val tfidf = tokns.tf(numTerms = 1024).idf(minFreq = 0)
val feats = Seq(tfidf, authr).vectorize()

// Model training
implicit val spark = SparkSession.builder.config(new SparkConf).getOrCreate
val books = spark.read.csv("books.csv").as[Book]
val preds = RegressionModelSelector().setInput(price, feats).getOutput
new OpWorkflow().setInput(books).setResultFeatures(feats, preds).train()
Magic Behind “vectorize()”
// Raw feature definitions
val authr = FeatureBuilder.PickList[Book].extract(_.author).asPredictor
val title = FeatureBuilder.Text[Book].extract(_.title).asPredictor
val descr = FeatureBuilder.Text[Book].extract(_.description).asPredictor
val price = FeatureBuilder.RealNN[Book].extract(_.price).asResponse

// Feature engineering: tokenize, tfidf etc.
val tokns = (title + descr).tokenize(removePunctuation = true)
val tfidf = tokns.tf(numTerms = 1024).idf(minFreq = 0)
val feats = Seq(tfidf, authr).vectorize() // <- magic here

// Model training
implicit val spark = SparkSession.builder.config(new SparkConf).getOrCreate
val books = spark.read.csv("books.csv").as[Book]
val preds = RegressionModelSelector().setInput(price, feats).getOutput
new OpWorkflow().setInput(books).setResultFeatures(feats, preds).train()
Automatic Feature Engineering
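[Figure: the raw features Zipcode, Subject, Phone, Email and Age are expanded into derived features (Age [0-15], Age [15-35], Age [>35], Email Is Spammy, Top Email Domains, Country Code, Phone Is Valid, Top TF-IDF Terms, Average Income), which are combined into a single Vector]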
Automatic Feature Engineering
Numeric
•  Imputation
•  Track null value
•  Log transformation for large range
•  Scaling - zNormalize
•  Smart Binning
Categorical
•  Imputation
•  Track null value
•  One Hot Encoding
•  Dynamic Top K pivot
•  Smart Binning
•  LabelCount Encoding
•  Category Embedding
Spatial
•  Augment with external data, e.g. avg income
•  Spatial fraudulent behavior, e.g. impossible travel speed
•  Geo-encoding
Temporal
•  Time difference
•  Time Binning
•  Time extraction (day, week, month, year)
•  Closeness to major events
Text
•  Tokenization
•  Hash Encoding
•  TF-IDF
•  Word2Vec
•  Sentiment Analysis
•  Language Detection
More…
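Two of the numeric techniques listed above, imputation with null tracking and binning, can be sketched in a few lines. The function names, the mean-imputation choice and the bucket boundaries are assumptions for illustration; the slides do not say how the library implements these steps.

```scala
object NumericVectorizeSketch {
  // Mean-impute missing values and append a 0/1 "was null" indicator to each row.
  def imputeWithNullTracking(xs: Seq[Option[Double]]): Seq[Array[Double]] = {
    val present = xs.flatten
    val mean = if (present.isEmpty) 0.0 else present.sum / present.size
    xs.map {
      case Some(x) => Array(x, 0.0)
      case None    => Array(mean, 1.0)
    }
  }

  // One-hot encode a value into half-open buckets [lo, hi) defined by consecutive splits.
  def bucketize(x: Double, splits: Seq[Double]): Array[Double] = {
    require(splits.size >= 2, "need at least two split points")
    val idx = splits.sliding(2).indexWhere { case Seq(lo, hi) => x >= lo && x < hi }
    Array.tabulate(splits.size - 1)(i => if (i == idx) 1.0 else 0.0)
  }
}
```

For example, `bucketize(20.0, Seq(0.0, 15.0, 35.0, Double.PositiveInfinity))` marks the middle bucket, in the spirit of the Age [15-35] column in the earlier figure.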
Automatic Feature Selection
•  Analyze features & calculate statistics
•  Ensure features have acceptable ranges
•  Is this feature a leaker?
•  Does this feature help our model? Is it predictive?
Automatic Feature Selection
// Sanity check your features against the label
val checked = price.check(
  featureVector = feats,
  checkSample = 0.3,
  sampleSeed = 1L,
  sampleLimit = 100000L,
  maxCorrelation = 0.95,
  minCorrelation = 0.0,
  correlationType = Pearson,
  minVariance = 0.00001,
  removeBadFeatures = true
)

new OpWorkflow().setInput(books).setResultFeatures(checked, preds).train()
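The thresholds in the call above suggest a simple rule: drop features that barely vary, and drop features whose correlation with the label is suspiciously high (possible leakage) or below a minimum. The sketch below is one plausible reading of that rule using plain collections; the `keep` function and its column-map input are hypothetical and are not the library's implementation.

```scala
object SanityCheckSketch {
  private def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Population variance of a column.
  def variance(xs: Seq[Double]): Double = {
    val m = mean(xs)
    xs.map(x => (x - m) * (x - m)).sum / xs.size
  }

  // Pearson correlation; returns 0.0 when either side is constant.
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    val (mx, my) = (mean(xs), mean(ys))
    val cov = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum / xs.size
    val denom = math.sqrt(variance(xs) * variance(ys))
    if (denom == 0.0) 0.0 else cov / denom
  }

  // Names of the columns that pass the variance and correlation checks.
  def keep(columns: Map[String, Seq[Double]], label: Seq[Double],
           maxCorrelation: Double = 0.95, minCorrelation: Double = 0.0,
           minVariance: Double = 0.00001): Set[String] =
    columns.filter { case (_, xs) =>
      val c = math.abs(pearson(xs, label))
      variance(xs) >= minVariance && c >= minCorrelation && c <= maxCorrelation
    }.keySet.toSet
}
```

With a label of 1, 2, 3, 4, a column identical to the label is rejected as a likely leaker, a constant column is rejected for low variance, and a noisy but related column is kept.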
Automatic Model Selection
•  Multiple algorithms to pick from
•  Many hyperparameters for each algorithm
•  Automated hyperparameter tuning
–  Faster model creation with improved metrics
–  Search algorithms to find the optimal hyperparameters, e.g. grid search, random search, bandit methods
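Grid search, the first of the search algorithms mentioned above, can be sketched as a Cartesian product over candidate values followed by picking the best-scoring combination. The names below and the idea of passing the scoring function as a parameter are assumptions for illustration; in practice the score would come from cross-validated training.

```scala
object GridSearchSketch {
  type Params = Map[String, Double]

  // Cartesian product of all candidate values, one Map per combination.
  def grid(space: Map[String, Seq[Double]]): Seq[Params] =
    space.foldLeft(Seq(Map.empty[String, Double])) { case (acc, (name, values)) =>
      for (params <- acc; v <- values) yield params + (name -> v)
    }

  // Evaluate every combination and keep the one with the lowest score (e.g. RMSE).
  def search(space: Map[String, Seq[Double]])(score: Params => Double): (Params, Double) =
    grid(space).map(p => p -> score(p)).minBy(_._2)
}
```

The number of combinations grows multiplicatively with each added hyperparameter, which is one reason random search and bandit methods are listed as alternatives.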
Automatic Model Selection
// Model selection and hyperparameter tuning
val preds =
  RegressionModelSelector
    .withCrossValidation(
      dataSplitter = DataSplitter(reserveTestFraction = 0.1),
      numFolds = 3,
      validationMetric = Evaluators.Regression.rmse(),
      trainTestEvaluators = Seq.empty,
      seed = 1L)
    .setModelsToTry(LinearRegression, RandomForestRegression)
    .setLinearRegressionElasticNetParam(0, 0.5, 1)
    .setLinearRegressionMaxIter(10, 100)
    .setLinearRegressionSolver(Solver.LBFGS)
    .setRandomForestMaxDepth(2, 10)
    .setRandomForestNumTrees(10)
    .setInput(price, checked).getOutput

new OpWorkflow().setInput(books).setResultFeatures(checked, preds).train()
Automatic Model Selection
Demo
How well does it work?
•  Most of our models deployed in production are completely hands free
•  We serve 475,000,000+ predictions per day
Fantastic ML apps HOWTO
•  Define appropriate level of abstraction
•  Use types to express it
•  Automate everything:
–  feature engineering & selection
–  model selection
–  hyperparameter tuning
–  Etc.
Months -> Hours
Further exploration
Talks @ Scale By The Bay 2017:
•  “Real Time ML Pipelines in Multi-Tenant Environments” by Karl Skucha and Yan Yang
•  “Fireworks - lighting up the sky with millions of Sparks” by Thomas Gerber
•  “Functional Linear Algebra in Scala” by Vlad Patryshev
•  “Complex Machine Learning Pipelines Made Easy” by Chris Rupley and Till Bergmann
•  “Just enough DevOps for data scientists” by Anya Bida
We are hiring!
einstein-recruiting@salesforce.com
Thank You
