SlideShare a Scribd company logo
1 of 35
Download to read offline
Implementing AutoML
Techniques at Salesforce Scale
mtovbin@salesforce.com, @tovbinm
Matthew Tovbin, Principal Engineer, Einstein
Forward Looking Statement
Statement under the Private Securities Litigation Reform Act of 1995:
This presentation may contain forward-looking statements that involve risks, uncertainties, and assumptions. If any such uncertainties
materialize or if any of the assumptions proves incorrect, the results of salesforce.com, inc. could differ materially from the results
expressed or implied by the forward-looking statements we make. All statements other than statements of historical fact could be
deemed forward-looking, including any projections of product or service availability, subscriber growth, earnings, revenues, or other
financial items and any statements regarding strategies or plans of management for future operations, statements of belief, any
statements concerning new, planned, or upgraded services or technology developments and customer contracts or use of our services.
The risks and uncertainties referred to above include – but are not limited to – risks associated with developing and delivering new
functionality for our service, new products and services, our new business model, our past operating losses, possible fluctuations in our
operating results and rate of growth, interruptions or delays in our Web hosting, breach of our security measures, the outcome of any
litigation, risks associated with completed and any possible mergers and acquisitions, the immature market in which we operate, our
relatively limited operating history, our ability to expand, retain, and motivate our employees and manage our growth, new releases of our
service and successful customer deployment, our limited history reselling non-salesforce.com products, and utilization and selling to
larger enterprise customers. Further information on potential factors that could affect the financial results of salesforce.com, inc. is
included in our annual report on Form 10-K for the most recent fiscal year and in our quarterly report on Form 10-Q for the most recent
fiscal quarter. These documents and others containing important disclosures are available on the SEC Filings section of the Investor
Information section of our Web site.
Any unreleased services or features referenced in this or other presentations, press releases or public statements are not currently
available and may not be delivered on time or at all. Customers who purchase our services should make the purchase decisions based
upon features that are currently available. Salesforce.com, inc. assumes no obligation and does not intend to update these forward-
looking statements.
“This lonely scene – the galaxies like
dust, is what most of space looks like.
This emptiness is normal. The richness
of our own neighborhood is the
exception.”
– Powers of Ten (1977), by Charles and Ray Eames
Image credit: NASA
A travel between a quark
and the observable universe
[10-17, 1024]
Powers of Ten (1977)
Image credit: Powers of Ten, Charles and Ray Eames
• Data collection
• Data preparation
• Feature engineering
• Feature selection
• Sampling
• Algorithm implementation
• Hyperparameter tuning
• Model selection
• Model serving (scoring)
• Prediction insights
• Metrics
Powers of Ten for Machine Learning
Image credit: Powers of Ten, Charles and Ray Eames
1. Define requirements & outcomes
2. Collect & explore data
3. Define hypothesis
4. Prepare data
5. Build prototype
6. Verify hypothesis
7. Train & deploy model
Steps for Building Machine Learning Application
Image credit: Bethesda Software
Multi-cloud and Multi-tenant
Make an intelligent guess
How long does it take to build a machine
learning application?
a. Hours
b. Days
c. Weeks
d. Months
e. More
Image credit: NASA
Einstein Prediction Builder
• Product: Point.
Click. Predict.
• Engineering: any
customer can
create any number
of ML applications
on any data, eh?!
How to cope with this complexity?
E = mc2
Free[F[_], A]
M[A]
Functor[F[_]]
Cofree[S[_], A]
Months -> Hours
“The task of the software development
team is to engineer the illusion of
simplicity.”
– Grady Booch
Image credit: Object-Oriented Analysis and Design with Applications, by Grady Booch
Complexity vs Abstraction
Image credit: Wikipedia
Appropriate Level of Abstraction
Language
Syntax &
Semantics
Degrees of
Freedom
Lower
Abstraction
Higher
Abstraction
define
higher
low
er
• Less flexible
• Simpler syntax
• Reuse
• Suitable for
complex problems
• Difficult to use
• More complex
• Error prone
???
Codename: Optimus Prime
An AutoML library for building modular, reusable, strongly
typed ML workflows on Spark
• Declarative & intuitive syntax
• Proper level of abstraction
• Aimed for simplicity & reuse
• >90% accuracy with 100X reduction
in time
Types Hide the Complexity
FeatureType
OPNumeric OPCollection
OPSetOPList
NonNullable
Text
Email
Base64
Phone
ID
URL
ComboBox
PickList
TextArea
OPVector OPMap
BinaryMap
IntegralMap
RealMap
DateList
DateTimeList
Integral
Real
Binary
Percent
Currency
Date
DateTime
MultiPickList TextMap
TextListCity
Street
Country
PostalCode
Location
State
Geolocation
StateMap
SingleResponse
RealNN
Categorical
MultiResponse
Legend: bold - abstract type, regular - concrete type, italic - trait, solid line - inheritance, dashed line - trait mixin
...
CityMap
https://developer.salesforce.com/docs/atlas.en-us.api.meta/api/field_types.htm
Type Safety Everywhere
• Value operations
• Feature operations
• Transformation Pipelines (aka Workflows)
// Typed value operations
def tokenize(t: Text): TextList = t.map(_.split(" ")).toTextList
// Typed feature operations
val title = FeatureBuilder.Text[Book].extract(_.title).asPredictor
val tokens: Feature[TextList] = title.map(tokenize)
// Transformation pipelines
new OpWorkflow().setInput(books).setResultFeatures(tokens.vectorize())
Book Price Prediction
// Raw feature definitions: author, title, description, price
val authr = FeatureBuilder.PickList[Book].extract(_.author).asPredictor
val title = FeatureBuilder.Text[Book].extract(_.title).asPredictor
val descr = FeatureBuilder.Text[Book].extract(_.description).asPredictor
val price = FeatureBuilder.RealNN[Book].extract(_.price).asResponse
// Feature engineering: tokenize, tf-idf, vectorize, sanity check
val tokns = (title + description).tokenize(removePunctuation = true)
val tfidf = tokns.tf(numTerms = 1024).idf(minFreq = 0)
val feats = Seq(tfidf, author).transmogrify()
val chckd = feats.sanityCheck(price, removeBadFeatures = true)
// Model training: spin up spark, load books.csv, define model -> train
implicit val spark = SparkSession.builder.config(new SparkConf).getOrCreate
val books = spark.read.csv("books.csv").as[Book]
val preds = RegressionModelSelector().setInput(price, chckd).getOutput
new OpWorkflow().setInput(books).setResultFeatures(chckd, preds)
.withRawFeatureFilter().train()
Book Price Prediction
// Raw feature definitions: author, title, description, price
val authr = FeatureBuilder.PickList[Book].extract(_.author).asPredictor
val title = FeatureBuilder.Text[Book].extract(_.title).asPredictor
val descr = FeatureBuilder.Text[Book].extract(_.description).asPredictor
val price = FeatureBuilder.RealNN[Book].extract(_.price).asResponse
// Feature engineering: tokenize, tf-idf, vectorize, sanity check
val tokns = (title + description).tokenize(removePunctuation = true)
val tfidf = tokns.tf(numTerms = 1024).idf(minFreq = 0)
val feats = Seq(tfidf, author).transmogrify() // <- magic here
val chckd = feats.sanityCheck(price, removeBadFeatures = true)
// Model training: spin up spark, load books.csv, define model -> train
implicit val spark = SparkSession.builder.config(new SparkConf).getOrCreate
val books = spark.read.csv("books.csv").as[Book]
val preds = RegressionModelSelector().setInput(price, chckd).getOutput
new OpWorkflow().setInput(books).setResultFeatures(chckd, preds)
.withRawFeatureFilter().train()
Workflow
Book Price Prediction
BookBookBooks
Author
Title
Description
Price
FeatureVector
Transmogrify
SanityChecker
RawFeatureFilter
Regression Model
Selector
Best Model
Evaluator Cross
Validation
+ Tokenize
Pivot
TF IDF
Vectors
Combiner
AutoML with Optimus Prime
• Automated Feature Engineering
• Automated Feature Selection
• Automated Model Selection
Automatic Feature Engineering
ZipcodeSubjectPhoneEmail Age
Age
[0-15]
Age
[15-35]
Age
[>35]
Email Is
Spammy
Top Email
Domains
Country
Code
Phone
Is Valid
Top TF-
IDF Terms
Average
Income
Feature Vector
Automatic Feature Engineering
Numeric Categorical SpatialTemporal
Augment with
external data e.g avg
income
Spatial fraudulent
behavior e.g:
impossible travel
speed
Geo-encoding
Text
Time difference
Circular Statistics
Time extraction (day,
week, month, year)
Closeness to major
events
Tokenization
Hash Encoding
TF-IDF
Word2Vec
Sentiment Analysis
Language Detection
Imputation
Track null value
One Hot Encoding
Dynamic Top K pivot
Smart Binning
LabelCount Encoding
Category Embedding
Imputation
Track null value
Log transformation
for large range
Scaling - zNormalize
Smart Binning
Metadata!
● The name of the feature
the column was made
from
● The name of the RAW
feature(s) the column was
made from
● Everything you did to get
the column
● Any grouping information
across columns
● Description of the value in
the column
Image credit: https://ontotext.com/knowledgehub/fundamentals/metadata-fundamental
Automatic Feature Selection
Analyze features & calculate statistics
• Min, max, mean, variance
• Correlations with label
• Contingency matrix
• Confidence & support
• Mutual information against label
• Point-wise mutual information against all label values
• Cramér’s V
• Etc.
Image credit: http://csbiology.com/blog/statistics-for-biologists-nature-com
Automatic Feature Selection
• Ensure features have acceptable ranges
• Is this feature a leaker?
• Does this feature help our model?
• Is it predictive?
Image credit: ://srcity.org/2252/Find-Fix-Leaks
Automatic Feature Selection
// Sanity check your features against the label
val checked = price.sanityCheck(
featureVector = feats,
checkSample = 0.3,
sampleSeed = 42L,
sampleLimit = 100000L,
maxCorrelation = 0.95,
minCorrelation = 0.0,
correlationType = Pearson,
minVariance = 0.00001,
removeBadFeatures = true
)
new OpWorkflow().setInput(books).setResultFeatures(checked, preds).train()
Automatic Model Selection
• Multiple algorithms to pick from
• Many hyperparameters for each algorithm
• Automated hyperparameter tuning
• Faster model creation with improved
metrics
• Search algorithms to find the optimal
hyperparameters. e.g. grid search, random
search, bandit methods
Automatic Model Selection
// Model selection and hyperparameter tuning
val preds =
RegressionModelSelector
.withCrossValidation(
dataSplitter = DataSplitter(reserveTestFraction = 0.1),
numFolds = 3,
validationMetric = Evaluators.Regression.rmse(),
trainTestEvaluators = Seq.empty,
seed = 42L)
.setModelsToTry(LinearRegression, RandomForestRegression)
.setLinearRegressionElasticNetParam(0, 0.5, 1)
.setLinearRegressionMaxIter(10, 100)
.setLinearRegressionSolver(Solver.LBFGS)
.setRandomForestMaxDepth(2, 10)
.setRandomForestNumTrees(10)
.setInput(price, checked).getOutput
new OpWorkflow().setInput(books).setResultFeatures(checked, preds).train()
Automatic Model Selection
Demo
Image credit: Wikipedia
How well does it work?
• Most of our models deployed in
production are completely hands free
• 2B+ predictions per day
• Define appropriate level of abstraction
• Use types to express it
• Automate everything
• feature engineering & selection
• model selection & evaluation
• Etc.
Takeaway
Months -> Hours
Join Optimus Prime pilot today!
• Open for pilot
• In production at Salesforce
• Piloted by large tech companies
• 100% Scala
https://sfdc.co/join-op-pilot
Talks for Further Exploration
• “When all the world’s data scientists are just not enough”, Shubha Nabar,
@Scale 17
• “Low touch Machine Learning”, Leah McGuire, Spark Summit 17
• “Fantastic ML apps and how to build them”, Matthew Tovbin, Scale By The
Bay 17
• “The Black Swan of perfectly interpretable models”, Leah McGuire and
Mayukh Bhaowal, QCon AI 18
Implement AutoML at Salesforce Scale with Optimus Prime

More Related Content

Similar to Implement AutoML at Salesforce Scale with Optimus Prime

Developing Offline Mobile Apps with Salesforce Mobile SDK SmartStore
Developing Offline Mobile Apps with Salesforce Mobile SDK SmartStoreDeveloping Offline Mobile Apps with Salesforce Mobile SDK SmartStore
Developing Offline Mobile Apps with Salesforce Mobile SDK SmartStoreTom Gersic
 
Design Patterns Every ISV Needs to Know (October 15, 2014)
Design Patterns Every ISV Needs to Know (October 15, 2014)Design Patterns Every ISV Needs to Know (October 15, 2014)
Design Patterns Every ISV Needs to Know (October 15, 2014)Salesforce Partners
 
Apex Nuances: Transitioning to Force.com Development
Apex Nuances: Transitioning to Force.com DevelopmentApex Nuances: Transitioning to Force.com Development
Apex Nuances: Transitioning to Force.com DevelopmentSalesforce Developers
 
Simplify Complex Data with Dynamic Image Formulas by Chandler Anderson
Simplify Complex Data with Dynamic Image Formulas by Chandler AndersonSimplify Complex Data with Dynamic Image Formulas by Chandler Anderson
Simplify Complex Data with Dynamic Image Formulas by Chandler AndersonSalesforce Admins
 
Eland: A Python client for data analysis and exploration
Eland: A Python client for data analysis and explorationEland: A Python client for data analysis and exploration
Eland: A Python client for data analysis and explorationElasticsearch
 
Quickly Create Data Sets for the Analytics Cloud
Quickly Create Data Sets for the Analytics CloudQuickly Create Data Sets for the Analytics Cloud
Quickly Create Data Sets for the Analytics CloudSalesforce Developers
 
Using AI for Providing Insights and Recommendations on Activity Data Alexis R...
Using AI for Providing Insights and Recommendations on Activity Data Alexis R...Using AI for Providing Insights and Recommendations on Activity Data Alexis R...
Using AI for Providing Insights and Recommendations on Activity Data Alexis R...Databricks
 
Making External Web Pages Interact With Visualforce
Making External Web Pages Interact With VisualforceMaking External Web Pages Interact With Visualforce
Making External Web Pages Interact With VisualforceSalesforce Developers
 
Designing an Enterprise CSS Framework is Hard, Stephanie Rewis
Designing an Enterprise CSS Framework is Hard, Stephanie RewisDesigning an Enterprise CSS Framework is Hard, Stephanie Rewis
Designing an Enterprise CSS Framework is Hard, Stephanie RewisFuture Insights
 
Transition to the Lightning Experience: Pro Tips, Tools and a Transition Stra...
Transition to the Lightning Experience: Pro Tips, Tools and a Transition Stra...Transition to the Lightning Experience: Pro Tips, Tools and a Transition Stra...
Transition to the Lightning Experience: Pro Tips, Tools and a Transition Stra...Shell Black
 
Building einstein analytics apps uk-compressed
Building einstein analytics apps   uk-compressedBuilding einstein analytics apps   uk-compressed
Building einstein analytics apps uk-compressedrikkehovgaard
 
Hands-On Workshop: Introduction to Coding for on Force.com for Admins and Non...
Hands-On Workshop: Introduction to Coding for on Force.com for Admins and Non...Hands-On Workshop: Introduction to Coding for on Force.com for Admins and Non...
Hands-On Workshop: Introduction to Coding for on Force.com for Admins and Non...Salesforce Developers
 
Best Practices for Lightning Apps
Best Practices for Lightning AppsBest Practices for Lightning Apps
Best Practices for Lightning AppsMark Adcock
 
JavaScript Patterns and Practices from the Salesforce Experts
JavaScript Patterns and Practices from the Salesforce ExpertsJavaScript Patterns and Practices from the Salesforce Experts
JavaScript Patterns and Practices from the Salesforce ExpertsSalesforce Developers
 
Mds cloud saturday 2015 salesforce intro
Mds cloud saturday 2015 salesforce introMds cloud saturday 2015 salesforce intro
Mds cloud saturday 2015 salesforce introDavid Scruggs
 
Making Your Apex and Visualforce Reusable
Making Your Apex and Visualforce ReusableMaking Your Apex and Visualforce Reusable
Making Your Apex and Visualforce ReusableSalesforce Developers
 

Similar to Implement AutoML at Salesforce Scale with Optimus Prime (20)

From Force.com to Heroku and Back
From Force.com to Heroku and BackFrom Force.com to Heroku and Back
From Force.com to Heroku and Back
 
Developing Offline Mobile Apps with Salesforce Mobile SDK SmartStore
Developing Offline Mobile Apps with Salesforce Mobile SDK SmartStoreDeveloping Offline Mobile Apps with Salesforce Mobile SDK SmartStore
Developing Offline Mobile Apps with Salesforce Mobile SDK SmartStore
 
Design Patterns Every ISV Needs to Know (October 15, 2014)
Design Patterns Every ISV Needs to Know (October 15, 2014)Design Patterns Every ISV Needs to Know (October 15, 2014)
Design Patterns Every ISV Needs to Know (October 15, 2014)
 
Apex Nuances: Transitioning to Force.com Development
Apex Nuances: Transitioning to Force.com DevelopmentApex Nuances: Transitioning to Force.com Development
Apex Nuances: Transitioning to Force.com Development
 
Beyond Custom Metadata Types
Beyond Custom Metadata TypesBeyond Custom Metadata Types
Beyond Custom Metadata Types
 
Simplify Complex Data with Dynamic Image Formulas by Chandler Anderson
Simplify Complex Data with Dynamic Image Formulas by Chandler AndersonSimplify Complex Data with Dynamic Image Formulas by Chandler Anderson
Simplify Complex Data with Dynamic Image Formulas by Chandler Anderson
 
Eland: A Python client for data analysis and exploration
Eland: A Python client for data analysis and explorationEland: A Python client for data analysis and exploration
Eland: A Python client for data analysis and exploration
 
Quickly Create Data Sets for the Analytics Cloud
Quickly Create Data Sets for the Analytics CloudQuickly Create Data Sets for the Analytics Cloud
Quickly Create Data Sets for the Analytics Cloud
 
Using AI for Providing Insights and Recommendations on Activity Data Alexis R...
Using AI for Providing Insights and Recommendations on Activity Data Alexis R...Using AI for Providing Insights and Recommendations on Activity Data Alexis R...
Using AI for Providing Insights and Recommendations on Activity Data Alexis R...
 
Making External Web Pages Interact With Visualforce
Making External Web Pages Interact With VisualforceMaking External Web Pages Interact With Visualforce
Making External Web Pages Interact With Visualforce
 
Designing an Enterprise CSS Framework is Hard, Stephanie Rewis
Designing an Enterprise CSS Framework is Hard, Stephanie RewisDesigning an Enterprise CSS Framework is Hard, Stephanie Rewis
Designing an Enterprise CSS Framework is Hard, Stephanie Rewis
 
Transition to the Lightning Experience: Pro Tips, Tools and a Transition Stra...
Transition to the Lightning Experience: Pro Tips, Tools and a Transition Stra...Transition to the Lightning Experience: Pro Tips, Tools and a Transition Stra...
Transition to the Lightning Experience: Pro Tips, Tools and a Transition Stra...
 
Building einstein analytics apps uk-compressed
Building einstein analytics apps   uk-compressedBuilding einstein analytics apps   uk-compressed
Building einstein analytics apps uk-compressed
 
Hands-On Workshop: Introduction to Coding for on Force.com for Admins and Non...
Hands-On Workshop: Introduction to Coding for on Force.com for Admins and Non...Hands-On Workshop: Introduction to Coding for on Force.com for Admins and Non...
Hands-On Workshop: Introduction to Coding for on Force.com for Admins and Non...
 
Lightning Components: The Future
Lightning Components: The FutureLightning Components: The Future
Lightning Components: The Future
 
Dev ops.enterprise.2014 (1)
Dev ops.enterprise.2014 (1)Dev ops.enterprise.2014 (1)
Dev ops.enterprise.2014 (1)
 
Best Practices for Lightning Apps
Best Practices for Lightning AppsBest Practices for Lightning Apps
Best Practices for Lightning Apps
 
JavaScript Patterns and Practices from the Salesforce Experts
JavaScript Patterns and Practices from the Salesforce ExpertsJavaScript Patterns and Practices from the Salesforce Experts
JavaScript Patterns and Practices from the Salesforce Experts
 
Mds cloud saturday 2015 salesforce intro
Mds cloud saturday 2015 salesforce introMds cloud saturday 2015 salesforce intro
Mds cloud saturday 2015 salesforce intro
 
Making Your Apex and Visualforce Reusable
Making Your Apex and Visualforce ReusableMaking Your Apex and Visualforce Reusable
Making Your Apex and Visualforce Reusable
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 

Recently uploaded (20)

Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 

Implement AutoML at Salesforce Scale with Optimus Prime

  • 1. Implementing AutoML Techniques at Salesforce Scale mtovbin@salesforce.com, @tovbinm Matthew Tovbin, Principal Engineer, Einstein
  • 2. Forward Looking Statement Statement under the Private Securities Litigation Reform Act of 1995: This presentation may contain forward-looking statements that involve risks, uncertainties, and assumptions. If any such uncertainties materialize or if any of the assumptions proves incorrect, the results of salesforce.com, inc. could differ materially from the results expressed or implied by the forward-looking statements we make. All statements other than statements of historical fact could be deemed forward-looking, including any projections of product or service availability, subscriber growth, earnings, revenues, or other financial items and any statements regarding strategies or plans of management for future operations, statements of belief, any statements concerning new, planned, or upgraded services or technology developments and customer contracts or use of our services. The risks and uncertainties referred to above include – but are not limited to – risks associated with developing and delivering new functionality for our service, new products and services, our new business model, our past operating losses, possible fluctuations in our operating results and rate of growth, interruptions or delays in our Web hosting, breach of our security measures, the outcome of any litigation, risks associated with completed and any possible mergers and acquisitions, the immature market in which we operate, our relatively limited operating history, our ability to expand, retain, and motivate our employees and manage our growth, new releases of our service and successful customer deployment, our limited history reselling non-salesforce.com products, and utilization and selling to larger enterprise customers. Further information on potential factors that could affect the financial results of salesforce.com, inc. is included in our annual report on Form 10-K for the most recent fiscal year and in our quarterly report on Form 10-Q for the most recent fiscal quarter. These documents and others containing important disclosures are available on the SEC Filings section of the Investor Information section of our Web site. Any unreleased services or features referenced in this or other presentations, press releases or public statements are not currently available and may not be delivered on time or at all. Customers who purchase our services should make the purchase decisions based upon features that are currently available. Salesforce.com, inc. assumes no obligation and does not intend to update these forward- looking statements.
  • 3. “This lonely scene – the galaxies like dust, is what most of space looks like. This emptiness is normal. The richness of our own neighborhood is the exception.” – Powers of Ten (1977), by Charles and Ray Eames Image credit: NASA
  • 4. A travel between a quark and the observable universe [10-17, 1024] Powers of Ten (1977) Image credit: Powers of Ten, Charles and Ray Eames
  • 5. • Data collection • Data preparation • Feature engineering • Feature selection • Sampling • Algorithm implementation • Hyperparameter tuning • Model selection • Model serving (scoring) • Prediction insights • Metrics Powers of Ten for Machine Learning Image credit: Powers of Ten, Charles and Ray Eames
  • 6. 1. Define requirements & outcomes 2. Collect & explore data 3. Define hypothesis 4. Prepare data 5. Build prototype 6. Verify hypothesis 7. Train & deploy model Steps for Building Machine Learning Application Image credit: Bethesda Software
  • 8. Make an intelligent guess How long does it take to build a machine learning application? a. Hours b. Days c. Weeks d. Months e. More Image credit: NASA
  • 9. Einstein Prediction Builder • Product: Point. Click. Predict. • Engineering: any customer can create any number of ML applications on any data, eh?!
  • 10. How to cope with this complexity? E = mc2 Free[F[_], A] M[A] Functor[F[_]] Cofree[S[_], A] Months -> Hours
  • 11. “The task of the software development team is to engineer the illusion of simplicity.” – Grady Booch Image credit: Object-Oriented Analysis and Design with Applications, by Grady Booch
  • 12. Complexity vs Abstraction Image credit: Wikipedia
  • 13. Appropriate Level of Abstraction Language Syntax & Semantics Degrees of Freedom Lower Abstraction Higher Abstraction define higher low er • Less flexible • Simpler syntax • Reuse • Suitable for complex problems • Difficult to use • More complex • Error prone ???
  • 14. Codename: Optimus Prime An AutoML library for building modular, reusable, strongly typed ML workflows on Spark • Declarative & intuitive syntax • Proper level of abstraction • Aimed for simplicity & reuse • >90% accuracy with 100X reduction in time
  • 15. Types Hide the Complexity FeatureType OPNumeric OPCollection OPSetOPList NonNullable Text Email Base64 Phone ID URL ComboBox PickList TextArea OPVector OPMap BinaryMap IntegralMap RealMap DateList DateTimeList Integral Real Binary Percent Currency Date DateTime MultiPickList TextMap TextListCity Street Country PostalCode Location State Geolocation StateMap SingleResponse RealNN Categorical MultiResponse Legend: bold - abstract type, regular - concrete type, italic - trait, solid line - inheritance, dashed line - trait mixin ... CityMap https://developer.salesforce.com/docs/atlas.en-us.api.meta/api/field_types.htm
  • 16. Type Safety Everywhere • Value operations • Feature operations • Transformation Pipelines (aka Workflows) // Typed value operations def tokenize(t: Text): TextList = t.map(_.split(" ")).toTextList // Typed feature operations val title = FeatureBuilder.Text[Book].extract(_.title).asPredictor val tokens: Feature[TextList] = title.map(tokenize) // Transformation pipelines new OpWorkflow().setInput(books).setResultFeatures(tokens.vectorize())
  • 17. Book Price Prediction // Raw feature definitions: author, title, description, price val authr = FeatureBuilder.PickList[Book].extract(_.author).asPredictor val title = FeatureBuilder.Text[Book].extract(_.title).asPredictor val descr = FeatureBuilder.Text[Book].extract(_.description).asPredictor val price = FeatureBuilder.RealNN[Book].extract(_.price).asResponse // Feature engineering: tokenize, tf-idf, vectorize, sanity check val tokns = (title + description).tokenize(removePunctuation = true) val tfidf = tokns.tf(numTerms = 1024).idf(minFreq = 0) val feats = Seq(tfidf, author).transmogrify() val chckd = feats.sanityCheck(price, removeBadFeatures = true) // Model training: spin up spark, load books.csv, define model -> train implicit val spark = SparkSession.builder.config(new SparkConf).getOrCreate val books = spark.read.csv("books.csv").as[Book] val preds = RegressionModelSelector().setInput(price, chckd).getOutput new OpWorkflow().setInput(books).setResultFeatures(chckd, preds) .withRawFeatureFilter().train()
  • 18. Book Price Prediction // Raw feature definitions: author, title, description, price val authr = FeatureBuilder.PickList[Book].extract(_.author).asPredictor val title = FeatureBuilder.Text[Book].extract(_.title).asPredictor val descr = FeatureBuilder.Text[Book].extract(_.description).asPredictor val price = FeatureBuilder.RealNN[Book].extract(_.price).asResponse // Feature engineering: tokenize, tf-idf, vectorize, sanity check val tokns = (title + description).tokenize(removePunctuation = true) val tfidf = tokns.tf(numTerms = 1024).idf(minFreq = 0) val feats = Seq(tfidf, author).transmogrify() // <- magic here val chckd = feats.sanityCheck(price, removeBadFeatures = true) // Model training: spin up spark, load books.csv, define model -> train implicit val spark = SparkSession.builder.config(new SparkConf).getOrCreate val books = spark.read.csv("books.csv").as[Book] val preds = RegressionModelSelector().setInput(price, chckd).getOutput new OpWorkflow().setInput(books).setResultFeatures(chckd, preds) .withRawFeatureFilter().train()
  • 19. Workflow Book Price Prediction BookBookBooks Author Title Description Price FeatureVector Transmogrify SanityChecker RawFeatureFilter Regression Model Selector Best Model Evaluator Cross Validation + Tokenize Pivot TF IDF Vectors Combiner
  • 20. AutoML with Optimus Prime • Automated Feature Engineering • Automated Feature Selection • Automated Model Selection
  • 21. Automatic Feature Engineering ZipcodeSubjectPhoneEmail Age Age [0-15] Age [15-35] Age [>35] Email Is Spammy Top Email Domains Country Code Phone Is Valid Top TF- IDF Terms Average Income Feature Vector
  • 22. Automatic Feature Engineering Numeric Categorical SpatialTemporal Augment with external data e.g avg income Spatial fraudulent behavior e.g: impossible travel speed Geo-encoding Text Time difference Circular Statistics Time extraction (day, week, month, year) Closeness to major events Tokenization Hash Encoding TF-IDF Word2Vec Sentiment Analysis Language Detection Imputation Track null value One Hot Encoding Dynamic Top K pivot Smart Binning LabelCount Encoding Category Embedding Imputation Track null value Log transformation for large range Scaling - zNormalize Smart Binning
  • 23. Metadata! ● The name of the feature the column was made from ● The name of the RAW feature(s) the column was made from ● Everything you did to get the column ● Any grouping information across columns ● Description of the value in the column Image credit: https://ontotext.com/knowledgehub/fundamentals/metadata-fundamental
  • 24. Automatic Feature Selection Analyze features & calculate statistics • Min, max, mean, variance • Correlations with label • Contingency matrix • Confidence & support • Mutual information against label • Point-wise mutual information against all label values • Cramér’s V • Etc. Image credit: http://csbiology.com/blog/statistics-for-biologists-nature-com
  • 25. Automatic Feature Selection • Ensure features have acceptable ranges • Is this feature a leaker? • Does this feature help our model? • Is it predictive? Image credit: ://srcity.org/2252/Find-Fix-Leaks
  • 26. Automatic Feature Selection // Sanity check your features against the label val checked = price.sanityCheck( featureVector = feats, checkSample = 0.3, sampleSeed = 42L, sampleLimit = 100000L, maxCorrelation = 0.95, minCorrelation = 0.0, correlationType = Pearson, minVariance = 0.00001, removeBadFeatures = true ) new OpWorkflow().setInput(books).setResultFeatures(checked, preds).train()
  • 27. Automatic Model Selection • Multiple algorithms to pick from • Many hyperparameters for each algorithm • Automated hyperparameter tuning • Faster model creation with improved metrics • Search algorithms to find the optimal hyperparameters. e.g. grid search, random search, bandit methods
  • 28. Automatic Model Selection // Model selection and hyperparameter tuning val preds = RegressionModelSelector .withCrossValidation( dataSplitter = DataSplitter(reserveTestFraction = 0.1), numFolds = 3, validationMetric = Evaluators.Regression.rmse(), trainTestEvaluators = Seq.empty, seed = 42L) .setModelsToTry(LinearRegression, RandomForestRegression) .setLinearRegressionElasticNetParam(0, 0.5, 1) .setLinearRegressionMaxIter(10, 100) .setLinearRegressionSolver(Solver.LBFGS) .setRandomForestMaxDepth(2, 10) .setRandomForestNumTrees(10) .setInput(price, checked).getOutput new OpWorkflow().setInput(books).setResultFeatures(checked, preds).train()
  • 31. How well does it work? • Most of our models deployed in production are completely hands free • 2B+ predictions per day
  • 32. • Define appropriate level of abstraction • Use types to express it • Automate everything • feature engineering & selection • model selection & evaluation • Etc. Takeaway Months -> Hours
  • 33. Join Optimus Prime pilot today! • Open for pilot • In production at Salesforce • Piloted by large tech companies • 100% Scala https://sfdc.co/join-op-pilot
  • 34. Talks for Further Exploration • “When all the world’s data scientists are just not enough”, Shubha Nabar, @Scale 17 • “Low touch Machine Learning”, Leah McGuire, Spark Summit 17 • “Fantastic ML apps and how to build them”, Matthew Tovbin, Scale By The Bay 17 • “The Black Swan of perfectly interpretable models”, Leah McGuire and Mayukh Bhaowal, QCon AI 18