SlideShare a Scribd company logo
1 of 35
Productionizing
Spark ML Pipelines
with PFA
Nick Pentreath
Principal Engineer
@MLnick
DBG / June 5, 2018 / © 2018 IBM Corporation
About
@MLnick on Twitter & Github
Principal Engineer, IBM
CODAIT - Center for Open-Source Data &
AI Technologies
Machine Learning & AI
Apache Spark committer & PMC
Author of Machine Learning with Spark
Various conferences & meetups
DBG / June 5, 2018 / © 2018 IBM Corporation
Center for Open Source Data
and AI Technologies
CODAIT
codait.org
DBG / June 5, 2018 / © 2018 IBM Corporation
CODAIT aims to make AI solutions
dramatically easier to create, deploy,
and manage in the enterprise
Relaunch of the Spark Technology
Center (STC) to reflect expanded
mission
Improving Enterprise AI Lifecycle in Open Source
Gather
Data
Analyze
Data
Machine
Learning
Deep
Learning
Deploy
Model
Maintain
Model
Python
Data Science
Stack
Fabric for
Deep Learning
(FfDL)
Mleap +
PFA
Scikit-LearnPandas
Apache
Spark
Apache
Spark
Jupyter
Model
Asset
eXchange
Keras +
Tensorflow
Agenda
The Machine Learning Workflow
Challenges of ML Deployment
Portable Format for Analytics
PFA for Spark ML
Performance Comparisons
Summary and Future Directions
DBG / June 5, 2018 / © 2018 IBM Corporation
The Machine Learning
Workflow
DBG / June 5, 2018 / © 2018 IBM Corporation
Perception
The Machine Learning Workflow
DBG / June 5, 2018 / © 2018 IBM Corporation
In reality the workflow spans teams …
The Machine Learning Workflow
DBG / June 5, 2018 / © 2018 IBM Corporation
… and tools …
The Machine Learning Workflow
DBG / June 5, 2018 / © 2018 IBM Corporation
… and is a small (but critical!)
piece of the puzzle
The Machine Learning Workflow
DBG / June 5, 2018 / © 2018 IBM Corporation
*Source: Hidden Technical Debt in Machine Learning Systems
Machine Learning
Deployment
DBG / June 5, 2018 / © 2018 IBM Corporation
Challenges
DBG / June 5, 2018 / © 2018 IBM Corporation
• Proliferation of formats
• Lack of standardization => custom solutions
• Existing standards have limitations
Machine Learning Deployment
• Need to manage and bridge many different:
• Languages - Python, R, Notebooks, Scala / Java / C
• Frameworks – too many to count!
• Dependencies
• Versions
• Performance characteristics highly variable
across these dimensions
• Friction between teams
• Data scientists & researchers – latest & greatest
• Production – stability, control, performance
• Business – metrics, business impact, working product!
Spark solves many problems …
Machine Learning Deployment
DBG / June 5, 2018 / © 2018 IBM Corporation
… but introduces additional challenges
DBG / June 5, 2018 / © 2018 IBM Corporation
• Currently, in order to use trained models
outside of Spark, users must:
• Write custom readers for Spark’s native format; or
• Create their own custom format; or
• Export to a standard format (limited support => custom
solution)
• Scoring outside of Spark also requires custom
translation layer between Spark and another
ML library
Everything is custom!
Machine Learning Deployment
• Tight coupling to Spark runtime
• Introduces complex dependencies
• Managing version & compatibility issues
• Scoring models in Spark is slow
• Overhead of DataFrames, especially query planning
• Overhead of task scheduling, even locally
• Optimized for batch scoring (includes streaming “micro-
batch” settings)
• Spark is not suitable for real-time scoring (<
few 100ms latency)
The Portable Format for
Analytics
DBG / June 5, 2018 / © 2018 IBM Corporation
Overview
DBG / June 5, 2018 / © 2018 IBM Corporation
• PFA is being championed by the Data Mining
Group (IBM is a founding member)
• DMG previously created PMML (Predictive
Model Markup Language), arguably the only
viable open standard currently
• PMML has many limitations
• PFA was created specifically to address these
shortcomings
• PFA consists of:
• JSON serialization format
• AVRO schemas for data types
• Encodes functions (actions) that are applied to inputs to
create outputs with a set of built-in functions and
language constructs (e.g. control-flow, conditionals)
• Essentially a mini functional math language + schema
specification
• Type and function system means PFA can be
fully & statically verified on load and run by
any compliant execution engine
• => portability across languages, frameworks,
run times and versions
Portable Format for Analytics
A Simple Example
DBG / June 5, 2018 / © 2018 IBM Corporation
Portable Format for Analytics
• Example – multi-class logistic regression
• Specify input and output types using Avro
schemas
• Specify the action to perform (typically on
input)
Managing State
DBG / June 5, 2018 / © 2018 IBM Corporation
Portable Format for Analytics
• Data storage specified by cells
• A cell is a named value acting as a global variable
• Typically used to store state (such as model
coefficients, vocabulary mappings, etc)
• Types specified with Avro schemas
• Cell values are mutable within an action, but immutable
between action executions of a given PFA document
• Persistent storage specified by pools
• Closer in concept to a database
• Pools values are mutable across action executions
Other Features
DBG / June 5, 2018 / © 2018 IBM Corporation
Portable Format for Analytics
• Special forms
• Control structure – conditionals & loops
• Creating and manipulating local variables
• User-defined functions including lambdas
• Casts
• Null checks
• (Very) basic try-catch, user-defined errors and logs
• Comprehensive built-in function library
• Math, strings, arrays, maps, stats, linear algebra
• Built-in support for some common models - decision
tree, clustering, linear models
Aardpfark
DBG / June 5, 2018 / © 2018 IBM Corporation
PFA and Spark ML
• PFA export for Spark ML pipelines
• aardpfark-core: Scala DSL for creating PFA
documents
• avro4s to generate schemas from case classes;
json4s to serialize PFA document to JSON
• aardpfark-sparkml: uses DSL to export Spark ML
components and pipelines to PFA
Aardpfark - Challenges
DBG / June 5, 2018 / © 2018 IBM Corporation
PFA and Spark ML
• Spark ML Model has no schema knowledge
• E.g. Binarizer can operate on numeric or vector
columns
• Need to use Avro union types for standalone PFA
components and handle all cases in the action logic
• Combining components into a pipeline
• Trying to match Spark’s DataFrame-based input/output
behavior (typically appending columns)
• Each component is wrapped as a user-defined function
in the PFA document
• Current approach mimics passing a Row (i.e. Avro
record) from function to function, adding fields
• Missing features in PFA
• Generic vector support (mixed dense/sparse)
Related Open Standards
DBG / June 5, 2018 / © 2018 IBM Corporation
PMML
DBG / June 5, 2018 / © 2018 IBM Corporation
Related Open Standards
• Data Mining Group (DMG)
• Model interchange format in XML with
operators
• Widely used and supported; open standard
• Spark support lacking natively but 3rd party
projects available: jpmml-sparkml
• Comprehensive support for Spark ML components
(perhaps surprisingly!)
• Watch SPARK-11237
• Other exporters include scikit-learn, R,
XGBoost and LightGBM
• Shortcomings
• Cannot represent arbitrary programs / analytic
applications
• Flexibility comes from custom plugins => lose
benefits of standardization
MLeap
DBG / June 5, 2018 / © 2018 IBM Corporation
Related Open Standards
• Created by Combust.ML, a startup focused on
ML model serving
• Model interchange format in JSON / Protobuf
• Components implemented in Scala code
• Good performance
• Initially focused on Spark ML. Offers almost
complete support for Spark ML components
• Recently added some sklearn; working on
TensorFlow
• Shortcomings
• “Open” format, but not a “standard”
• No concept of well-defined operators / functions
• Effectively forces a tight coupling between
versions of model producer / consumer
• Must implement custom components in Scala
• Impossible to statically verify serialized model
Open Neural Network Exchange (ONNX)
DBG / June 5, 2018 / © 2018 IBM Corporation
• Championed by Facebook & Microsoft
• Protobuf serialization format
• Describes computation graph (including
operators)
• In this way the serialized graph is “self-describing”
similarly to PFA
• More focused on Deep Learning / tensor
operations
• Will be baked into PyTorch 1.0.0 / Caffe2 as
the serialization & interchange format
• Shortcomings
• No or poor support for more “traditional” ML or
language constructs (currently)
• Tree-based models & ensembles
• String / categorical processing
• Control flow
• Intermediate variables
Related Open Standards
Performance
DBG / June 5, 2018 / © 2018 IBM Corporation
Scoring Performance Comparison
DBG / June 5, 2018 / © 2018 IBM Corporation
• Comparing scoring performance of PFA with
Spark and MLeap
• PFA uses Hadrian reference implementation
for JVM
• Test dataset of ~80,000 records
• String indexing of 47 categorical columns
• Vector assembling the 47 categorical indices together
with 27 numerical columns
• Linear regression predictor
• Note: Spark time is 1.9s / record (1901ms) -
not shown on the chart
Performance
Conclusion
DBG / June 5, 2018 / © 2018 IBM Corporation
Summary
DBG / June 5, 2018 / © 2018 IBM Corporation
Conclusion
• PFA provides an open standard for
serialization and deployment of analytic
workflows
• True portability across languages, frameworks,
runtimes and versions
• Execution environment is independent of the producer
(R, scikit-learn, Spark ML, weka, etc)
• Solves a significant pain point for the
deployment of Spark ML pipelines
• Also benefits the wider ecosystem
• e.g. many currently use PMML for exporting models
from R, scikit-learn, XGBoost, LightGBM, etc.
• However there are risks
• PFA is still young and needs to gain adoption
• Performance in production, at scale, is relatively
untested
• Tests indicate PFA reference engines need some work
on robustness and performance
• What about Deep Learning / comparison to ONNX?
• Limitations of PFA
• A standard can move slowly in terms of new features,
fixes and enhancements
Open-sourcing Aardpfark
DBG / June 5, 2018 / © 2018 IBM Corporation
Conclusion
• Coverage
• Almost all predictors (ML models)
• Many feature transformers
• Pipeline support (still needs work)
• Equivalence tests Spark <-> PFA
• Tests for core Scala DSL
https://github.com/CODAIT/aardpfark
• Need your help!
• Finish implementing components
• Improve pipeline support
• Complete Scala DSL for PFA
• Python support
• Tests, docs, testing it out!
Future directions
DBG / June 5, 2018 / © 2018 IBM Corporation
Conclusion
• Initially focused on Spark ML pipelines
• Later add support for scikit-learn pipelines, XGBoost,
LightGBM, etc
• (Support for many R models exist already in the
Hadrian project)
• Further performance testing in progress vs
Spark & MLeap; also test vs PMML
• More automated translation (Scala -> PFA,
ASTs etc)
• Propose improvements to PFA
• Generic vector (tensor) support
• Less cumbersome schema definitions
• Performance improvements to scoring engine
• PFA for Deep Learning?
• Comparing to ONNX and other emerging standards
• Better suited for the more general pre-processing steps
of DL pipelines
• Requires all the various DL-specific operators
• Requires tensor schema and better tensor support
built-in to the PFA spec
• Should have GPU support
Thank you!
codait.org
twitter.com/MLnick
github.com/MLnick
developer.ibm.com/code
DBG / June 5, 2018 / © 2018 IBM Corporation
FfDL
Sign up for IBM Cloud and try Watson Studio!
https://datascience.ibm.com/
MAX
Links & References
Portable Format for Analytics
PMML
Spark MLlib – Saving and Loading Pipelines
Hadrian – Reference Implementation of PFA Engines for
JVM, Python, R
jpmml-sparkml
MLeap
Open Neural Network Exchange
DBG / June 5, 2018 / © 2018 IBM Corporation
Date, Time, Location & Duration Session title and Speaker
Tue, June 5 | 11 AM
2010-2012, 30 mins
Productionizing Spark ML Pipelines with the Portable Format for Analytics
Nick Pentreath (IBM)
Tue, June 5 | 2 PM
2018, 30 mins
Making PySpark Amazing—From Faster UDFs to Dependency Management and Graphing!
Holden Karau (Google) Bryan Cutler (IBM)
Tue, June 5 | 2 PM
Nook by 2001, 30 mins
Making Data and AI Accessible for All
Armand Ruiz Gabernet (IBM)
Tue, June 5 | 2:40 PM
2002-2004, 30 mins
Cognitive Database: An Apache Spark-Based AI-Enabled Relational Database System
Rajesh Bordawekar (IBM T.J. Watson Research Center)
Tue, June 5 | 3:20 PM
3016-3022, 30 mins
Dynamic Priorities for Apache Spark Application’s Resource Allocations
Michael Feiman (IBM Spectrum Computing) Shinnosuke Okada (IBM Canada Ltd.)
Tue, June 5 | 3:20 PM
2001-2005, 30 mins
Model Parallelism in Spark ML Cross-Validation
Nick Pentreath (IBM) Bryan Cutler (IBM)
Tue, June 5 | 3:20 PM
2007, 30 mins
Serverless Machine Learning on Modern Hardware Using Apache Spark
Patrick Stuedi (IBM)
Tue, June 5 | 5:40 PM
2002-2004, 30 mins
Create a Loyal Customer Base by Knowing Their Personality Using AI-Based Personality Recommendation Engine;
Sourav Mazumder (IBM Analytics) Aradhna Tiwari (University of South Florida)
Tue, June 5 | 5:40 PM
2007, 30 mins
Transparent GPU Exploitation on Apache Spark
Dr. Kazuaki Ishizaki (IBM) Madhusudanan Kandasamy (IBM)
Tue, June 5 | 5:40 PM
2009-2011, 30 mins
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for Deep Neural Networks
Yonggang Hu (IBM) Chao Xue (IBM)
IBM Sessions at Spark+AI Summit 2018 (Tuesday, June 5)
DBG / June 5, 2018 / © 2018 IBM Corporation
Date, Time, Location & Duration Session title and Speaker
Wed, June 6 | 12:50 PM Birds of a Feather: Apache Arrow in Spark and More
Bryan Cutler (IBM) Li Jin (Two Sigma Investments, LP)
Wed, June 6 | 2 PM
2002-2004, 30 mins
Deep Learning for Recommender Systems
Nick Pentreath (IBM) )
Wed, June 6 | 3:20 PM
2018, 30 mins
Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer
Frederick Reiss (IBM) Vijay Bommireddipalli (IBM Center for Open-Source Data & AI Technologies)
IBM Sessions at Spark+AI Summit 2018 (Wednesday, June 6)
DBG / June 5, 2018 / © 2018 IBM Corporation
Meet us at IBM booth in the Expo area.
DBG / June 5, 2018 / © 2018 IBM Corporation

More Related Content

What's hot

Vertex AI: Pipelines for your MLOps workflows
Vertex AI: Pipelines for your MLOps workflowsVertex AI: Pipelines for your MLOps workflows
Vertex AI: Pipelines for your MLOps workflowsMárton Kodok
 
Grokking Techtalk #40: AWS’s philosophy on designing MLOps platform
Grokking Techtalk #40: AWS’s philosophy on designing MLOps platformGrokking Techtalk #40: AWS’s philosophy on designing MLOps platform
Grokking Techtalk #40: AWS’s philosophy on designing MLOps platformGrokking VN
 
Magdalena Stenius: MLOPS Will Change Machine Learning
Magdalena Stenius: MLOPS Will Change Machine LearningMagdalena Stenius: MLOPS Will Change Machine Learning
Magdalena Stenius: MLOPS Will Change Machine LearningLviv Startup Club
 
How to build high frequency trading with our matlab secrets with c++ and mysql
How to build high frequency trading with our matlab secrets with c++ and mysqlHow to build high frequency trading with our matlab secrets with c++ and mysql
How to build high frequency trading with our matlab secrets with c++ and mysqlBryan Downing
 
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...Databricks
 
Rakuten Ichiba_Rakuten Technology Conference 2016
Rakuten Ichiba_Rakuten Technology Conference 2016Rakuten Ichiba_Rakuten Technology Conference 2016
Rakuten Ichiba_Rakuten Technology Conference 2016Rakuten Group, Inc.
 
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...Flink Forward
 
Pentaho PDI and the Jare Ruleengine
Pentaho PDI and the Jare RuleenginePentaho PDI and the Jare Ruleengine
Pentaho PDI and the Jare Ruleengineuwe geercken
 
Flink Forward San Francisco 2018: Xu Yang - "Alibaba’s common algorithm platf...
Flink Forward San Francisco 2018: Xu Yang - "Alibaba’s common algorithm platf...Flink Forward San Francisco 2018: Xu Yang - "Alibaba’s common algorithm platf...
Flink Forward San Francisco 2018: Xu Yang - "Alibaba’s common algorithm platf...Flink Forward
 
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...HostedbyConfluent
 
Flink for Everyone: Self Service Data Analytics with StreamPipes - Philipp Ze...
Flink for Everyone: Self Service Data Analytics with StreamPipes - Philipp Ze...Flink for Everyone: Self Service Data Analytics with StreamPipes - Philipp Ze...
Flink for Everyone: Self Service Data Analytics with StreamPipes - Philipp Ze...Flink Forward
 
Next18 Extended Targu Mures - Bringing the Cloud to you
Next18 Extended Targu Mures - Bringing the Cloud to youNext18 Extended Targu Mures - Bringing the Cloud to you
Next18 Extended Targu Mures - Bringing the Cloud to youMárton Kodok
 
Vertex AI - Unified ML Platform for the entire AI workflow on Google Cloud
Vertex AI - Unified ML Platform for the entire AI workflow on Google CloudVertex AI - Unified ML Platform for the entire AI workflow on Google Cloud
Vertex AI - Unified ML Platform for the entire AI workflow on Google CloudMárton Kodok
 
How to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
How to Utilize MLflow and Kubernetes to Build an Enterprise ML PlatformHow to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
How to Utilize MLflow and Kubernetes to Build an Enterprise ML PlatformDatabricks
 
High Performance Transfer Learning for Classifying Intent of Sales Engagement...
High Performance Transfer Learning for Classifying Intent of Sales Engagement...High Performance Transfer Learning for Classifying Intent of Sales Engagement...
High Performance Transfer Learning for Classifying Intent of Sales Engagement...Databricks
 
Big Data Solutions in Azure - David Giard
Big Data Solutions in Azure - David GiardBig Data Solutions in Azure - David Giard
Big Data Solutions in Azure - David GiardITCamp
 
Building A Self Service Streaming Platform at Pinterest - Steven Bairos-Novak...
Building A Self Service Streaming Platform at Pinterest - Steven Bairos-Novak...Building A Self Service Streaming Platform at Pinterest - Steven Bairos-Novak...
Building A Self Service Streaming Platform at Pinterest - Steven Bairos-Novak...Flink Forward
 

What's hot (20)

Image Management at Ebates
Image Management at EbatesImage Management at Ebates
Image Management at Ebates
 
Vertex AI: Pipelines for your MLOps workflows
Vertex AI: Pipelines for your MLOps workflowsVertex AI: Pipelines for your MLOps workflows
Vertex AI: Pipelines for your MLOps workflows
 
Grokking Techtalk #40: AWS’s philosophy on designing MLOps platform
Grokking Techtalk #40: AWS’s philosophy on designing MLOps platformGrokking Techtalk #40: AWS’s philosophy on designing MLOps platform
Grokking Techtalk #40: AWS’s philosophy on designing MLOps platform
 
Magdalena Stenius: MLOPS Will Change Machine Learning
Magdalena Stenius: MLOPS Will Change Machine LearningMagdalena Stenius: MLOPS Will Change Machine Learning
Magdalena Stenius: MLOPS Will Change Machine Learning
 
How to build high frequency trading with our matlab secrets with c++ and mysql
How to build high frequency trading with our matlab secrets with c++ and mysqlHow to build high frequency trading with our matlab secrets with c++ and mysql
How to build high frequency trading with our matlab secrets with c++ and mysql
 
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
Model Parallelism in Spark ML Cross-Validation with Nick Pentreath and Bryan ...
 
Rakuten Ichiba_Rakuten Technology Conference 2016
Rakuten Ichiba_Rakuten Technology Conference 2016Rakuten Ichiba_Rakuten Technology Conference 2016
Rakuten Ichiba_Rakuten Technology Conference 2016
 
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
 
Pentaho PDI and the Jare Ruleengine
Pentaho PDI and the Jare RuleenginePentaho PDI and the Jare Ruleengine
Pentaho PDI and the Jare Ruleengine
 
Flink Forward San Francisco 2018: Xu Yang - "Alibaba’s common algorithm platf...
Flink Forward San Francisco 2018: Xu Yang - "Alibaba’s common algorithm platf...Flink Forward San Francisco 2018: Xu Yang - "Alibaba’s common algorithm platf...
Flink Forward San Francisco 2018: Xu Yang - "Alibaba’s common algorithm platf...
 
The HANA Cloud Platform
The HANA Cloud PlatformThe HANA Cloud Platform
The HANA Cloud Platform
 
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
 
Flink for Everyone: Self Service Data Analytics with StreamPipes - Philipp Ze...
Flink for Everyone: Self Service Data Analytics with StreamPipes - Philipp Ze...Flink for Everyone: Self Service Data Analytics with StreamPipes - Philipp Ze...
Flink for Everyone: Self Service Data Analytics with StreamPipes - Philipp Ze...
 
Next18 Extended Targu Mures - Bringing the Cloud to you
Next18 Extended Targu Mures - Bringing the Cloud to youNext18 Extended Targu Mures - Bringing the Cloud to you
Next18 Extended Targu Mures - Bringing the Cloud to you
 
Vertex AI - Unified ML Platform for the entire AI workflow on Google Cloud
Vertex AI - Unified ML Platform for the entire AI workflow on Google CloudVertex AI - Unified ML Platform for the entire AI workflow on Google Cloud
Vertex AI - Unified ML Platform for the entire AI workflow on Google Cloud
 
How to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
How to Utilize MLflow and Kubernetes to Build an Enterprise ML PlatformHow to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
How to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
 
High Performance Transfer Learning for Classifying Intent of Sales Engagement...
High Performance Transfer Learning for Classifying Intent of Sales Engagement...High Performance Transfer Learning for Classifying Intent of Sales Engagement...
High Performance Transfer Learning for Classifying Intent of Sales Engagement...
 
Knime &amp; bioinformatics
Knime &amp; bioinformaticsKnime &amp; bioinformatics
Knime &amp; bioinformatics
 
Big Data Solutions in Azure - David Giard
Big Data Solutions in Azure - David GiardBig Data Solutions in Azure - David Giard
Big Data Solutions in Azure - David Giard
 
Building A Self Service Streaming Platform at Pinterest - Steven Bairos-Novak...
Building A Self Service Streaming Platform at Pinterest - Steven Bairos-Novak...Building A Self Service Streaming Platform at Pinterest - Steven Bairos-Novak...
Building A Self Service Streaming Platform at Pinterest - Steven Bairos-Novak...
 

Similar to Productionizing Spark ML Pipelines with PFA

Productionizing Spark ML Pipelines with the Portable Format for Analytics wit...
Productionizing Spark ML Pipelines with the Portable Format for Analytics wit...Productionizing Spark ML Pipelines with the Portable Format for Analytics wit...
Productionizing Spark ML Pipelines with the Portable Format for Analytics wit...Databricks
 
Productionizing Spark ML pipelines with the portable format for analytics
Productionizing Spark ML pipelines with the portable format for analyticsProductionizing Spark ML pipelines with the portable format for analytics
Productionizing Spark ML pipelines with the portable format for analyticsDataWorks Summit
 
Index conf sparkai-feb20-n-pentreath
Index conf sparkai-feb20-n-pentreathIndex conf sparkai-feb20-n-pentreath
Index conf sparkai-feb20-n-pentreathChester Chen
 
Open, Secure & Transparent AI Pipelines
Open, Secure & Transparent AI PipelinesOpen, Secure & Transparent AI Pipelines
Open, Secure & Transparent AI PipelinesNick Pentreath
 
Optimizing your SparkML pipelines using the latest features in Spark 2.3
Optimizing your SparkML pipelines using the latest features in Spark 2.3Optimizing your SparkML pipelines using the latest features in Spark 2.3
Optimizing your SparkML pipelines using the latest features in Spark 2.3DataWorks Summit
 
IBM Developer Model Asset eXchange
IBM Developer Model Asset eXchangeIBM Developer Model Asset eXchange
IBM Developer Model Asset eXchangeNick Pentreath
 
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFramesApache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFramesDatabricks
 
ML Best Practices: Prepare Data, Build Models, and Manage Lifecycle (AIM396-S...
ML Best Practices: Prepare Data, Build Models, and Manage Lifecycle (AIM396-S...ML Best Practices: Prepare Data, Build Models, and Manage Lifecycle (AIM396-S...
ML Best Practices: Prepare Data, Build Models, and Manage Lifecycle (AIM396-S...Amazon Web Services
 
20151015 zagreb spark_notebooks
20151015 zagreb spark_notebooks20151015 zagreb spark_notebooks
20151015 zagreb spark_notebooksAndrey Vykhodtsev
 
Data Science Salon: A Journey of Deploying a Data Science Engine to Production
Data Science Salon: A Journey of Deploying a Data Science Engine to ProductionData Science Salon: A Journey of Deploying a Data Science Engine to Production
Data Science Salon: A Journey of Deploying a Data Science Engine to ProductionFormulatedby
 
Index conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreathIndex conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreathChester Chen
 
iSeries Modernization: RPG/400 to Java Migration
iSeries Modernization: RPG/400 to Java MigrationiSeries Modernization: RPG/400 to Java Migration
iSeries Modernization: RPG/400 to Java Migrationecubemarketing
 
Apache Spark Performance Observations
Apache Spark Performance ObservationsApache Spark Performance Observations
Apache Spark Performance ObservationsAdam Roberts
 
IBM Runtimes Performance Observations with Apache Spark
IBM Runtimes Performance Observations with Apache SparkIBM Runtimes Performance Observations with Apache Spark
IBM Runtimes Performance Observations with Apache SparkAdamRobertsIBM
 
How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...
How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...
How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...Jean Ihm
 
Apache Big Data Europe 2016
Apache Big Data Europe 2016Apache Big Data Europe 2016
Apache Big Data Europe 2016Tim Ellison
 
FDMEE versus Cloud Data Management - The Real Story
FDMEE versus Cloud Data Management - The Real StoryFDMEE versus Cloud Data Management - The Real Story
FDMEE versus Cloud Data Management - The Real StoryJoseph Alaimo Jr
 
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...Amazon Web Services
 
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...AWS Summits
 
Papyrus for System Engineering - Papyrus for Real Time v1.0
Papyrus for System Engineering - Papyrus for Real Time v1.0Papyrus for System Engineering - Papyrus for Real Time v1.0
Papyrus for System Engineering - Papyrus for Real Time v1.0Charles Rivet
 

Similar to Productionizing Spark ML Pipelines with PFA (20)

Productionizing Spark ML Pipelines with the Portable Format for Analytics wit...
Productionizing Spark ML Pipelines with the Portable Format for Analytics wit...Productionizing Spark ML Pipelines with the Portable Format for Analytics wit...
Productionizing Spark ML Pipelines with the Portable Format for Analytics wit...
 
Productionizing Spark ML pipelines with the portable format for analytics
Productionizing Spark ML pipelines with the portable format for analyticsProductionizing Spark ML pipelines with the portable format for analytics
Productionizing Spark ML pipelines with the portable format for analytics
 
Index conf sparkai-feb20-n-pentreath
Index conf sparkai-feb20-n-pentreathIndex conf sparkai-feb20-n-pentreath
Index conf sparkai-feb20-n-pentreath
 
Open, Secure & Transparent AI Pipelines
Open, Secure & Transparent AI PipelinesOpen, Secure & Transparent AI Pipelines
Open, Secure & Transparent AI Pipelines
 
Optimizing your SparkML pipelines using the latest features in Spark 2.3
Optimizing your SparkML pipelines using the latest features in Spark 2.3Optimizing your SparkML pipelines using the latest features in Spark 2.3
Optimizing your SparkML pipelines using the latest features in Spark 2.3
 
IBM Developer Model Asset eXchange
IBM Developer Model Asset eXchangeIBM Developer Model Asset eXchange
IBM Developer Model Asset eXchange
 
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFramesApache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
 
ML Best Practices: Prepare Data, Build Models, and Manage Lifecycle (AIM396-S...
ML Best Practices: Prepare Data, Build Models, and Manage Lifecycle (AIM396-S...ML Best Practices: Prepare Data, Build Models, and Manage Lifecycle (AIM396-S...
ML Best Practices: Prepare Data, Build Models, and Manage Lifecycle (AIM396-S...
 
20151015 zagreb spark_notebooks
20151015 zagreb spark_notebooks20151015 zagreb spark_notebooks
20151015 zagreb spark_notebooks
 
Data Science Salon: A Journey of Deploying a Data Science Engine to Production
Data Science Salon: A Journey of Deploying a Data Science Engine to ProductionData Science Salon: A Journey of Deploying a Data Science Engine to Production
Data Science Salon: A Journey of Deploying a Data Science Engine to Production
 
Index conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreathIndex conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreath
 
iSeries Modernization: RPG/400 to Java Migration
iSeries Modernization: RPG/400 to Java MigrationiSeries Modernization: RPG/400 to Java Migration
iSeries Modernization: RPG/400 to Java Migration
 
Apache Spark Performance Observations
Apache Spark Performance ObservationsApache Spark Performance Observations
Apache Spark Performance Observations
 
IBM Runtimes Performance Observations with Apache Spark
IBM Runtimes Performance Observations with Apache SparkIBM Runtimes Performance Observations with Apache Spark
IBM Runtimes Performance Observations with Apache Spark
 
How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...
How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...
How To Model and Construct Graphs with Oracle Database (AskTOM Office Hours p...
 
Apache Big Data Europe 2016
Apache Big Data Europe 2016Apache Big Data Europe 2016
Apache Big Data Europe 2016
 
FDMEE versus Cloud Data Management - The Real Story
FDMEE versus Cloud Data Management - The Real StoryFDMEE versus Cloud Data Management - The Real Story
FDMEE versus Cloud Data Management - The Real Story
 
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
 
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
 
Papyrus for System Engineering - Papyrus for Real Time v1.0
Papyrus for System Engineering - Papyrus for Real Time v1.0Papyrus for System Engineering - Papyrus for Real Time v1.0
Papyrus for System Engineering - Papyrus for Real Time v1.0
 

More from Nick Pentreath

Scaling up deep learning by scaling down
Scaling up deep learning by scaling downScaling up deep learning by scaling down
Scaling up deep learning by scaling downNick Pentreath
 
End-to-End Deep Learning Deployment with ONNX
End-to-End Deep Learning Deployment with ONNXEnd-to-End Deep Learning Deployment with ONNX
End-to-End Deep Learning Deployment with ONNXNick Pentreath
 
IBM Developer Model Asset eXchange - Deep Learning for Everyone
IBM Developer Model Asset eXchange - Deep Learning for EveryoneIBM Developer Model Asset eXchange - Deep Learning for Everyone
IBM Developer Model Asset eXchange - Deep Learning for EveryoneNick Pentreath
 
Search and Recommendations: 3 Sides of the Same Coin
Search and Recommendations: 3 Sides of the Same CoinSearch and Recommendations: 3 Sides of the Same Coin
Search and Recommendations: 3 Sides of the Same CoinNick Pentreath
 
Deep Learning for Recommender Systems
Deep Learning for Recommender SystemsDeep Learning for Recommender Systems
Deep Learning for Recommender SystemsNick Pentreath
 
RNNs for Recommendations and Personalization
RNNs for Recommendations and PersonalizationRNNs for Recommendations and Personalization
RNNs for Recommendations and PersonalizationNick Pentreath
 

More from Nick Pentreath (6)

Scaling up deep learning by scaling down
Scaling up deep learning by scaling downScaling up deep learning by scaling down
Scaling up deep learning by scaling down
 
End-to-End Deep Learning Deployment with ONNX
End-to-End Deep Learning Deployment with ONNXEnd-to-End Deep Learning Deployment with ONNX
End-to-End Deep Learning Deployment with ONNX
 
IBM Developer Model Asset eXchange - Deep Learning for Everyone
IBM Developer Model Asset eXchange - Deep Learning for EveryoneIBM Developer Model Asset eXchange - Deep Learning for Everyone
IBM Developer Model Asset eXchange - Deep Learning for Everyone
 
Search and Recommendations: 3 Sides of the Same Coin
Search and Recommendations: 3 Sides of the Same CoinSearch and Recommendations: 3 Sides of the Same Coin
Search and Recommendations: 3 Sides of the Same Coin
 
Deep Learning for Recommender Systems
Deep Learning for Recommender SystemsDeep Learning for Recommender Systems
Deep Learning for Recommender Systems
 
RNNs for Recommendations and Personalization
RNNs for Recommendations and PersonalizationRNNs for Recommendations and Personalization
RNNs for Recommendations and Personalization
 

Recently uploaded

Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 

Recently uploaded (20)

Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 

Productionizing Spark ML Pipelines with PFA

  • 1. Productionizing Spark ML Pipelines with PFA Nick Pentreath Principal Engineer @MLnick DBG / June 5, 2018 / © 2018 IBM Corporation
  • 2. About @MLnick on Twitter & Github Principal Engineer, IBM CODAIT - Center for Open-Source Data & AI Technologies Machine Learning & AI Apache Spark committer & PMC Author of Machine Learning with Spark Various conferences & meetups DBG / June 5, 2018 / © 2018 IBM Corporation
  • 3. Center for Open Source Data and AI Technologies CODAIT codait.org DBG / June 5, 2018 / © 2018 IBM Corporation CODAIT aims to make AI solutions dramatically easier to create, deploy, and manage in the enterprise Relaunch of the Spark Technology Center (STC) to reflect expanded mission Improving Enterprise AI Lifecycle in Open Source Gather Data Analyze Data Machine Learning Deep Learning Deploy Model Maintain Model Python Data Science Stack Fabric for Deep Learning (FfDL) Mleap + PFA Scikit-LearnPandas Apache Spark Apache Spark Jupyter Model Asset eXchange Keras + Tensorflow
  • 4. Agenda The Machine Learning Workflow Challenges of ML Deployment Portable Format for Analytics PFA for Spark ML Performance Comparisons Summary and Future Directions DBG / June 5, 2018 / © 2018 IBM Corporation
  • 5. The Machine Learning Workflow DBG / June 5, 2018 / © 2018 IBM Corporation
  • 6. Perception The Machine Learning Workflow DBG / June 5, 2018 / © 2018 IBM Corporation
  • 7. In reality the workflow spans teams … The Machine Learning Workflow DBG / June 5, 2018 / © 2018 IBM Corporation
  • 8. … and tools … The Machine Learning Workflow DBG / June 5, 2018 / © 2018 IBM Corporation
  • 9. … and is a small (but critical!) piece of the puzzle The Machine Learning Workflow DBG / June 5, 2018 / © 2018 IBM Corporation *Source: Hidden Technical Debt in Machine Learning Systems
  • 10. Machine Learning Deployment DBG / June 5, 2018 / © 2018 IBM Corporation
  • 11. Challenges DBG / June 5, 2018 / © 2018 IBM Corporation • Proliferation of formats • Lack of standardization => custom solutions • Existing standards have limitations Machine Learning Deployment • Need to manage and bridge many different: • Languages - Python, R, Notebooks, Scala / Java / C • Frameworks – too many to count! • Dependencies • Versions • Performance characteristics highly variable across these dimensions • Friction between teams • Data scientists & researchers – latest & greatest • Production – stability, control, performance • Business – metrics, business impact, working product!
  • 12. Spark solves many problems … Machine Learning Deployment DBG / June 5, 2018 / © 2018 IBM Corporation
  • 13. … but introduces additional challenges DBG / June 5, 2018 / © 2018 IBM Corporation • Currently, in order to use trained models outside of Spark, users must: • Write custom readers for Spark’s native format; or • Create their own custom format; or • Export to a standard format (limited support => custom solution) • Scoring outside of Spark also requires custom translation layer between Spark and another ML library Everything is custom! Machine Learning Deployment • Tight coupling to Spark runtime • Introduces complex dependencies • Managing version & compatibility issues • Scoring models in Spark is slow • Overhead of DataFrames, especially query planning • Overhead of task scheduling, even locally • Optimized for batch scoring (includes streaming “micro- batch” settings) • Spark is not suitable for real-time scoring (< few 100ms latency)
  • 14. The Portable Format for Analytics DBG / June 5, 2018 / © 2018 IBM Corporation
  • 15. Overview DBG / June 5, 2018 / © 2018 IBM Corporation • PFA is being championed by the Data Mining Group (IBM is a founding member) • DMG previously created PMML (Predictive Model Markup Language), arguably the only viable open standard currently • PMML has many limitations • PFA was created specifically to address these shortcomings • PFA consists of: • JSON serialization format • AVRO schemas for data types • Encodes functions (actions) that are applied to inputs to create outputs with a set of built-in functions and language constructs (e.g. control-flow, conditionals) • Essentially a mini functional math language + schema specification • Type and function system means PFA can be fully & statically verified on load and run by any compliant execution engine • => portability across languages, frameworks, run times and versions Portable Format for Analytics
  • 16. A Simple Example DBG / June 5, 2018 / © 2018 IBM Corporation Portable Format for Analytics • Example – multi-class logistic regression • Specify input and output types using Avro schemas • Specify the action to perform (typically on input)
  • 17. Managing State DBG / June 5, 2018 / © 2018 IBM Corporation Portable Format for Analytics • Data storage specified by cells • A cell is a named value acting as a global variable • Typically used to store state (such as model coefficients, vocabulary mappings, etc) • Types specified with Avro schemas • Cell values are mutable within an action, but immutable between action executions of a given PFA document • Persistent storage specified by pools • Closer in concept to a database • Pools values are mutable across action executions
  • 18. Other Features DBG / June 5, 2018 / © 2018 IBM Corporation Portable Format for Analytics • Special forms • Control structure – conditionals & loops • Creating and manipulating local variables • User-defined functions including lambdas • Casts • Null checks • (Very) basic try-catch, user-defined errors and logs • Comprehensive built-in function library • Math, strings, arrays, maps, stats, linear algebra • Built-in support for some common models - decision tree, clustering, linear models
  • 19. Aardpfark DBG / June 5, 2018 / © 2018 IBM Corporation PFA and Spark ML • PFA export for Spark ML pipelines • aardpfark-core: Scala DSL for creating PFA documents • avro4s to generate schemas from case classes; json4s to serialize PFA document to JSON • aardpfark-sparkml: uses DSL to export Spark ML components and pipelines to PFA
  • 20. Aardpfark - Challenges DBG / June 5, 2018 / © 2018 IBM Corporation PFA and Spark ML • Spark ML Model has no schema knowledge • E.g. Binarizer can operate on numeric or vector columns • Need to use Avro union types for standalone PFA components and handle all cases in the action logic • Combining components into a pipeline • Trying to match Spark’s DataFrame-based input/output behavior (typically appending columns) • Each component is wrapped as a user-defined function in the PFA document • Current approach mimics passing a Row (i.e. Avro record) from function to function, adding fields • Missing features in PFA • Generic vector support (mixed dense/sparse)
  • 21. Related Open Standards DBG / June 5, 2018 / © 2018 IBM Corporation
  • 22. PMML DBG / June 5, 2018 / © 2018 IBM Corporation Related Open Standards • Data Mining Group (DMG) • Model interchange format in XML with operators • Widely used and supported; open standard • Spark support lacking natively but 3rd party projects available: jpmml-sparkml • Comprehensive support for Spark ML components (perhaps surprisingly!) • Watch SPARK-11237 • Other exporters include scikit-learn, R, XGBoost and LightGBM • Shortcomings • Cannot represent arbitrary programs / analytic applications • Flexibility comes from custom plugins => lose benefits of standardization
  • 23. MLeap DBG / June 5, 2018 / © 2018 IBM Corporation Related Open Standards • Created by Combust.ML, a startup focused on ML model serving • Model interchange format in JSON / Protobuf • Components implemented in Scala code • Good performance • Initially focused on Spark ML. Offers almost complete support for Spark ML components • Recently added some sklearn; working on TensorFlow • Shortcomings • “Open” format, but not a “standard” • No concept of well-defined operators / functions • Effectively forces a tight coupling between versions of model producer / consumer • Must implement custom components in Scala • Impossible to statically verify serialized model
  • 24. Open Neural Network Exchange (ONNX) DBG / June 5, 2018 / © 2018 IBM Corporation • Championed by Facebook & Microsoft • Protobuf serialization format • Describes computation graph (including operators) • In this way the serialized graph is “self-describing” similarly to PFA • More focused on Deep Learning / tensor operations • Will be baked into PyTorch 1.0.0 / Caffe2 as the serialization & interchange format • Shortcomings • No or poor support for more “traditional” ML or language constructs (currently) • Tree-based models & ensembles • String / categorical processing • Control flow • Intermediate variables Related Open Standards
  • 25. Performance DBG / June 5, 2018 / © 2018 IBM Corporation
  • 26. Scoring Performance Comparison DBG / June 5, 2018 / © 2018 IBM Corporation • Comparing scoring performance of PFA with Spark and MLeap • PFA uses Hadrian reference implementation for JVM • Test dataset of ~80,000 records • String indexing of 47 categorical columns • Vector assembling the 47 categorical indices together with 27 numerical columns • Linear regression predictor • Note: Spark time is 1.9s / record (1901ms) - not shown on the chart Performance
  • 27. Conclusion DBG / June 5, 2018 / © 2018 IBM Corporation
  • 28. Summary DBG / June 5, 2018 / © 2018 IBM Corporation Conclusion • PFA provides an open standard for serialization and deployment of analytic workflows • True portability across languages, frameworks, runtimes and versions • Execution environment is independent of the producer (R, scikit-learn, Spark ML, weka, etc) • Solves a significant pain point for the deployment of Spark ML pipelines • Also benefits the wider ecosystem • e.g. many currently use PMML for exporting models from R, scikit-learn, XGBoost, LightGBM, etc. • However there are risks • PFA is still young and needs to gain adoption • Performance in production, at scale, is relatively untested • Tests indicate PFA reference engines need some work on robustness and performance • What about Deep Learning / comparison to ONNX? • Limitations of PFA • A standard can move slowly in terms of new features, fixes and enhancements
  • 29. Open-sourcing Aardpfark DBG / June 5, 2018 / © 2018 IBM Corporation Conclusion • Coverage • Almost all predictors (ML models) • Many feature transformers • Pipeline support (still needs work) • Equivalence tests Spark <-> PFA • Tests for core Scala DSL https://github.com/CODAIT/aardpfark • Need your help! • Finish implementing components • Improve pipeline support • Complete Scala DSL for PFA • Python support • Tests, docs, testing it out!
  • 30. Future directions DBG / June 5, 2018 / © 2018 IBM Corporation Conclusion • Initially focused on Spark ML pipelines • Later add support for scikit-learn pipelines, XGBoost, LightGBM, etc • (Support for many R models exist already in the Hadrian project) • Further performance testing in progress vs Spark & MLeap; also test vs PMML • More automated translation (Scala -> PFA, ASTs etc) • Propose improvements to PFA • Generic vector (tensor) support • Less cumbersome schema definitions • Performance improvements to scoring engine • PFA for Deep Learning? • Comparing to ONNX and other emerging standards • Better suited for the more general pre-processing steps of DL pipelines • Requires all the various DL-specific operators • Requires tensor schema and better tensor support built-in to the PFA spec • Should have GPU support
  • 31. Thank you! codait.org twitter.com/MLnick github.com/MLnick developer.ibm.com/code DBG / June 5, 2018 / © 2018 IBM Corporation FfDL Sign up for IBM Cloud and try Watson Studio! https://datascience.ibm.com/ MAX
  • 32. Links & References Portable Format for Analytics PMML Spark MLlib – Saving and Loading Pipelines Hadrian – Reference Implementation of PFA Engines for JVM, Python, R jpmml-sparkml MLeap Open Neural Network Exchange DBG / June 5, 2018 / © 2018 IBM Corporation
  • 33. Date, Time, Location & Duration Session title and Speaker Tue, June 5 | 11 AM 2010-2012, 30 mins Productionizing Spark ML Pipelines with the Portable Format for Analytics Nick Pentreath (IBM) Tue, June 5 | 2 PM 2018, 30 mins Making PySpark Amazing—From Faster UDFs to Dependency Management and Graphing! Holden Karau (Google) Bryan Cutler (IBM) Tue, June 5 | 2 PM Nook by 2001, 30 mins Making Data and AI Accessible for All Armand Ruiz Gabernet (IBM) Tue, June 5 | 2:40 PM 2002-2004, 30 mins Cognitive Database: An Apache Spark-Based AI-Enabled Relational Database System Rajesh Bordawekar (IBM T.J. Watson Research Center) Tue, June 5 | 3:20 PM 3016-3022, 30 mins Dynamic Priorities for Apache Spark Application’s Resource Allocations Michael Feiman (IBM Spectrum Computing) Shinnosuke Okada (IBM Canada Ltd.) Tue, June 5 | 3:20 PM 2001-2005, 30 mins Model Parallelism in Spark ML Cross-Validation Nick Pentreath (IBM) Bryan Cutler (IBM) Tue, June 5 | 3:20 PM 2007, 30 mins Serverless Machine Learning on Modern Hardware Using Apache Spark Patrick Stuedi (IBM) Tue, June 5 | 5:40 PM 2002-2004, 30 mins Create a Loyal Customer Base by Knowing Their Personality Using AI-Based Personality Recommendation Engine; Sourav Mazumder (IBM Analytics) Aradhna Tiwari (University of South Florida) Tue, June 5 | 5:40 PM 2007, 30 mins Transparent GPU Exploitation on Apache Spark Dr. Kazuaki Ishizaki (IBM) Madhusudanan Kandasamy (IBM) Tue, June 5 | 5:40 PM 2009-2011, 30 mins Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for Deep Neural Networks Yonggang Hu (IBM) Chao Xue (IBM) IBM Sessions at Spark+AI Summit 2018 (Tuesday, June 5) DBG / June 5, 2018 / © 2018 IBM Corporation
  • 34. Date, Time, Location & Duration Session title and Speaker Wed, June 6 | 12:50 PM Birds of a Feather: Apache Arrow in Spark and More Bryan Cutler (IBM) Li Jin (Two Sigma Investments, LP) Wed, June 6 | 2 PM 2002-2004, 30 mins Deep Learning for Recommender Systems Nick Pentreath (IBM) ) Wed, June 6 | 3:20 PM 2018, 30 mins Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer Frederick Reiss (IBM) Vijay Bommireddipalli (IBM Center for Open-Source Data & AI Technologies) IBM Sessions at Spark+AI Summit 2018 (Wednesday, June 6) DBG / June 5, 2018 / © 2018 IBM Corporation Meet us at IBM booth in the Expo area.
  • 35. DBG / June 5, 2018 / © 2018 IBM Corporation