SlideShare une entreprise Scribd logo
1  sur  24
Télécharger pour lire hors ligne
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Julien Simon
Principal Technical Evangelist, AI and Machine Learning
@julsimon
Using Apache Spark
with Amazon SageMaker
October 2018
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Agenda
• Apache Spark on AWS
• Amazon SageMaker
• Combining Spark and SageMaker
• Demos with the SageMaker SDK for Spark
• Getting started
Services covered: Amazon EMR, Amazon SageMaker
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Apache Spark on AWS
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Apache Spark
• Open-source, distributed processing system.
• In-memory caching and optimized execution for fast performance
(typically 100x faster than Hadoop).
• Batch processing, streaming analytics, machine learning, graph
databases and ad hoc queries.
• API for Java, Scala, Python, R, and SQL.
https://spark.apache.org/
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Apache Spark – DataFrame
• Distributed collection of data organized into named columns.
• Conceptually equivalent to a table in a relational database.
• Wide array of sources: structured files, databases.
• Wide array of formats: text, CSV, JSON, Avro, ORC, Parquet.
{"name": "Jeff"}
{"name": "Boaz", "age":72}
{"name": "Julien", "age":12}
df = spark.read.json("people.json")
df.show()
+----+-------+
| age| name |
+----+-------+
|null| Jeff |
| 72 | Boaz |
| 12 | Julien|
+----+-------+
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
MLlib – Machine learning library
• ML algorithms: classification, regression, clustering, collaborative filtering.
• Featurization: feature extraction, transformation, dimensionality reduction.
• Tools for constructing, evaluating and tuning ML pipelines
• Transformer – a transform function that maps a DataFrame into a new one
• Adding a column, changing the rows of a specific column, etc.
• Predicting the label based on the feature vector.
• Estimator – an algorithm that trains on data
• Consists of a fit() function that maps a DataFrame into a Model.
https://spark.apache.org/docs/latest/ml-guide.html
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Spark ML on Amazon EMR: spam detector
Adapted from https://github.com/databricks/learning-spark/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/MLlib.scala
ETL
Train
Predict
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Apache Spark on Amazon EMR
• Spark is natively supported in Amazon EMR.
• Amazon S3 connectivity using the EMR File System (EMRFS).
• Amazon Kinesis, Redshift and DynamoDB as data sources.
• Integration with the AWS Glue Data Catalog.
• Auto Scaling to add or remove instances from your cluster.
• Integration with the Amazon EC2 Spot market.
https://aws.amazon.com/emr/
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon SageMaker
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon SageMaker
Collect and prepare
training data
Choose and optimize
your ML algorithm
Set up and manage
environments for
training
Train and tune model
(trial and error)
Deploy model
in production
Scale and manage
the production
environment
Easily build, train, and deploy Machine Learning models
FREE
TIER
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon SageMaker
Pre-built
notebooks for
common
problems
K-Means Clustering
Principal Component Analysis
Neural Topic Modelling
Factorization Machines
Linear Learner
XGBoost
Latent Dirichlet Allocation
Image Classification
Seq2Seq,
And more!
ALGORITHMS
Apache MXNet, Chainer
TensorFlow, PyTorch
Caffe2, CNTK,
Torch
FRAMEWORKS
S e t u p a n d m a n a g e
e n v i r o n m e n t s f o r
t r a i n i n g
T r a i n a n d t u n e
m o d e l ( t r i a l a n d
e r r o r )
D e p l o y m o d e l
i n p r o d u c t i o n
S c a l e a n d m a n a g e t h e
p r o d u c t i o n e n v i r o n m e n t
Built-in, high-
performance
algorithms
Build
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon SageMaker
Pre-built
notebooks for
common
problems
Built-in, high-
performance
algorithms
One-click
training
Hyperparameter
optimization
Build Train
Deploy model
in production
Scale and manage
the production
environment
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon SageMaker
Fully managed
hosting with auto-
scaling
One-click
deployment
Pre-built
notebooks for
common
problems
Built-in, high-
performance
algorithms
One-click
training
Hyperparameter
optimization
Build Train Deploy
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon ECR
Model Training (on EC2)
Model Hosting (on EC2)
Trainingdata
Modelartifacts
Training code Helper code
Helper codeInference code
GroundTruth
Client application
Inference code
Training code
Inference requestInference response
Inference Endpoint
Amazon SageMaker
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Combining Spark and SageMaker
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Decouple ETL and Machine Learning
• Different workloads require different instance types.
• Say, M4 for ETL, P3 for training and C5 for prediction?
• If you need GPUs for training, running your EMR cluster on GPU
instances wouldn’t be cost-efficient.
• Scale them independently.
• Avoid oversizing your Spark cluster.
• Avoid time-consuming resizing operations on EMR.
• Run ETL once, train many models in parallel.
• SageMaker terminates training instances automatically.
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Run any ML algorithm in any language
• Spark MLlib is great, but you may need something else.
• SageMaker built-in algorithms (ML, DL, NLP).
• Deep Learning libraries, like TensorFlow or Apache MXNet.
• Your own custom code in any language.
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Deploy ML models in production
• Perform ML predictions without using Spark.
• Save the overhead of the Spark framework.
• Save loading your data in a DataFrame.
• Improve latency for small-batch predictions.
• It can be difficult to achieve low-latency predictions with Spark ML models.
• You can get real-time predictions with models hosted in SageMaker.
• You can use very powerful instances for prediction endpoints.
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Sample use cases for Spark+SageMaker
• Data preparation and feature engineering before training.
• Data transformation before batch prediction (model reuse).
• Data enrichment with predictions.
• Predict missing values instead of using median.
• Add new predicted features.
• Train on extremely large datasets with built-in algos.
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
SageMaker SDK for Spark
• Python and Scala SDK, for Apache Spark 2.1.1 and 2.2.
• Pre-installed on EMR 5.11 and later.
• Train, import, deploy and predict with SageMaker models directly from
your Spark application.
• Standalone,
• Integration in Spark MLlib pipelines.
• DataFrames in, DataFrames out:
automatic conversion to and from protobuf (crowd goes wild!)
https://github.com/aws/sagemaker-spark
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
SageMaker SDK for Spark – built-in algorithms
• High-level API for:
• Linear Learner
• Factorization Machines
• K-Means
• PCA
• LDA
• XGBoost
• The SageMakerEstimator object lets you use other built-in containers
as well any containerized code stored in Amazon ECR (just like the
regular SageMaker SDK).
https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html
Infinitely scalable algorithms:
no limit to the amount of data that they can process
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Demos
https://gitlab.com/juliensimon/dlnotebooks
1 – Classifying MNIST in Python with XGBoost (SageMaker)
2 – Clustering MNIST in Scala with K-Means (SageMaker)
3 – Clustering MNIST in Scala with a Pipeline: PCA (MLlib) + K-Means (SageMaker)
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Getting started
https://ml.aws
https://aws.amazon.com/sagemaker
https://aws.amazon.com/emr
https://github.com/aws/sagemaker-python-sdk
https://github.com/aws/sagemaker-spark
https://medium.com/@julsimon
https://gitlab.com/juliensimon/dlnotebooks
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Thank you!
Julien Simon
Principal Technical Evangelist, AI and Machine Learning
@julsimon

Contenu connexe

Tendances

Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudDatabricks
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Server monitoring using grafana and prometheus
Server monitoring using grafana and prometheusServer monitoring using grafana and prometheus
Server monitoring using grafana and prometheusCeline George
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Flink Forward
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoopclairvoyantllc
 
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...Databricks
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentFlink Forward
 
Feature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine LearningFeature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine LearningProvectus
 
Best Practices For Workflow
Best Practices For WorkflowBest Practices For Workflow
Best Practices For WorkflowTimothy Spann
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Databricks
 
(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best PracticesAmazon Web Services
 
[225]yarn 기반의 deep learning application cluster 구축 김제민
[225]yarn 기반의 deep learning application cluster 구축 김제민[225]yarn 기반의 deep learning application cluster 구축 김제민
[225]yarn 기반의 deep learning application cluster 구축 김제민NAVER D2
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsSpark Summit
 
Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...
Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...
Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...Mihai Criveti
 
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안SANG WON PARK
 
Compression Options in Hadoop - A Tale of Tradeoffs
Compression Options in Hadoop - A Tale of TradeoffsCompression Options in Hadoop - A Tale of Tradeoffs
Compression Options in Hadoop - A Tale of TradeoffsDataWorks Summit
 

Tendances (20)

Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the Cloud
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Server monitoring using grafana and prometheus
Server monitoring using grafana and prometheusServer monitoring using grafana and prometheus
Server monitoring using grafana and prometheus
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoop
 
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production Deployment
 
Feature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine LearningFeature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine Learning
 
Best Practices For Workflow
Best Practices For WorkflowBest Practices For Workflow
Best Practices For Workflow
 
Apache Airflow
Apache AirflowApache Airflow
Apache Airflow
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
 
(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices
 
[225]yarn 기반의 deep learning application cluster 구축 김제민
[225]yarn 기반의 deep learning application cluster 구축 김제민[225]yarn 기반의 deep learning application cluster 구축 김제민
[225]yarn 기반의 deep learning application cluster 구축 김제민
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Grafana
GrafanaGrafana
Grafana
 
Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...
Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...
Retrieval Augmented Generation in Practice: Scalable GenAI platforms with k8s...
 
Introduction to Amazon S3
Introduction to Amazon S3Introduction to Amazon S3
Introduction to Amazon S3
 
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
 
Compression Options in Hadoop - A Tale of Tradeoffs
Compression Options in Hadoop - A Tale of TradeoffsCompression Options in Hadoop - A Tale of Tradeoffs
Compression Options in Hadoop - A Tale of Tradeoffs
 

Similaire à Building Machine Learning models with Apache Spark and Amazon SageMaker | AWS Floor28

Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...Amazon Web Services
 
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...AWS Summits
 
Building machine learning inference pipelines at scale (March 2019)
Building machine learning inference pipelines at scale (March 2019)Building machine learning inference pipelines at scale (March 2019)
Building machine learning inference pipelines at scale (March 2019)Julien SIMON
 
Building Machine Learning Inference Pipelines at Scale (July 2019)
Building Machine Learning Inference Pipelines at Scale (July 2019)Building Machine Learning Inference Pipelines at Scale (July 2019)
Building Machine Learning Inference Pipelines at Scale (July 2019)Julien SIMON
 
Build, Deploy, and Serve Machine-Learning Models on Streaming Data Using Amaz...
Build, Deploy, and Serve Machine-Learning Models on Streaming Data Using Amaz...Build, Deploy, and Serve Machine-Learning Models on Streaming Data Using Amaz...
Build, Deploy, and Serve Machine-Learning Models on Streaming Data Using Amaz...Amazon Web Services
 
Build, Train, and Deploy ML Models with Amazon SageMaker (AIM410-R2) - AWS re...
Build, Train, and Deploy ML Models with Amazon SageMaker (AIM410-R2) - AWS re...Build, Train, and Deploy ML Models with Amazon SageMaker (AIM410-R2) - AWS re...
Build, Train, and Deploy ML Models with Amazon SageMaker (AIM410-R2) - AWS re...Amazon Web Services
 
AWS re:Invent 2018 - ENT321 - SageMaker Workshop
AWS re:Invent 2018 - ENT321 - SageMaker WorkshopAWS re:Invent 2018 - ENT321 - SageMaker Workshop
AWS re:Invent 2018 - ENT321 - SageMaker WorkshopJulien SIMON
 
Big Data Meets Machine Learning: Architecting Spark Environment for Data Scie...
Big Data Meets Machine Learning: Architecting Spark Environment for Data Scie...Big Data Meets Machine Learning: Architecting Spark Environment for Data Scie...
Big Data Meets Machine Learning: Architecting Spark Environment for Data Scie...Amazon Web Services
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueAmazon Web Services
 
Build, Train, and Deploy Machine Learning for the Enterprise with Amazon Sage...
Build, Train, and Deploy Machine Learning for the Enterprise with Amazon Sage...Build, Train, and Deploy Machine Learning for the Enterprise with Amazon Sage...
Build, Train, and Deploy Machine Learning for the Enterprise with Amazon Sage...Amazon Web Services
 
Serverless AI with Scikit-Learn (GPSWS405) - AWS re:Invent 2018
Serverless AI with Scikit-Learn (GPSWS405) - AWS re:Invent 2018Serverless AI with Scikit-Learn (GPSWS405) - AWS re:Invent 2018
Serverless AI with Scikit-Learn (GPSWS405) - AWS re:Invent 2018Amazon Web Services
 
End to End Model Development to Deployment using SageMaker
End to End Model Development to Deployment using SageMakerEnd to End Model Development to Deployment using SageMaker
End to End Model Development to Deployment using SageMakerAmazon Web Services
 
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018Amazon Web Services
 
BDA301 Working with Machine Learning in Amazon SageMaker: Algorithms, Models,...
BDA301 Working with Machine Learning in Amazon SageMaker: Algorithms, Models,...BDA301 Working with Machine Learning in Amazon SageMaker: Algorithms, Models,...
BDA301 Working with Machine Learning in Amazon SageMaker: Algorithms, Models,...Amazon Web Services
 
Machine Learning e Amazon SageMaker: Algoritmos, Modelos e Inferências - MCL...
Machine Learning e Amazon SageMaker: Algoritmos, Modelos e Inferências -  MCL...Machine Learning e Amazon SageMaker: Algoritmos, Modelos e Inferências -  MCL...
Machine Learning e Amazon SageMaker: Algoritmos, Modelos e Inferências - MCL...Amazon Web Services
 
Automate all your EMR related activities
Automate all your EMR related activitiesAutomate all your EMR related activities
Automate all your EMR related activitiesEitan Sela
 
AWS Machine Learning Week SF: End to End Model Development Using SageMaker
AWS Machine Learning Week SF: End to End Model Development Using SageMakerAWS Machine Learning Week SF: End to End Model Development Using SageMaker
AWS Machine Learning Week SF: End to End Model Development Using SageMakerAmazon Web Services
 
Enabling Deep Learning in IoT Applications with Apache MXNet
Enabling Deep Learning in IoT Applications with Apache MXNetEnabling Deep Learning in IoT Applications with Apache MXNet
Enabling Deep Learning in IoT Applications with Apache MXNetAmazon Web Services
 
Apache MXNet and Gluon
Apache MXNet and GluonApache MXNet and Gluon
Apache MXNet and GluonSoji Adeshina
 
Build, Train, & Deploy ML Models Using SageMaker
Build, Train, & Deploy ML Models Using SageMakerBuild, Train, & Deploy ML Models Using SageMaker
Build, Train, & Deploy ML Models Using SageMakerAmazon Web Services
 

Similaire à Building Machine Learning models with Apache Spark and Amazon SageMaker | AWS Floor28 (20)

Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
 
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
 
Building machine learning inference pipelines at scale (March 2019)
Building machine learning inference pipelines at scale (March 2019)Building machine learning inference pipelines at scale (March 2019)
Building machine learning inference pipelines at scale (March 2019)
 
Building Machine Learning Inference Pipelines at Scale (July 2019)
Building Machine Learning Inference Pipelines at Scale (July 2019)Building Machine Learning Inference Pipelines at Scale (July 2019)
Building Machine Learning Inference Pipelines at Scale (July 2019)
 
Build, Deploy, and Serve Machine-Learning Models on Streaming Data Using Amaz...
Build, Deploy, and Serve Machine-Learning Models on Streaming Data Using Amaz...Build, Deploy, and Serve Machine-Learning Models on Streaming Data Using Amaz...
Build, Deploy, and Serve Machine-Learning Models on Streaming Data Using Amaz...
 
Build, Train, and Deploy ML Models with Amazon SageMaker (AIM410-R2) - AWS re...
Build, Train, and Deploy ML Models with Amazon SageMaker (AIM410-R2) - AWS re...Build, Train, and Deploy ML Models with Amazon SageMaker (AIM410-R2) - AWS re...
Build, Train, and Deploy ML Models with Amazon SageMaker (AIM410-R2) - AWS re...
 
AWS re:Invent 2018 - ENT321 - SageMaker Workshop
AWS re:Invent 2018 - ENT321 - SageMaker WorkshopAWS re:Invent 2018 - ENT321 - SageMaker Workshop
AWS re:Invent 2018 - ENT321 - SageMaker Workshop
 
Big Data Meets Machine Learning: Architecting Spark Environment for Data Scie...
Big Data Meets Machine Learning: Architecting Spark Environment for Data Scie...Big Data Meets Machine Learning: Architecting Spark Environment for Data Scie...
Big Data Meets Machine Learning: Architecting Spark Environment for Data Scie...
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS Glue
 
Build, Train, and Deploy Machine Learning for the Enterprise with Amazon Sage...
Build, Train, and Deploy Machine Learning for the Enterprise with Amazon Sage...Build, Train, and Deploy Machine Learning for the Enterprise with Amazon Sage...
Build, Train, and Deploy Machine Learning for the Enterprise with Amazon Sage...
 
Serverless AI with Scikit-Learn (GPSWS405) - AWS re:Invent 2018
Serverless AI with Scikit-Learn (GPSWS405) - AWS re:Invent 2018Serverless AI with Scikit-Learn (GPSWS405) - AWS re:Invent 2018
Serverless AI with Scikit-Learn (GPSWS405) - AWS re:Invent 2018
 
End to End Model Development to Deployment using SageMaker
End to End Model Development to Deployment using SageMakerEnd to End Model Development to Deployment using SageMaker
End to End Model Development to Deployment using SageMaker
 
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
 
BDA301 Working with Machine Learning in Amazon SageMaker: Algorithms, Models,...
BDA301 Working with Machine Learning in Amazon SageMaker: Algorithms, Models,...BDA301 Working with Machine Learning in Amazon SageMaker: Algorithms, Models,...
BDA301 Working with Machine Learning in Amazon SageMaker: Algorithms, Models,...
 
Machine Learning e Amazon SageMaker: Algoritmos, Modelos e Inferências - MCL...
Machine Learning e Amazon SageMaker: Algoritmos, Modelos e Inferências -  MCL...Machine Learning e Amazon SageMaker: Algoritmos, Modelos e Inferências -  MCL...
Machine Learning e Amazon SageMaker: Algoritmos, Modelos e Inferências - MCL...
 
Automate all your EMR related activities
Automate all your EMR related activitiesAutomate all your EMR related activities
Automate all your EMR related activities
 
AWS Machine Learning Week SF: End to End Model Development Using SageMaker
AWS Machine Learning Week SF: End to End Model Development Using SageMakerAWS Machine Learning Week SF: End to End Model Development Using SageMaker
AWS Machine Learning Week SF: End to End Model Development Using SageMaker
 
Enabling Deep Learning in IoT Applications with Apache MXNet
Enabling Deep Learning in IoT Applications with Apache MXNetEnabling Deep Learning in IoT Applications with Apache MXNet
Enabling Deep Learning in IoT Applications with Apache MXNet
 
Apache MXNet and Gluon
Apache MXNet and GluonApache MXNet and Gluon
Apache MXNet and Gluon
 
Build, Train, & Deploy ML Models Using SageMaker
Build, Train, & Deploy ML Models Using SageMakerBuild, Train, & Deploy ML Models Using SageMaker
Build, Train, & Deploy ML Models Using SageMaker
 

Plus de Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Plus de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Building Machine Learning models with Apache Spark and Amazon SageMaker | AWS Floor28

  • 1. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Julien Simon Principal Technical Evangelist, AI and Machine Learning @julsimon Using Apache Spark with Amazon SageMaker October 2018
  • 2. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda • Apache Spark on AWS • Amazon SageMaker • Combining Spark and SageMaker • Demos with the SageMaker SDK for Spark • Getting started Services covered: Amazon EMR, Amazon SageMaker
  • 3. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Apache Spark on AWS
  • 4. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Apache Spark • Open-source, distributed processing system. • In-memory caching and optimized execution for fast performance (typically 100x faster than Hadoop). • Batch processing, streaming analytics, machine learning, graph databases and ad hoc queries. • API for Java, Scala, Python, R, and SQL. https://spark.apache.org/
  • 5. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Apache Spark – DataFrame • Distributed collection of data organized into named columns. • Conceptually equivalent to a table in a relational database. • Wide array of sources: structured files, databases. • Wide array of formats: text, CSV, JSON, Avro, ORC, Parquet. {"name": "Jeff"} {"name": "Boaz", "age":72} {"name": "Julien", "age":12} df = spark.read.json("people.json") df.show() +----+-------+ | age| name | +----+-------+ |null| Jeff | | 72 | Boaz | | 12 | Julien| +----+-------+
  • 6. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. MLlib – Machine learning library • ML algorithms: classification, regression, clustering, collaborative filtering. • Featurization: feature extraction, transformation, dimensionality reduction. • Tools for constructing, evaluating and tuning ML pipelines • Transformer – a transform function that maps a DataFrame into a new one • Adding a column, changing the rows of a specific column, etc. • Predicting the label based on the feature vector. • Estimator – an algorithm that trains on data • Consists of a fit() function that maps a DataFrame into a Model. https://spark.apache.org/docs/latest/ml-guide.html
  • 7. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Spark ML on Amazon EMR: spam detector Adapted from https://github.com/databricks/learning-spark/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/MLlib.scala ETL Train Predict
  • 8. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Apache Spark on Amazon EMR • Spark is natively supported in Amazon EMR. • Amazon S3 connectivity using the EMR File System (EMRFS). • Amazon Kinesis, Redshift and DynamoDB as data sources. • Integration with the AWS Glue Data Catalog. • Auto Scaling to add or remove instances from your cluster. • Integration with the Amazon EC2 Spot market. https://aws.amazon.com/emr/
  • 9. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon SageMaker
  • 10. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon SageMaker Collect and prepare training data Choose and optimize your ML algorithm Set up and manage environments for training Train and tune model (trial and error) Deploy model in production Scale and manage the production environment Easily build, train, and deploy Machine Learning models FREE TIER
  • 11. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon SageMaker Pre-built notebooks for common problems K-Means Clustering Principal Component Analysis Neural Topic Modelling Factorization Machines Linear Learner XGBoost Latent Dirichlet Allocation Image Classification Seq2Seq, And more! ALGORITHMS Apache MXNet, Chainer TensorFlow, PyTorch Caffe2, CNTK, Torch FRAMEWORKS S e t u p a n d m a n a g e e n v i r o n m e n t s f o r t r a i n i n g T r a i n a n d t u n e m o d e l ( t r i a l a n d e r r o r ) D e p l o y m o d e l i n p r o d u c t i o n S c a l e a n d m a n a g e t h e p r o d u c t i o n e n v i r o n m e n t Built-in, high- performance algorithms Build
  • 12. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon SageMaker Pre-built notebooks for common problems Built-in, high- performance algorithms One-click training Hyperparameter optimization Build Train Deploy model in production Scale and manage the production environment
  • 13. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon SageMaker Fully managed hosting with auto- scaling One-click deployment Pre-built notebooks for common problems Built-in, high- performance algorithms One-click training Hyperparameter optimization Build Train Deploy
  • 14. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon ECR Model Training (on EC2) Model Hosting (on EC2) Trainingdata Modelartifacts Training code Helper code Helper codeInference code GroundTruth Client application Inference code Training code Inference requestInference response Inference Endpoint Amazon SageMaker
  • 15. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Combining Spark and SageMaker
  • 16. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Decouple ETL and Machine Learning • Different workloads require different instance types. • Say, M4 for ETL, P3 for training and C5 for prediction? • If you need GPUs for training, running your EMR cluster on GPU instances wouldn’t be cost-efficient. • Scale them independently. • Avoid oversizing your Spark cluster. • Avoid time-consuming resizing operations on EMR. • Run ETL once, train many models in parallel. • SageMaker terminates training instances automatically.
  • 17. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Run any ML algorithm in any language • Spark MLlib is great, but you may need something else. • SageMaker built-in algorithms (ML, DL, NLP). • Deep Learning libraries, like TensorFlow or Apache MXNet. • Your own custom code in any language.
  • 18. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Deploy ML models in production • Perform ML predictions without using Spark. • Save the overhead of the Spark framework. • Save loading your data in a DataFrame. • Improve latency for small-batch predictions. • It can be difficult to achieve low-latency predictions with Spark ML models. • You can get real-time predictions with models hosted in SageMaker. • You can use very powerful instances for prediction endpoints.
  • 19. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Sample use cases for Spark+SageMaker • Data preparation and feature engineering before training. • Data transformation before batch prediction (model reuse). • Data enrichment with predictions. • Predict missing values instead of using median. • Add new predicted features. • Train on extremely large datasets with built-in algos.
  • 20. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. SageMaker SDK for Spark • Python and Scala SDK, for Apache Spark 2.1.1 and 2.2. • Pre-installed on EMR 5.11 and later. • Train, import, deploy and predict with SageMaker models directly from your Spark application. • Standalone, • Integration in Spark MLlib pipelines. • DataFrames in, DataFrames out: automatic conversion to and from protobuf (crowd goes wild!) https://github.com/aws/sagemaker-spark
  • 21. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. SageMaker SDK for Spark – built-in algorithms • High-level API for: • Linear Learner • Factorization Machines • K-Means • PCA • LDA • XGBoost • The SageMakerEstimator object lets you use other built-in containers as well any containerized code stored in Amazon ECR (just like the regular SageMaker SDK). https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html Infinitely scalable algorithms: no limit to the amount of data that they can process
  • 22. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Demos https://gitlab.com/juliensimon/dlnotebooks 1 – Classifying MNIST in Python with XGBoost (SageMaker) 2 – Clustering MNIST in Scala with K-Means (SageMaker) 3 – Clustering MNIST in Scala with a Pipeline: PCA (MLlib) + K-Means (SageMaker)
  • 23. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Getting started https://ml.aws https://aws.amazon.com/sagemaker https://aws.amazon.com/emr https://github.com/aws/sagemaker-python-sdk https://github.com/aws/sagemaker-spark https://medium.com/@julsimon https://gitlab.com/juliensimon/dlnotebooks
  • 24. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Thank you! Julien Simon Principal Technical Evangelist, AI and Machine Learning @julsimon