Deep Learning on Apache® Spark™: Workflows and Best Practices
Tim Hunter (Software Engineer)
Jules S. Damji (Spark Community Evangelist)
May 4, 2017
Agenda
• Logistics
• Databricks Overview
• Deep Learning on Apache® Spark™: Workflows and Best Practices
• Q & A
Logistics
• We can’t hear you…
• Recording will be available...
• Slides and Notebooks will be available...
• Queue up Questions ….
• Orange Button for Tech Support difficulties...
VISION
Empower anyone to innovate faster with big data.
WHO WE ARE
Founded by the creators of Apache Spark.
Contributes 75% of the open source code, 10x more than any other company.
PRODUCT
A data processing platform for data scientists, data engineers, and data analysts that simplifies data integration, real-time experimentation, machine learning, and deployment of production pipelines.
A New Paradigm
FIRST GENERATION: Data warehouses
ETL process is rigid, scaling out is expensive, limited to SQL
+
SECOND GENERATION: Hadoop + data lake
Hard to centralize data and extract value with disparate tools
THE BEST OF BOTH WORLDS: Virtual analytics
• Holistically analyze data from data warehouses, data lakes, and other data stores
• Utilize a single engine for batch, ML, streaming & real-time queries
• Enable enterprise-wide collaboration
VIRTUAL ANALYTICS PLATFORM
• CLUSTER TUNING & MANAGEMENT
• INTERACTIVE WORKSPACE
• PRODUCTION PIPELINE AUTOMATION
• OPTIMIZED DATA ACCESS
• DATABRICKS ENTERPRISE SECURITY
YOUR TEAMS
• Data Science
• Data Engineering
• BI Analysts
• Many others…
YOUR DATA
• Cloud Storage
• Data Warehouses
• Data Lake
About Me
• Tim Hunter
• Software engineer @ Databricks
• Ph.D. from UC Berkeley in Machine Learning
• Very early Spark user
• Contributor to MLlib
• Author of TensorFrames and GraphFrames
Deep Learning and Apache Spark
Deep Learning frameworks w/ Spark bindings
• Caffe (CaffeOnSpark)
• Keras (Elephas)
• MXNet
• Paddle
• TensorFlow (TensorFlowOnSpark, TensorFrames)
Extensions to Spark for specialized hardware
• Blaze (UCLA & Falcon Computing Solutions)
• IBM Conductor with Spark
Native Spark
• BigDL
• DeepDist
• DeepLearning4J
• MLlib
• SparkCL
• SparkNet
Deep Learning and Apache Spark
2016: the year of emerging solutions for Spark + Deep Learning
No consensus
• Many approaches for libraries: integrate existing ones with Spark, build on
top of Spark, modify Spark itself
• Official Spark MLlib support is limited (perceptron-like networks)
One Framework to Rule Them All?
Should we look for The One Deep Learning Framework?
Databricks’ perspective
• Databricks: hosted Spark platform on public cloud
• GPUs for compute-intensive workloads
• Customers use many Deep Learning frameworks: TensorFlow, MXNet, BigDL,
Theano, Caffe, and more
This talk
• Lessons learned from supporting many Deep Learning frameworks
• Multiple ways to integrate Deep Learning & Spark
• Best practices for these integrations
Outline
• Deep Learning in data pipelines
• Recurring patterns in Spark + Deep Learning integrations
• Developer tips
• Monitoring
ML is a small part of data pipelines.
Hidden Technical Debt in Machine Learning Systems
Sculley et al., NIPS 2015
DL in a data pipeline: Training
Data collection → ETL → Featurization → Deep Learning → Validation → Export, Serving
• IO intensive, then compute intensive, then IO intensive
• Large cluster, high memory/CPU ratio
• Small cluster, low memory/CPU ratio
DL in a data pipeline: Transformation
Specialized data transforms: feature extraction & prediction
[Figure: input images mapped to output labels: cat, dog, dog. Image credit: Saulius Garalevicius, CC BY-SA 3.0]
Outline
• Deep Learning in data pipelines
• Recurring patterns in Spark + Deep Learning integrations
• Developer tips
• Monitoring
Recurring patterns
Spark as a scheduler
• Data-parallel tasks
• Data stored outside Spark
Embedded Deep Learning transforms
• Data-parallel tasks
• Data stored in DataFrames/RDDs
Cooperative frameworks
• Multiple passes over data
• Heavy and/or specialized communication
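The "Spark as a scheduler" pattern above can be sketched roughly as follows. This is an illustration under assumptions, not the speakers' code: `train_one`, `make_tasks`, the `s3://bucket/...` paths and the placeholder score are all hypothetical, and `run_on_spark` assumes a working PySpark installation.

```python
def train_one(task):
    """Hypothetical stand-in for one independent DL job.

    `task` is a (data_path, learning_rate) pair. A real job would read the
    data from external storage (HDFS/S3/...) and train a model on it.
    """
    data_path, learning_rate = task
    # Placeholder "score" so the sketch runs without any DL framework.
    score = round(1.0 / (1.0 + learning_rate), 4)
    return {"path": data_path, "lr": learning_rate, "score": score}


def make_tasks(paths, learning_rates):
    """Build the grid of independent tasks that Spark will schedule."""
    return [(p, lr) for p in paths for lr in learning_rates]


def run_on_spark(tasks):
    """One task per partition: Spark schedules the work but holds no data."""
    from pyspark import SparkContext  # lazy import: requires PySpark
    sc = SparkContext.getOrCreate()
    return sc.parallelize(tasks, numSlices=len(tasks)).map(train_one).collect()


# Hypothetical paths and learning rates, for illustration only.
tasks = make_tasks(["s3://bucket/shard-0", "s3://bucket/shard-1"], [0.1, 0.01])
local_results = [train_one(t) for t in tasks]  # same function, no cluster
```

Keeping `train_one` free of Spark imports means the same function can be checked locally before it is shipped to executors.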
Streaming data through DL
Primary storage choices:
• Cold layer (HDFS/S3/etc.)
• Local storage: files, Spark’s on-disk persistence layer
• In memory: Spark RDDs or Spark DataFrames
Find out if you are I/O constrained or processor-constrained
• How big is your dataset? MNIST or ImageNet?
If using PySpark:
• All frameworks are heavily optimized for disk I/O
• Use Spark’s broadcast for small datasets that fit in memory
• Reading files is fast: use local files when the dataset does not fit in memory
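A minimal sketch of the broadcast tip above, assuming a small lookup table that fits in memory. The names `score_partition`, `run_on_spark`, `rows` and `weights` are hypothetical; only `SparkContext.broadcast`, `parallelize` and `mapPartitions` are PySpark calls, and they are used only inside `run_on_spark`.

```python
def score_partition(rows, lookup):
    """Score every row of one partition against a small in-memory table.

    `lookup` is a plain dict here; on Spark it would be `bc.value`.
    Per-partition setup (e.g. loading a model) would go before the loop.
    """
    for key, value in rows:
        yield key, value * lookup.get(key, 0.0)


def run_on_spark(rows, lookup):
    from pyspark import SparkContext  # lazy import: requires PySpark
    sc = SparkContext.getOrCreate()
    # The broadcast value is sent to each executor once,
    # rather than being serialized with every task.
    bc = sc.broadcast(lookup)
    rdd = sc.parallelize(rows)
    return rdd.mapPartitions(lambda it: score_partition(it, bc.value)).collect()


rows = [("cat", 2.0), ("dog", 3.0), ("bird", 1.0)]
weights = {"cat": 0.5, "dog": 2.0}  # hypothetical small dataset
local = list(score_partition(iter(rows), weights))
```

When the dataset is too large to broadcast, the same per-partition function could read from local files instead of `bc.value`.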
Cooperative frameworks
• Use Spark for data input
• Examples:
• IBM GPU efforts
• Skymind’s DeepLearning4J
• DistML and other Parameter Server efforts
[Figure: an input RDD (partitions 1 to n) passes through a black box that produces an output RDD (partitions 1 to m)]
Cooperative frameworks
• Bypass Spark for asynchronous / specific communication patterns across machines
• Lose the benefit of RDDs and DataFrames, and reproducibility/determinism
• But these guarantees are not required anyway when doing deep learning (stochastic gradient)
• “reproducibility is worth a factor of 2” (Leon Bottou, quoted by John Langford)
Outline
• Deep Learning in data pipelines
• Recurring patterns in Spark + Deep Learning integrations
• Developer tips
• Monitoring
The GPU software stack
• Deep Learning commonly used with GPUs
• A lot of work on Spark dependencies:
• Few dependencies on local machine when compiling Spark
• The build process works well in a large number of configurations (just Scala + Maven)
• GPUs present challenges: CUDA, support libraries, drivers, etc.
• Deep software stack, requires careful construction (hardware + drivers + CUDA + libraries)
• All these are expected by the user
• Turnkey stacks just starting to appear
• Provide a Docker image with all the GPU SDK
• Pre-install GPU drivers on the instance
The GPU software stack
[Figure: layers of the stack, from clients down to hardware]
• Python / JVM clients
• Deep learning libraries (TensorFlow, etc.), JCUDA
• cuBLAS, cuDNN
• CUDA
• NV kernel driver (userspace interface)
• Linux kernel, NV kernel driver
• GPU hardware
• Container: nvidia-docker, lxc, etc.
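Because every layer of this stack has to be present, a quick probe can help when debugging a node. The sketch below is an assumption-laden illustration, not part of the talk: `probe_gpu_stack` is a hypothetical helper, and `ctypes.util.find_library` only searches the standard linker locations, so libraries installed elsewhere (for example under a custom CUDA prefix) may be reported as missing.

```python
import ctypes.util
import shutil


def probe_gpu_stack():
    """Report which layers of the GPU stack are visible from this process."""
    return {
        # nvidia-smi ships with the NVIDIA driver.
        "nvidia-smi": shutil.which("nvidia-smi") is not None,
        # Driver userspace library (libcuda).
        "libcuda": ctypes.util.find_library("cuda") is not None,
        # CUDA runtime and the libraries DL frameworks link against.
        "cudart": ctypes.util.find_library("cudart") is not None,
        "cublas": ctypes.util.find_library("cublas") is not None,
        "cudnn": ctypes.util.find_library("cudnn") is not None,
    }


report = probe_gpu_stack()
for layer, found in report.items():
    print(f"{layer:10s} {'found' if found else 'missing'}")
```

Running the probe inside a Spark task rather than on the driver shows what the executors actually see, which can differ when containers are involved.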
Using GPUs through PySpark
• Popular choice for many independent tasks
• Many DL packages have Python interfaces: TensorFlow,
Theano, Caffe, MXNet, etc.
• Lifetime of Python packages: the process
• Requires some configuration tweaks in Spark
PySpark recommendation
• spark.executor.cores = 1
• Gives the DL framework full access to all the resources
• Important for frameworks that optimize processor pipelines
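One way to apply the recommendation above is through `SparkConf`. This is a sketch: `dl_spark_conf`, `build_spark_conf` and the application name are hypothetical, and only `spark.executor.cores = 1` comes from the slide.

```python
def dl_spark_conf(extra=None):
    """Return Spark settings so that each executor runs one task at a time."""
    conf = {"spark.executor.cores": "1"}  # the setting recommended above
    conf.update(extra or {})
    return conf


def build_spark_conf(settings):
    """Turn the plain dict into a SparkConf (requires PySpark)."""
    from pyspark import SparkConf  # lazy import
    return SparkConf().setAll(list(settings.items()))


# "dl-on-spark" is a hypothetical application name.
settings = dl_spark_conf({"spark.app.name": "dl-on-spark"})
```

The same setting can also be passed on the command line with `spark-submit --conf spark.executor.cores=1`.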
Outline
• Deep Learning in data pipelines
• Recurring patterns in Spark + Deep Learning integrations
• Developer tips
• Monitoring
Monitoring
?
Monitoring
• How do you monitor the progress of your tasks?
• It depends on the granularity
• Around tasks
• Inside (long-running) tasks
Monitoring: Accumulators
• Good to check throughput or failure rate
• Works for Scala
• Limited use for Python (for now, SPARK-2868)
• No “real-time” update
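from pyspark import SparkContext

sc = SparkContext.getOrCreate()
batchesAcc = sc.accumulator(0)  # counts processed batches

def processBatch(i):
    global batchesAcc
    batchesAcc += 1
    # Process image batch here

images = sc.parallelize(...)  # image batches, elided on the original slide
images.map(processBatch).collect()
print(batchesAcc.value)  # the value is readable on the driver only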
Monitoring: external system
• Plugs into an external system
• Existing solutions: Grafana, Graphite, Prometheus, etc.
• Most flexible, but more complex to deploy
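As an illustration of plugging into an external system, the sketch below formats and sends a metric using Graphite's plaintext protocol (`<path> <value> <timestamp>` followed by a newline). The metric name, host and port are assumptions about a particular deployment, and `graphite_line` and `send_metric` are hypothetical helpers.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one metric in Graphite's plaintext protocol."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send_metric(path, value, host="localhost", port=2003, timeout=2.0):
    """Send one metric; host and port are assumptions about your setup."""
    payload = graphite_line(path, value).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(payload)


# Hypothetical metric name, with a fixed timestamp for reproducibility.
line = graphite_line("dl.train.batches_done", 128, timestamp=1493856000)
```

Calling `send_metric` every few batches from inside a long-running task gives progress updates while the task is still running, which accumulators do not provide.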
Conclusion
• Distributed deep learning: exciting and fast-moving space
• Most insights are specific to a task, a dataset and an algorithm:
nothing replaces experiments
• Get started with data-parallel jobs
• Move to cooperative frameworks only when your data are too large.
Challenges to address
For Spark developers
• Monitoring long-running tasks
• Presenting and introspecting intermediate results
For DL developers
• What boundary to put between the algorithm and Spark?
• How to integrate with Spark at a low level?
Resources
Recent blog posts — http://databricks.com/blog
• TensorFrames
• GPU acceleration
• Getting started with Deep Learning
• Intel’s BigDL
Docs for Deep Learning on Databricks — http://docs.databricks.com
• Getting started
• Spark integration
SPARK SUMMIT 2017
DATA SCIENCE AND ENGINEERING AT SCALE
JUNE 5 – 7 | MOSCONE CENTER | SAN FRANCISCO
ORGANIZED BY spark-summit.org/2017
Thank You!
Questions?
Happy Sparking & Deep Learning!

Contenu connexe

Tendances

What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareDatabricks
 
Deep Learning with Apache Spark and GPUs with Pierce Spitler
Deep Learning with Apache Spark and GPUs with Pierce SpitlerDeep Learning with Apache Spark and GPUs with Pierce Spitler
Deep Learning with Apache Spark and GPUs with Pierce SpitlerDatabricks
 
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Databricks
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityJen Aman
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Databricks
 
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Databricks
 
Tuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkTuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkDatabricks
 
Apache Spark Performance: Past, Future and Present
Apache Spark Performance: Past, Future and PresentApache Spark Performance: Past, Future and Present
Apache Spark Performance: Past, Future and PresentDatabricks
 
Resource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache SparkResource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache SparkDatabricks
 
Stories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi TorresStories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi TorresSpark Summit
 
Spark r under the hood with Hossein Falaki
Spark r under the hood with Hossein FalakiSpark r under the hood with Hossein Falaki
Spark r under the hood with Hossein FalakiDatabricks
 
Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Databricks
 
Integrating Deep Learning Libraries with Apache Spark
Integrating Deep Learning Libraries with Apache SparkIntegrating Deep Learning Libraries with Apache Spark
Integrating Deep Learning Libraries with Apache SparkDatabricks
 
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkRunning Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkDatabricks
 
Using Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangUsing Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangSpark Summit
 
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Spark Summit
 
What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3Databricks
 
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Databricks
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...Databricks
 

Tendances (20)

What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You Care
 
Deep Learning with Apache Spark and GPUs with Pierce Spitler
Deep Learning with Apache Spark and GPUs with Pierce SpitlerDeep Learning with Apache Spark and GPUs with Pierce Spitler
Deep Learning with Apache Spark and GPUs with Pierce Spitler
 
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
 
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
 
Tuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkTuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache Spark
 
Apache Spark Performance: Past, Future and Present
Apache Spark Performance: Past, Future and PresentApache Spark Performance: Past, Future and Present
Apache Spark Performance: Past, Future and Present
 
Resource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache SparkResource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache Spark
 
Stories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi TorresStories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi Torres
 
Spark r under the hood with Hossein Falaki
Spark r under the hood with Hossein FalakiSpark r under the hood with Hossein Falaki
Spark r under the hood with Hossein Falaki
 
Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0
 
Integrating Deep Learning Libraries with Apache Spark
Integrating Deep Learning Libraries with Apache SparkIntegrating Deep Learning Libraries with Apache Spark
Integrating Deep Learning Libraries with Apache Spark
 
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkRunning Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
 
Using Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangUsing Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene Pang
 
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
 
What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3
 
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
 

Similaire à Deep Learning on Apache® Spark™: Workflows and Best Practices

Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Wee Hyong Tok
 
Deep learning and Apache Spark
Deep learning and Apache SparkDeep learning and Apache Spark
Deep learning and Apache SparkQuantUniversity
 
Infrastructure for Deep Learning in Apache Spark
Infrastructure for Deep Learning in Apache SparkInfrastructure for Deep Learning in Apache Spark
Infrastructure for Deep Learning in Apache SparkDatabricks
 
Bringing Deep Learning into production
Bringing Deep Learning into production Bringing Deep Learning into production
Bringing Deep Learning into production Paolo Platter
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlySarah Guido
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopAmanda Casari
 
From Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterFrom Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterDatabricks
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...Simon Ambridge
 
From Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsFrom Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsDatabricks
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkDataWorks Summit/Hadoop Summit
 
Taboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache SparkTaboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache Sparktsliwowicz
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkDatabricks
 
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Jason Dai
 
Apache Spark sql
Apache Spark sqlApache Spark sql
Apache Spark sqlaftab alam
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014mahchiev
 
Big Data Retrospective - STL Big Data IDEA Jan 2019
Big Data Retrospective - STL Big Data IDEA Jan 2019Big Data Retrospective - STL Big Data IDEA Jan 2019
Big Data Retrospective - STL Big Data IDEA Jan 2019Adam Doyle
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsDr. Mirko Kämpf
 

Similaire à Deep Learning on Apache® Spark™: Workflows and Best Practices (20)

Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425
 
Deep learning and Apache Spark
Deep learning and Apache SparkDeep learning and Apache Spark
Deep learning and Apache Spark
 
Infrastructure for Deep Learning in Apache Spark
Infrastructure for Deep Learning in Apache SparkInfrastructure for Deep Learning in Apache Spark
Infrastructure for Deep Learning in Apache Spark
 
Bringing Deep Learning into production
Bringing Deep Learning into production Bringing Deep Learning into production
Bringing Deep Learning into production
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
 
From Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterFrom Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim Hunter
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...
 
From Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsFrom Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data Applications
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache Spark
 
Apache Spark in Industry
Apache Spark in IndustryApache Spark in Industry
Apache Spark in Industry
 
Taboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache SparkTaboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache Spark
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache Spark
 
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
 
Big Data training
Big Data trainingBig Data training
Big Data training
 
Apache Spark sql
Apache Spark sqlApache Spark sql
Apache Spark sql
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014
 
Big Data Retrospective - STL Big Data IDEA Jan 2019
Big Data Retrospective - STL Big Data IDEA Jan 2019Big Data Retrospective - STL Big Data IDEA Jan 2019
Big Data Retrospective - STL Big Data IDEA Jan 2019
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific Applications
 

Plus de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Deep Learning on Apache® Spark™: Workflows and Best Practices

  • 1. Deep Learning on Apache® Spark™: Workflows and Best Practices Tim Hunter (Software Engineer) Jules S. Damji (Spark Community Evangelist) May 4, 2017
  • 2. Agenda • Logistics • Databricks Overview • Deep Learning on Apache® Spark™: Workflows and Best Practices • Q & A
  • 3. Logistics • We can’t hear you… • Recording will be available... • Slides and Notebooks will be available... • Queue up Questions …. • Orange Button for Tech Support difficulties...
  • 4. VISION: Empower anyone to innovate faster with big data. WHO WE ARE: Founded by the creators of Apache Spark. Contributes 75% of the open source code, 10x more than any other company. PRODUCT: A data processing platform for data scientists, data engineers, and data analysts that simplifies data integration, real-time experimentation, machine learning and deployment of production pipelines.
  • 5. A New Paradigm • FIRST GENERATION: Data warehouses. ETL process is rigid, scaling out is expensive, limited to SQL • SECOND GENERATION: Hadoop + data lake. Hard to centralize data and extract value with disparate tools • THE BEST OF BOTH WORLDS: Virtual analytics • Holistically analyze data from data warehouses, data lakes, and other data stores • Utilize a single engine for batch, ML, streaming & real-time queries • Enable enterprise-wide collaboration
  • 6. VIRTUAL ANALYTICS PLATFORM • CLUSTER TUNING & MANAGEMENT • INTERACTIVE WORKSPACE • PRODUCTION PIPELINE AUTOMATION • OPTIMIZED DATA ACCESS • DATABRICKS ENTERPRISE SECURITY • YOUR TEAMS: Data Science, Data Engineering, BI Analysts, many others… • YOUR DATA: Cloud Storage, Data Warehouses, Data Lake
  • 7. Deep Learning on Apache® Spark™: Workflows and Best Practices Tim Hunter (Software Engineer) May 4, 2017
  • 8. About Me • Tim Hunter • Software engineer @ Databricks • Ph.D. from UC Berkeley in Machine Learning • Very early Spark user • Contributor to MLlib • Author of TensorFrames and GraphFrames
  • 9. Deep Learning and Apache Spark Deep Learning frameworks w/ Spark bindings • Caffe (CaffeOnSpark) • Keras (Elephas) • MXNet • Paddle • TensorFlow (TensorFlowOnSpark, TensorFrames) Extensions to Spark for specialized hardware • Blaze (UCLA & Falcon Computing Solutions) • IBM Conductor with Spark Native Spark • BigDL • DeepDist • DeepLearning4J • MLlib • SparkCL • SparkNet
  • 10. Deep Learning and Apache Spark 2016: the year of emerging solutions for Spark + Deep Learning No consensus • Many approaches for libraries: integrate existing ones with Spark, build on top of Spark, modify Spark itself • Official Spark MLlib support is limited (perceptron-like networks)
  • 11. One Framework to Rule Them All? Should we look for The One Deep Learning Framework?
  • 12. Databricks’ perspective • Databricks: hosted Spark platform on public cloud • GPUs for compute-intensive workloads • Customers use many Deep Learning frameworks: TensorFlow, MXNet, BigDL, Theano, Caffe, and more This talk • Lessons learned from supporting many Deep Learning frameworks • Multiple ways to integrate Deep Learning & Spark • Best practices for these integrations
  • 13. Outline • Deep Learning in data pipelines • Recurring patterns in Spark + Deep Learning integrations • Developer tips • Monitoring
  • 14. Outline • Deep Learning in data pipelines • Recurring patterns in Spark + Deep Learning integrations • Developer tips • Monitoring
  • 15. ML is a small part of data pipelines. “Hidden Technical Debt in Machine Learning Systems”, Sculley et al., NIPS 2015
  • 16. DL in a data pipeline: Training • Stages: Data collection → ETL → Featurization → Deep Learning → Validation → Export, Serving • Diagram annotations: compute intensive; IO intensive (shown twice); large cluster, high memory/CPU ratio; small cluster, low memory/CPU ratio
  • 17. DL in a data pipeline: Transformation Specialized data transforms: feature extraction & prediction • Input → Output: cat, dog, dog (image credit: Saulius Garalevicius, CC BY-SA 3.0)
  • 18. Outline • Deep Learning in data pipelines • Recurring patterns in Spark + Deep Learning integrations • Developer tips • Monitoring
  • 19. Recurring patterns Spark as a scheduler • Data-parallel tasks • Data stored outside Spark Embedded Deep Learning transforms • Data-parallel tasks • Data stored in DataFrames/RDDs Cooperative frameworks • Multiple passes over data • Heavy and/or specialized communication
  • 20. Streaming data through DL Primary storage choices: • Cold layer (HDFS/S3/etc.) • Local storage: files, Spark’s on-disk persistence layer • In memory: Spark RDDs or Spark DataFrames Find out if you are I/O constrained or processor-constrained • How big is your dataset? MNIST or ImageNet? If using PySpark: • All frameworks heavily optimized for disk I/O • Use Spark’s broadcast for small datasets that fit in memory • Reading files is fast: use local files when it does not fit
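One rough way to check whether a job is I/O constrained or processor-constrained is to time the read phase and the compute phase separately on a sample of the data. The sketch below is only an illustration in plain Python: `profile_pipeline`, `make_sample_files` and the `process` callback are hypothetical names, and the summing lambda merely stands in for a real model call.

```python
import os
import tempfile
import time


def profile_pipeline(paths, process):
    """Time the read phase and the compute phase separately."""
    read_s = 0.0
    compute_s = 0.0
    for path in paths:
        t0 = time.perf_counter()
        with open(path, "rb") as f:
            payload = f.read()      # I/O phase
        t1 = time.perf_counter()
        process(payload)            # compute phase (stand-in for the DL model)
        t2 = time.perf_counter()
        read_s += t1 - t0
        compute_s += t2 - t1
    bound = "io" if read_s > compute_s else "compute"
    return {"read_s": read_s, "compute_s": compute_s, "bound": bound}


def make_sample_files(directory, count=4, size=1 << 16):
    """Write a few random files so the sketch runs anywhere."""
    paths = []
    for i in range(count):
        path = os.path.join(directory, f"batch_{i}.bin")
        with open(path, "wb") as f:
            f.write(os.urandom(size))
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    sample_paths = make_sample_files(tmp)
    report = profile_pipeline(sample_paths, process=lambda data: sum(data))
```

Freshly written files are usually served from the operating system's page cache, so the read time measured here is optimistic; on a real dataset the measurement should run against cold storage.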
  • 21. Cooperative frameworks • Use Spark for data input • Examples: • IBM GPU efforts • Skymind’s DeepLearning4J • DistML and other Parameter Server efforts • Diagram: RDD (partitions 1 to n) → black box → RDD (partitions 1 to m)
  • 22. Cooperative frameworks • Bypass Spark for asynchronous / specific communication patterns across machines • Lose benefit of RDDs and DataFrames and reproducibility/determinism • But these guarantees are not requested anyway when doing deep learning (stochastic gradient) • “reproducibility is worth a factor of 2” (Leon Bottou, quoted by John Langford)
  • 23. Outline • Deep Learning in data pipelines • Recurring patterns in Spark + Deep Learning integrations • Developer tips • Monitoring
  • 24. The GPU software stack • Deep Learning commonly used with GPUs • A lot of work on Spark dependencies: • Few dependencies on local machine when compiling Spark • The build process works well in a large number of configurations (just Scala + Maven) • GPUs present challenges: CUDA, support libraries, drivers, etc. • Deep software stack, requires careful construction (hardware + drivers + CUDA + libraries) • All these are expected by the user • Turnkey stacks just starting to appear
  • 25. The GPU software stack • Provide a Docker image with all the GPU SDK • Pre-install GPU drivers on the instance • Container: nvidia-docker, lxc, etc. • Diagram layers: GPU hardware, Linux kernel, NV kernel driver, NV kernel driver (userspace interface), CUDA, cuBLAS, cuDNN, deep learning libraries (TensorFlow, etc.), JCUDA, Python / JVM clients
  • 26. Using GPUs through PySpark • Popular choice for many independent tasks • Many DL packages have Python interfaces: TensorFlow, Theano, Caffe, MXNet, etc. • Lifetime for python packages: the process • Requires some configuration tweaks in Spark
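One reading of "Lifetime for python packages: the process" is that anything loaded at module level lives as long as the Python worker process, so an expensive model can be loaded once and reused across tasks. The sketch below illustrates that idea in plain Python; `load_model`, `get_model` and `predict_partition` are hypothetical names, and the doubling lambda stands in for a real framework model.

```python
_MODEL = None       # one cached model per Python worker process
LOAD_COUNT = 0      # counts loads, for illustration only


def load_model():
    """Hypothetical stand-in for an expensive framework call that loads weights."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return lambda x: 2 * x


def get_model():
    """Load the model on first use, then reuse it for the life of the process."""
    global _MODEL
    if _MODEL is None:
        _MODEL = load_model()
    return _MODEL


def predict_partition(rows):
    """Iterator-in, iterator-out function: fetch the model once per partition."""
    model = get_model()
    for row in rows:
        yield model(row)


first = list(predict_partition([1, 2, 3]))
second = list(predict_partition([4, 5]))
```

With PySpark, a function of this shape would typically be passed to `rdd.mapPartitions(predict_partition)`, so the load cost is paid per worker process rather than per record.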
  • 27. PySpark recommendation • spark.executor.cores = 1 • Gives the DL framework full access over all the resources • Important for frameworks that optimize processor pipelines
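As a small illustration, the recommended setting can be passed on the command line through the `--conf key=value` option of `spark-submit`. The helper `spark_submit_args` and the script name `train_job.py` below are hypothetical; only the property name comes from the slide.

```python
def spark_submit_args(conf, app):
    """Build a spark-submit argument list from a dict of Spark properties."""
    args = ["spark-submit"]
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    args.append(app)
    return args


# One core per executor, so a single Python task owns the executor's resources.
DL_CONF = {"spark.executor.cores": "1"}

command = spark_submit_args(DL_CONF, "train_job.py")  # hypothetical script name
```

The resulting list can be handed to `subprocess.run`; the same property could equally be set in `spark-defaults.conf`.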
  • 28. Outline • Deep Learning in data pipelines • Recurring patterns in Spark + Deep Learning integrations • Developer tips • Monitoring
  • 30. Monitoring • How do you monitor the progress of your tasks? • It depends on the granularity • Around tasks • Inside (long-running) tasks
  • 31. Monitoring: Accumulators • Good to check throughput or failure rate • Works for Scala • Limited use for Python (for now, SPARK-2868) • No “real-time” update
        batchesAcc = sc.accumulator(1)
        def processBatch(i):
            global batchesAcc
            batchesAcc += 1
            # Process image batch here
        images = sc.parallelize(…)
        images.map(processBatch).collect()
  • 32. Monitoring: external system • Plugs into an external system • Existing solutions: Grafana, Graphite, Prometheus, etc. • Most flexible, but more complex to deploy
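To make the external-system option concrete, here is a minimal sketch of reporting progress from inside a long-running task using Graphite's plaintext protocol (one `path value timestamp` line per metric over TCP, port 2003 by default). The metric path, host and helper names are assumptions for illustration, not part of the talk.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one metric in Graphite's plaintext protocol."""
    ts = int(time.time() if timestamp is None else timestamp)
    return f"{path} {value} {ts}\n"


def send_metric(host, port, path, value, timeout=2.0):
    """Open a TCP connection to a Graphite/Carbon listener and send one metric."""
    line = graphite_line(path, value).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(line)


# Formatting only; no network connection is made here.
example = graphite_line("dl.training.batches_done", 128, timestamp=1493856000)
```

A task would call something like `send_metric("graphite.example.internal", 2003, "dl.training.batches_done", n)` every few batches, where the host name is a placeholder; Grafana can then chart the series.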
  • 33. Conclusion • Distributed deep learning: exciting and fast-moving space • Most insights are specific to a task, a dataset and an algorithm: nothing replaces experiments • Get started with data-parallel jobs • Move to cooperative frameworks only when your data are too large.
  • 34. Challenges to address For Spark developers • Monitoring long-running tasks • Presenting and introspecting intermediate results For DL developers • What boundary to put between the algorithm and Spark? • How to integrate with Spark at the low-level?
  • 35. Resources Recent blog posts — http://databricks.com/blog • TensorFrames • GPU acceleration • Getting started with Deep Learning • Intel’s BigDL Docs for Deep Learning on Databricks — http://docs.databricks.com • Getting started • Spark integration
  • 36. SPARK SUMMIT 2017 DATA SCIENCE AND ENGINEERING AT SCALE JUNE 5 – 7 | MOSCONE CENTER | SAN FRANCISCO ORGANIZED BY spark-summit.org/2017