WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
Anna Holschuh, Target
Parallelizing With Apache
Spark In Unexpected Ways
#UnifiedAnalytics #SparkAISummit
What This Talk is About
• Tips for parallelizing with Spark
• Lots of (Scala) code examples
• Focus on Scala programming constructs
Who am I
• Lead Data Engineer/Scientist at Target since 2016
• Deep love of all things Target
• Other Spark Summit talks:
o 2018: Extending Apache Spark APIs Without Going Near Spark Source Or A
Compiler
o 2019: Lessons In Linear Algebra At Scale With Apache Spark : Let’s Make The
Sparse Details A Bit More Dense
Agenda
• Introduction
• Parallel Job Submission and Schedulers
• Partitioning Strategies
• Distributing More Than Just Data
Agenda
• Introduction
• Parallel Job Submission and Schedulers
• Partitioning Strategies
• Distributing More Than Just Data
Introduction
> Hello, Spark!_
Application
Job
Stage
Task
Driver
Executor
Dataset
Dataframe
RDD
Partition
Action
Transformation
Shuffle
Agenda
• Introduction
• Parallel Job Submission and Schedulers
• Partitioning Strategies
• Distributing More Than Just Data
Parallel Job Submission and Schedulers
Let’s do some data exploration
• We have a system of Authors, Articles, and
Comments on those Articles
• We would like to do some simple data
exploration as part of a batch job
• We execute this code from a built jar through
spark-submit on a cluster with 100
executors, 5 cores per executor, and 10 GB of
memory each for the driver and the executors.
• What happens in Spark when we kick off
the exploration?
Parallel Job Submission and Schedulers
The Execution Starts
• One job is kicked off at a time.
• We asked a few independent questions
in our exploration. Why can’t they be
running at the same time?
Parallel Job Submission and Schedulers
The Execution Completes
• All of our questions run as separate
jobs.
• Examining the timing demonstrates that
these jobs run serially.
Parallel Job Submission and Schedulers
One more sanity check
• All of our questions, running serially.
Parallel Job Submission and Schedulers
Can we potentially speed up our
exploration?
• Spark turns our questions into 3 Jobs
• The Jobs run serially
• We notice that some of our questions are
independent. Can they be run at the same
time?
• The answer is yes. We can leverage Scala
Concurrency features and the Spark
Scheduler to achieve this…
Scala Futures
• A placeholder for a value that may not yet
exist.
• Asynchronous
• Requires an ExecutionContext
• Use Await to block
• Extremely flexible syntax. Supports
for-comprehension chaining to manage
dependencies.
Parallel Job Submission
and Schedulers
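The bullet points above can be sketched in plain Scala, with no Spark required; the pool size of 3 and the 10-second timeout are illustrative assumptions:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Futures need an ExecutionContext to run on; here it is backed by a
// dedicated thread pool that we control.
val pool = Executors.newFixedThreadPool(3)
implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

// A Future is a placeholder for a value that may not yet exist.
val answer: Future[Int] = Future {
  21 * 2 // stand-in for expensive, asynchronous work
}

// Await blocks the calling thread until the Future completes (or times out).
val result = Await.result(answer, 10.seconds)

pool.shutdown()
```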
Parallel Job Submission
and Schedulers
Let’s rework our original code using
Scala Futures to parallelize Job
Submission
• We pull in a reference to an implicit
ExecutionContext
• We wrap each of our questions in a Future
block to be run asynchronously
• We block on our asynchronous questions all
being completed
• (Not seen) We properly shut down the
ExecutorService when the job is complete
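A minimal sketch of that rework, with plain Scala functions standing in for the Spark actions; in the real job each Future body would be a call like articles.count(), and the names and sleep times here are invented for illustration:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Implicit ExecutionContext backed by an ExecutorService we control.
val pool = Executors.newFixedThreadPool(3)
implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

// Stand-ins for the independent questions; in Spark each would trigger a Job.
def countArticles(): Long = { Thread.sleep(50); 100L }
def countAuthors(): Long  = { Thread.sleep(50); 10L }
def countComments(): Long = { Thread.sleep(50); 1000L }

// Wrap each question in a Future block so all three are submitted at once.
val questions = Seq(
  Future(countArticles()),
  Future(countAuthors()),
  Future(countComments())
)

// Block on all of the asynchronous questions being completed.
val answers = Await.result(Future.sequence(questions), 1.minute)

// Properly shut down the ExecutorService when the job is complete.
pool.shutdown()
```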
Parallel Job Submission and Schedulers
Our questions are now
asked concurrently
• All of our questions run as separate
jobs.
• Examining the timing demonstrates that
these jobs are now running concurrently.
Parallel Job Submission and Schedulers
One more sanity check
• All of our questions, running concurrently.
Parallel Job Submission and Schedulers
A note about Spark Schedulers
• The default scheduler is FIFO
• Starting in Spark 0.8, fair sharing became
available via the Fair Scheduler
• Fair Scheduling makes resources available to
all queued Jobs
• Turn on Fair Scheduling through SparkSession
config and supporting allocation pool config
• Threads that submit Spark Jobs should
specify what scheduler pool to use if it’s not
the default
Reference: https://spark.apache.org/docs/2.2.0/job-scheduling.html
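A hedged sketch of the supporting configuration, following the job-scheduling docs referenced above; the pool name "exploration" and the allocation file path are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession

// Turn on Fair Scheduling through SparkSession config; the allocation
// file defines named pools (schedulingMode, weight, minShare).
val spark = SparkSession.builder()
  .appName("data-exploration")
  .config("spark.scheduler.mode", "FAIR")
  .config("spark.scheduler.allocation.file", "fairscheduler.xml")
  .getOrCreate()

// Threads that submit Spark Jobs opt into a pool defined in that file;
// otherwise they land in the default pool.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "exploration")
```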
Parallel Job Submission and Schedulers
The Fair Scheduler is
enabled
Parallel Job Submission and Schedulers
Creating a DAG of Futures on the
Driver
• Scala Futures syntax enables
for-comprehensions to represent
dependencies in asynchronous operations
• Spark code can be structured with Futures
to represent a DAG of work on the Driver
• When reworking all code into futures, there
will be some redundancy with Spark’s role
in planning and optimizing, and Spark
handles all of this without issue
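A small sketch of such a DAG in plain Scala; the loads are stand-ins for Spark reads, and because the two independent Futures are created before the for-comprehension, they run concurrently while the combining step waits on both:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

val pool = Executors.newFixedThreadPool(4)
implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

// Independent work starts immediately on creation...
val loadArticles = Future { Seq(3, 5, 8) } // stand-in for a Spark load
val loadAuthors  = Future { Seq(1, 2) }

// ...while the for-comprehension expresses the dependency: the combined
// step runs only after both upstream Futures complete, forming a small
// DAG of work on the driver.
val combined: Future[Int] = for {
  articles <- loadArticles
  authors  <- loadAuthors
} yield articles.sum + authors.size

val total = Await.result(combined, 30.seconds)
pool.shutdown()
```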
Parallel Job Submission and Schedulers
Why use this strategy?
• To maximize resource utilization in your
cluster
• To maximize the concurrency potential of your
job (and thus speed/efficiency)
• Fair Scheduling pools can support different
notions of priority of work in jobs
• Fair Scheduling pools can support multi-user
environments to enable more even resource
allocation in a shared cluster
Takeaways
• Actions trigger Spark to do things (i.e. create
Jobs)
• Spark can certainly handle running multiple
Jobs at once; you just have to tell it to
• This can be accomplished by multithreading
the driver. In Scala, this can be accomplished
using Futures.
• The way tasks are executed when multiple
jobs are running at once can be further
configured through either Spark’s FIFO or Fair
Scheduler with configured supporting pools.
Agenda
• Introduction
• Parallel Job Submission and Schedulers
• Partitioning Strategies
• Distributing More Than Just Data
Partitioning Strategies
A first experience with partitioning
Partitioning Strategies
Getting started with partitioning
• .repartition() vs .coalesce()
• Custom partitioning is supported with the
RDD API only (specifically through
implicitly added PairRDDFunctions)
• Spark supports the HashPartitioner and
RangePartitioner out of the box
• One can create custom partitioners by
extending Partitioner to enable custom
strategies in grouping data
Partitioning Strategies
How can non-standard partitioning be
useful?
#1: Collocating data for joins
• We are joining datasets of Articles and Authors
together by the Author’s id.
• When we pull the raw Article dataset, author ids
are likely to be distributed somewhat randomly
throughout partitions.
• Joins can be considered wide transformations
depending on underlying data and could result in
full shuffles.
• We can cut down on the impact of the shuffle
stage by collocating data by the id to join on
within partitions so there is less cross chatter
during this phase.
Partitioning Strategies
#1: Collocating data for joins
[Diagram: four partitions of Articles with author_ids 1–5 scattered randomly across them, alongside partitions of Authors keyed by id. After repartitioning on author_id, each partition holds all of the Articles for a given author_id, collocated with the matching Author record.]
Partitioning Strategies
How can non-standard partitioning be
useful?
#2: Grouping data to operate on partitions
as a whole
• We need to calculate an Author Summary report
that needs to have access to all Articles for an
Author to generate meaningful overall metrics
• We could leverage .map and .reduceByKey to
combine Articles for analysis in a pairwise fashion
or by gathering groups for processing
• Operating on a whole partition grouped by an
Author also accomplishes this goal
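The whole-partition idea can be sketched with a plain Scala function of the shape RDD.mapPartitions expects (an Iterator in, an Iterator out); the Article fields here are invented for illustration:

```scala
// Invented record type for illustration.
case class Article(authorId: Int, wordCount: Int)

// A function with the Iterator => Iterator shape that RDD.mapPartitions
// accepts. With the data partitioned by author, one call sees every
// Article for the authors that landed in its partition.
def summarizeAuthors(partition: Iterator[Article]): Iterator[(Int, Int, Int)] =
  partition.toSeq
    .groupBy(_.authorId)
    .map { case (id, articles) =>
      (id, articles.size, articles.map(_.wordCount).sum) // (author, articles, total words)
    }
    .iterator

val summary = summarizeAuthors(Iterator(
  Article(1, 500), Article(1, 300), Article(2, 800)
)).toSet
```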
Partitioning Strategies
Implementing a Custom Partitioner
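The slide's code is not reproduced here, but a minimal sketch of the idea follows. In a real job the class would extend org.apache.spark.Partitioner and be passed to partitionBy on a pair RDD; it is shown as plain Scala so the grouping logic stands alone:

```scala
// Sketch of a custom partitioner that collocates records by author id.
// In Spark this class would extend org.apache.spark.Partitioner and
// override numPartitions and getPartition.
class AuthorPartitioner(val numPartitions: Int) {
  def getPartition(key: Any): Int = {
    // Non-negative modulo, mirroring what a hash-based partitioner does.
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }
}

val partitioner = new AuthorPartitioner(4)
// The same author id always lands in the same partition, so Articles
// and Authors keyed by that id end up collocated for the join.
val p1 = partitioner.getPartition(42)
val p2 = partitioner.getPartition(42)
```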
Partitioning Strategies
Takeaways
• Partitioning can help even out data skew for
more reliable and performant processing.
• The RDD API supports more fine-grained
partitioning with Hash and Range Partitioners.
• One can implement a custom partitioner to
have even more control over how data is
grouped, which creates opportunity for more
performant joins and operations on partitions
as a whole.
• There is expense involved in repartitioning
that has to be balanced against the cost of an
operation on less organized data.
Agenda
• Introduction
• Parallel Job Submission and Schedulers
• Partitioning Strategies
• Distributing More Than Just Data
Typical Spark Usage Patterns
• Load data from a store into a
Dataset/Dataframe/RDD
• Apply various transformations and actions
to explore the data
• Build increasingly complex transformations
by leveraging Spark’s flexibility in
accepting functions into API calls
• What are the limits of these
transformations and how can we move
past them? Can we distribute more
complex computation?
Distributing More Than
Just Data
Distributing More Than Just Data
#1: Distributing Scripts
• It is often useful to support third party libraries
or scripts from peers who like or need to work
in different languages to accomplish Data
Science goals.
• Oftentimes, these scripts have nothing to do
with Spark, and language bindings for libraries
might not work as well as expected when
called directly from Spark due to serialization
constraints (among other things).
• One can distribute scripts to be executed
within Spark by leveraging Scala’s
scala.sys.process package and Spark’s file
moving capability.
Distributing More Than Just Data
scala.sys.process
• A package that handles the execution of
external processes
• Provides a concise DSL for running and
chaining processes
• Blocks until the process is complete
• Scripts and commands can be interfaced
with through all of the usual means
(reading stdin, reading local files, writing
stdout, writing local files)
Reference: https://www.scala-lang.org/api/2.11.7/#scala.sys.process.ProcessBuilder
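A minimal example of the DSL; the echo commands are stand-ins for the third-party script that would be shipped to executors (e.g. via spark-submit --files) and invoked inside a partition-level transformation:

```scala
import scala.sys.process._

// !! runs the process, blocks until it completes, and captures stdout
// as a String (throwing on a non-zero exit code).
val output: String = Seq("echo", "hello from an external process").!!

// ! also blocks, but returns the exit code instead (0 means success).
val exitCode: Int = Seq("echo", "done").!
```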
Distributing More Than Just Data
Gotchas
• Make sure the resources provisioned on
executors are suitable to handle the external
process to be run.
• Make sure the external process is built for the
architecture that your cluster runs on and that all
necessary dependencies are available.
• When running more than one executor core,
make sure the external process can handle
having multiple instances of itself running on the
same container.
• When communicating with your process through
the file system, watch out for file system
collisions and be cognizant of cleaning up state.
Distributing More Than Just Data
#2: Distributing Data Gathering
• APIs have become a very prevalent way of
providing data to systems
• A common operation is gathering data (often
in json) from an API for some entity
• It is possible to leverage Spark APIs to gather
this data due to the flexibility in design of
passing functions to transformations
Distributing More Than Just Data
Gotchas
• Make sure to carefully manage the number
of concurrent connections being opened
to the APIs being used.
• There are always going to be intermittent
blips when hitting APIs at scale. Don’t
forget to use thorough error handling and
retry logic.
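A sketch of the kind of retry guard each partition's API calls might go through; the attempt count, fixed backoff, and fake flaky call are all invented for illustration:

```scala
import scala.util.{Failure, Success, Try}

// Retry a call up to `attempts` times, sleeping between failures.
def withRetry[T](attempts: Int, backoffMs: Long = 100L)(call: => T): T =
  Try(call) match {
    case Success(value) => value
    case Failure(_) if attempts > 1 =>
      Thread.sleep(backoffMs) // real code might use exponential backoff
      withRetry(attempts - 1, backoffMs)(call)
    case Failure(error) => throw error
  }

// Simulate an API with intermittent blips: fails twice, then succeeds.
var blips = 2
val response = withRetry(attempts = 3) {
  if (blips > 0) { blips -= 1; throw new RuntimeException("intermittent blip") }
  """{"status":"ok"}"""
}
```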
Distributing More Than Just Data
Takeaways
• The flexibility of Spark’s APIs allow the
framework to be used for more than a
typical workflow of applying relatively small
functions to distributed data.
• One can distribute scripts to be run as
external processes using data contained in
partitions to build inputs and subsequently
pipe outputs back into Spark APIs.
• One can distribute calls to APIs to gather
data and use Spark mechanisms to control
the load on these external sources.
Agenda
• Introduction
• Parallel Job Submission and Schedulers
• Partitioning Strategies
• Distributing More Than Just Data
Come Work At Target
• We are hiring in Data Science and Data Engineering
• Solve real-world problems in domains ranging from
supply chain logistics to smart stores to
personalization and more
• Offices in…
o Sunnyvale, CA
o Minneapolis, MN
o Pittsburgh, PA
o Bangalore, India
work somewhere you
jobs.target.com
Target @ Spark+AI Summit
Check out our other talks…
2018
• Extending Apache Spark APIs Without Going Near Spark Source Or
A Compiler (Anna Holschuh)
2019
• Apache Spark Data Validation (Doug Balog and Patrick Pisciuneri)
• Lessons In Linear Algebra At Scale With Apache Spark: Let’s make
the sparse details a bit more dense (Anna Holschuh)
Acknowledgements
• Thank you Spark Summit
• Thank you Target
• Thank you wonderful team members at Target
• Thank you vibrant Spark and Scala communities
QUESTIONS
anna.holschuh@target.com
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT

Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 

Parallelizing with Apache Spark in Unexpected Ways

  • 1. WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
  • 2. Anna Holschuh, Target Parallelizing With Apache Spark In Unexpected Ways #UnifiedAnalytics #SparkAISummit
  • 3. What This Talk is About • Tips for parallelizing with Spark • Lots of (Scala) code examples • Focus on Scala programming constructs 3#UnifiedAnalytics #SparkAISummit
  • 4. 4#UnifiedAnalytics #SparkAISummit Who am I • Lead Data Engineer/Scientist at Target since 2016 • Deep love of all things Target • Other Spark Summit talks: o 2018: Extending Apache Spark APIs Without Going Near Spark Source Or A Compiler o 2019: Lessons In Linear Algebra At Scale With Apache Spark : Let’s Make The Sparse Details A Bit More Dense
  • 5. 5#UnifiedAnalytics #SparkAISummit Agenda • Introduction • Parallel Job Submission and Schedulers • Partitioning Strategies • Distributing More Than Just Data
  • 6. 6#UnifiedAnalytics #SparkAISummit Agenda • Introduction • Parallel Job Submission and Schedulers • Partitioning Strategies • Distributing More Than Just Data
  • 7. 7#UnifiedAnalytics #SparkAISummit Introduction > Hello, Spark!_ Application Job Stage Task Driver Executor Dataset Dataframe RDD Partition Action Transformation Shuffle
  • 8. 8#UnifiedAnalytics #SparkAISummit Agenda • Introduction • Parallel Job Submission and Schedulers • Partitioning Strategies • Distributing More Than Just Data
  • 9. 9#UnifiedAnalytics #SparkAISummit Parallel Job Submission and Schedulers Let’s do some data exploration • We have a system of Authors, Articles, and Comments on those Articles • We would like to do some simple data exploration as part of a batch job • We execute this code from a built jar through spark-submit on a cluster with 100 executors, 5 cores per executor, 10 GB of driver memory, and 10 GB of memory per executor • What happens in Spark when we kick off the exploration?
  • 10. 10#UnifiedAnalytics #SparkAISummit Parallel Job Submission and Schedulers The Execution Starts • One job is kicked off at a time. • We asked a few independent questions in our exploration. Why can’t they be running at the same time?
  • 11. 11#UnifiedAnalytics #SparkAISummit Parallel Job Submission and Schedulers The Execution Completes • All of our questions run as separate jobs. • Examining the timing demonstrates that these jobs run serially.
  • 12. 12#UnifiedAnalytics #SparkAISummit Parallel Job Submission and Schedulers One more sanity check • All of our questions, running serially.
  • 13. 13#UnifiedAnalytics #SparkAISummit Parallel Job Submission and Schedulers Can we potentially speed up our exploration? • Spark turns our questions into 3 Jobs • The Jobs run serially • We notice that some of our questions are independent. Can they be run at the same time? • The answer is yes. We can leverage Scala Concurrency features and the Spark Scheduler to achieve this…
  • 14. 14#UnifiedAnalytics #SparkAISummit Scala Futures • A placeholder for a value that may not exist. • Asynchronous • Requires an ExecutionContext • Use Await to block • Extremely flexible syntax. Supports for- comprehension chaining to manage dependencies. Parallel Job Submission and Schedulers
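The Future/ExecutionContext/Await pattern named on this slide can be sketched in a few lines of plain Scala (no Spark involved); the fixed thread-pool size here is an arbitrary illustrative choice:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// An ExecutionContext backed by a fixed thread pool; every Future needs one in scope
val pool = Executors.newFixedThreadPool(4)
implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

// A Future is a placeholder for a value that may not exist yet; the body
// runs asynchronously on the ExecutionContext
val answer: Future[Int] = Future { 21 * 2 }

// Await blocks the calling thread until the Future completes (or the timeout fires)
val result = Await.result(answer, 10.seconds)
println(result)

pool.shutdown()
```

Using `ExecutionContext.fromExecutorService` rather than the global context makes the thread pool explicit, which matters later when deciding how many Spark jobs may be submitted concurrently.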
  • 15. 15#UnifiedAnalytics #SparkAISummit Parallel Job Submission and Schedulers Let’s rework our original code using Scala Futures to parallelize Job Submission • We pull in a reference to an implicit ExecutionContext • We wrap each of our questions in a Future block to be run asynchronously • We block on our asynchronous questions all being completed • (Not seen) We properly shut down the ExecutorService when the job is complete
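A minimal local-mode sketch of the rework described above; the `articles` and `comments` datasets are illustrative stand-ins (the talk's actual data is not shown), and each `.count()` is an action that submits its own Spark job from its own driver thread:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("parallel-jobs").getOrCreate()
import spark.implicits._

// Illustrative stand-ins for the talk's Articles and Comments datasets
val articles = (1 to 1000).toDS()
val comments = (1 to 5000).toDS()

val pool = Executors.newFixedThreadPool(2)
implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

// Each independent question is wrapped in a Future block so the two
// count jobs are submitted to Spark concurrently
val articleCount = Future { articles.count() }
val commentCount = Future { comments.count() }

// Block on all of our asynchronous questions being completed
val results = Await.result(Future.sequence(Seq(articleCount, commentCount)), 10.minutes)
println(results)

// Properly shut down the ExecutorService when the job is complete
pool.shutdown()
spark.stop()
```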
  • 16. 16#UnifiedAnalytics #SparkAISummit Parallel Job Submission and Schedulers Our questions are now asked concurrently • All of our questions run as separate jobs. • Examining the timing demonstrates that these jobs are now running concurrently.
  • 17. 17#UnifiedAnalytics #SparkAISummit Parallel Job Submission and Schedulers One more sanity check • All of our questions, running concurrently.
  • 18. 18#UnifiedAnalytics #SparkAISummit Parallel Job Submission and Schedulers A note about Spark Schedulers • The default scheduler is FIFO • Starting in Spark 0.8, Fair sharing became available, aka the Fair Scheduler • Fair Scheduling makes resources available to all queued Jobs • Turn on Fair Scheduling through SparkSession config and supporting allocation pool config • Threads that submit Spark Jobs should specify what scheduler pool to use if it’s not the default Reference: https://spark.apache.org/docs/2.2.0/job-scheduling.html
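The two configuration pieces mentioned on this slide might look roughly like the following; the pool name `exploration` and the allocation-file path are illustrative assumptions, not values from the talk:

```scala
import org.apache.spark.sql.SparkSession

// Contents of the supporting allocation pool file (referenced below as
// /path/to/fairscheduler.xml); pool name and shares are illustrative:
//
//   <allocations>
//     <pool name="exploration">
//       <schedulingMode>FAIR</schedulingMode>
//       <weight>1</weight>
//       <minShare>2</minShare>
//     </pool>
//   </allocations>

val spark = SparkSession.builder
  .appName("fair-scheduling")
  .config("spark.scheduler.mode", "FAIR")
  .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
  .getOrCreate()

// Each driver thread that submits Spark jobs opts into a pool via a
// thread-local property; omit this and the thread uses the default pool
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "exploration")
```

Because `setLocalProperty` is thread-local, different Futures on the driver can submit their jobs into different pools.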
  • 19. 19#UnifiedAnalytics #SparkAISummit Parallel Job Submission and Schedulers The Fair Scheduler is enabled
  • 20. 20#UnifiedAnalytics #SparkAISummit Parallel Job Submission and Schedulers Creating a DAG of Futures on the Driver • Scala Futures syntax enables for- comprehensions to represent dependencies in asynchronous operations • Spark code can be structured with Futures to represent a DAG of work on the Driver • When reworking all code into futures, there will be some redundancy with Spark’s role in planning and optimizing, and Spark handles all of this without issue
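The for-comprehension DAG idea can be shown with plain Scala Futures; the two "load" Futures are kicked off outside the comprehension so they run concurrently, while the comprehension itself encodes the dependency of the final step on both:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

val pool = Executors.newFixedThreadPool(4)
implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

// Two independent pieces of work start immediately because the Futures
// are created before the for-comprehension references them
val loadA = Future { 10 }
val loadB = Future { 32 }

// The for-comprehension expresses the DAG: combined depends on both loads
val combined: Future[Int] =
  for {
    a <- loadA
    b <- loadB
  } yield a + b // runs only after both dependencies complete

val result = Await.result(combined, 1.minute)
println(result)

pool.shutdown()
```

Had `loadA` and `loadB` been written inline inside the for-comprehension, they would have run sequentially; creating them first is what makes the DAG's independent branches concurrent.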
  • 21. 21#UnifiedAnalytics #SparkAISummit Parallel Job Submission and Schedulers Why use this strategy? • To maximize resource utilization in your cluster • To maximize the concurrency potential of your job (and thus speed/efficiency) • Fair Scheduling pools can support different notions of priority of work in jobs • Fair Scheduling pools can support multi-user environments to enable more even resource allocation in a shared cluster Takeaways • Actions trigger Spark to do things (i.e. create Jobs) • Spark can certainly handle running multiple Jobs at once; you just have to tell it to • This can be accomplished by multithreading the driver; in Scala, Futures are a natural fit • The way tasks are executed when multiple jobs run at once can be further configured through Spark’s FIFO scheduler or its Fair Scheduler with configured supporting pools.
  • 22. 22#UnifiedAnalytics #SparkAISummit Agenda • Introduction • Parallel Job Submission and Schedulers • Partitioning Strategies • Distributing More Than Just Data
  • 24. 24#UnifiedAnalytics #SparkAISummit Partitioning Strategies Getting started with partitioning • .repartition() vs .coalesce() • Custom partitioning is supported with the RDD API only (specifically through implicitly added PairRDDFunctions) • Spark supports the HashPartitioner and RangePartitioner out of the box • One can create custom partitioners by extending Partitioner to enable custom strategies in grouping data
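A minimal sketch of the "extend Partitioner" idea from this slide; partitioning by an author id key (a hypothetical key choice, anticipating the join example that follows) so all records for one author land in the same partition:

```scala
import org.apache.spark.Partitioner

// A hypothetical custom partitioner that routes records by a Long author id;
// every record sharing an id is assigned to the same partition
class AuthorIdPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key match {
    // floorMod keeps the result non-negative even for negative ids/hashes
    case authorId: Long => Math.floorMod(authorId, numPartitions.toLong).toInt
    case other          => Math.floorMod(other.hashCode, numPartitions)
  }
}

val p = new AuthorIdPartitioner(8)
println(p.getPartition(10L))
```

Usage would look like `pairRdd.partitionBy(new AuthorIdPartitioner(8))`, where `pairRdd` is a key/value RDD keyed by author id (the `partitionBy` call comes from the implicitly added PairRDDFunctions the slide mentions).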
  • 25. 25#UnifiedAnalytics #SparkAISummit Partitioning Strategies How can non-standard partitioning be useful? #1 : Collocating data for joins • We are joining datasets of Articles and Authors together by the Author’s id. • When we pull the raw Article dataset, author ids are likely to be distributed somewhat randomly throughout partitions. • Joins can be considered wide transformations depending on underlying data and could result in full shuffles. • We can cut down on the impact of the shuffle stage by collocating data by the id to join on within partitions so there is less cross chatter during this phase.
  • 26. 26#UnifiedAnalytics #SparkAISummit Partitioning Strategies #1: Collocating data for joins (diagram: Articles records keyed by author_id start out scattered across partitions; after repartitioning both Articles and Authors by author_id, all records sharing an id sit in the same partition, ready to be joined with minimal cross-partition shuffling)
  • 27. 27#UnifiedAnalytics #SparkAISummit Partitioning Strategies How can non-standard partitioning be useful? #2 : Grouping data to operate on partitions as a whole • We need to calculate an Author Summary report that needs to have access to all Articles for an Author to generate meaningful overall metrics • We could leverage .map and .reduceByKey to combine Articles for analysis in a pairwise fashion or by gathering groups for processing • Operating on a whole partition grouped by an Author also accomplishes this goal
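A local-mode sketch of operating on a whole partition at once, as this slide describes; the `(authorId, wordCount)` pairs are hypothetical stand-ins for the Articles data, and hash-partitioning by key first guarantees each author's records are confined to a single partition:

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[2]").appName("partition-summaries").getOrCreate()
val sc = spark.sparkContext

// Hypothetical (authorId, articleWordCount) pairs standing in for Articles
val articles = sc.parallelize(Seq((1L, 100), (1L, 250), (2L, 80), (2L, 40), (3L, 10)))

// Collocate each author's articles in one partition, then summarize each
// partition as a whole instead of reducing pairwise
val summaries = articles
  .partitionBy(new HashPartitioner(2))
  .mapPartitions { iter =>
    iter.toSeq
      .groupBy { case (authorId, _) => authorId }
      .map { case (authorId, rows) => (authorId, rows.map(_._2).sum) }
      .iterator
  }
  .collect()
  .toMap

println(summaries)
spark.stop()
```

Because the partitioner assigns each key to exactly one partition, the per-partition `groupBy` never splits an author across partitions, so the per-partition summaries are also the global ones.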
  • 29. 29#UnifiedAnalytics #SparkAISummit Partitioning Strategies Takeaways • Partitioning can help even out data skew for more reliable and performant processing. • The RDD API supports more fine-grained partitioning with Hash and Range Partitioners. • One can implement a custom partitioner to have even more control over how data is grouped, which creates opportunity for more performant joins and operations on partitions as a whole. • There is expense involved in repartitioning that has to be balanced against the cost of an operation on less organized data.
  • 30. 30#UnifiedAnalytics #SparkAISummit Agenda • Introduction • Parallel Job Submission and Schedulers • Partitioning Strategies • Distributing More Than Just Data
  • 31. 31#UnifiedAnalytics #SparkAISummit Typical Spark Usage Patterns • Load data from a store into a Dataset/Dataframe/RDD • Apply various transformations and actions to explore the data • Build increasingly complex transformations by leveraging Spark’s flexibility in accepting functions into API calls • What are the limits of these transformations and how can we move past them? Can we distribute more complex computation? Distributing More Than Just Data
  • 32. 32#UnifiedAnalytics #SparkAISummit Distributing More Than Just Data #1: Distributing Scripts • It is often useful to support third-party libraries or scripts from peers who like or need to work in different languages to accomplish Data Science goals. • Oftentimes, these scripts have nothing to do with Spark, and language bindings for libraries might not work as well as expected when called directly from Spark due to serialization constraints (among other things). • One can distribute scripts to be executed within Spark by leveraging Scala’s scala.sys.process package and Spark’s file-moving capability.
  • 33. 33#UnifiedAnalytics #SparkAISummit Distributing More Than Just Data scala.sys.process • A package that handles the execution of external processes • Provides a concise DSL for running and chaining processes • Blocks until the process is complete • Scripts and commands can be interfaced with through all of the usual means (reading stdin, reading local files, writing stdout, writing local files) Reference: https://www.scala-lang.org/api/2.11.7/#scala.sys.process.ProcessBuilder
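The scala.sys.process DSL described above can be exercised in a couple of lines; here `echo` stands in for a real third-party script that would have been shipped to the executors:

```scala
import scala.sys.process._

// Run an external command and capture its stdout as a String; !! blocks
// until the process exits and throws if the exit code is non-zero
val out: String = Seq("echo", "hello from an external process").!!.trim
println(out)

// The DSL also exposes the exit code directly via !
val exitCode: Int = Seq("true").!
println(exitCode)
```

Inside a Spark job, a call like this would typically sit inside a `mapPartitions` body, writing partition data to a temp file for the script to read and piping the script's output back into the returned iterator.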
  • 34. 34#UnifiedAnalytics #SparkAISummit Distributing More Than Just Data Gotchas • Make sure the resources provisioned on executors are suitable to handle the external process to be run. • Make sure the external process is built for the architecture that your cluster runs on and that all necessary dependencies are available. • When running more than one executor core, make sure the external process can handle having multiple instances of itself running on the same container. • When communicating with your process through the file system, watch out for file system collisions and be cognizant of cleaning up state.
  • 35. 35#UnifiedAnalytics #SparkAISummit Distributing More Than Just Data #2: Distributing Data Gathering • APIs have become a very prevalent way of providing data to systems • A common operation is gathering data (often in json) from an API for some entity • It is possible to leverage Spark APIs to gather this data due to the flexibility in design of passing functions to transformations
  • 36. 36#UnifiedAnalytics #SparkAISummit Distributing More Than Just Data Gotchas • Make sure to carefully manage the number of concurrent connections being opened to the APIs being used. • There are always going to be intermittent blips when hitting APIs at scale. Don’t forget to use thorough error handling and retry logic.
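The retry logic this slide calls for can be sketched as a small generic helper in plain Scala; the attempt count, fixed backoff, and the failing call below are all illustrative:

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

// A minimal retry helper for flaky calls (e.g. HTTP requests made inside
// a mapPartitions body); retries up to maxAttempts times with a fixed backoff
@tailrec
def retry[T](maxAttempts: Int)(call: => T): T =
  Try(call) match {
    case Success(value)                  => value
    case Failure(_) if maxAttempts > 1  =>
      Thread.sleep(100) // simple fixed backoff between attempts
      retry(maxAttempts - 1)(call)
    case Failure(e)                      => throw e
  }

// Example: a call that fails twice ("intermittent blips") before succeeding
var attempts = 0
val value = retry(5) {
  attempts += 1
  if (attempts < 3) throw new RuntimeException("intermittent blip")
  "ok"
}
println(value)
```

A real version would likely add exponential backoff with jitter; the concurrent-connection concern from the slide is handled separately, by sizing the number of partitions (and thus simultaneous tasks) hitting the API.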
  • 37. 37#UnifiedAnalytics #SparkAISummit Distributing More Than Just Data Takeaways • The flexibility of Spark’s APIs allow the framework to be used for more than a typical workflow of applying relatively small functions to distributed data. • One can distribute scripts to be run as external processes using data contained in partitions to build inputs and subsequently pipe outputs back into Spark APIs. • One can distribute calls to APIs to gather data and use Spark mechanisms to control the load on these external sources.
  • 38. 38#UnifiedAnalytics #SparkAISummit Agenda • Introduction • Parallel Job Submission and Schedulers • Partitioning Strategies • Distributing More Than Just Data
  • 39. 39#UnifiedAnalytics #SparkAISummit Come Work At Target • We are hiring in Data Science and Data Engineering • Solve real-world problems in domains ranging from supply chain logistics to smart stores to personalization and more • Offices in… o Sunnyvale, CA o Minneapolis, MN o Pittsburgh, PA o Bangalore, India work somewhere you ❤ jobs.target.com
  • 40. 40#UnifiedAnalytics #SparkAISummit Target @ Spark+AI Summit Check out our other talks… 2018 • Extending Apache Spark APIs Without Going Near Spark Source Or A Compiler (Anna Holschuh) 2019 • Apache Spark Data Validation (Doug Balog and Patrick Pisciuneri) • Lessons In Linear Algebra At Scale With Apache Spark: Let’s make the sparse details a bit more dense (Anna Holschuh)
  • 41. 41#UnifiedAnalytics #SparkAISummit Acknowledgements • Thank you Spark Summit • Thank you Target • Thank you wonderful team members at Target • Thank you vibrant Spark and Scala communities
  • 43. DON’T FORGET TO RATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT