The Typesafe Reactive Platform and
Apache Spark: Experiences,
Challenges and Roadmaps
Stavros Kontopoulos, MSc
Fast Data and
Typesafe’s Reactive Platform
Fast Data for Reactive Applications
Typesafe’s Fast Data Strategy
• Reactive Platform, Fast Data Architecture
This strategy targets different market needs:
• Microservice architecture with an analytics extension
• Analytics-oriented setup where the core infrastructure can be
managed by Mesos-like tools and where Kafka, HDFS and several DBs
such as Riak and Cassandra are first-class citizens.
3
Fast Data for Reactive Applications
Reactive Platform (RP):
• Core elements: Play, Akka, Spark. Scala is the common language.
• ConductR is the glue for managing these elements.
Fast Data Architecture utilizes RP and is meant to provide end-to-end
solutions for highly scalable web apps, IoT and other use cases /
requirements.
4
Fast Data for Reactive Applications
5
Reactive Application traits
Partnerships
Fast Data Partnerships
• Databricks
  • Scala insights, the backpressure feature
• IBM
  • Datapalooza, Big Data University (check http://www.spark.tc/)
• Mesosphere
  • We deliver a production-grade distro of Spark on Mesos and DCOS
Reactive Applications 7
“If I have seen further it is by standing on the shoulders of giants”
Isaac Newton
The Team
The Team
A dedicated team that
• Contributes to the Spark project: adds features, reviews PRs, tests
releases, etc.
• Supports customers deploying Spark with online support and on-site
training.
• Promotes Spark technology and/or our RP through talks and other
activities.
• Educates the community with high-quality courses.
9
More on Contribution
The Project - Contributing
• Where to start?
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
describes the steps to create a PR.
• Tip: Bug fixes, and specifically short fixes, can be easier to contribute.
Documentation updates, etc.
• Things you need to understand, as usual:
• the local development/test/debugging lifecycle
• the code style: https://github.com/databricks/scala-style-guide
11
The Project - Contributing...
Tips about debugging:
• Your IDE is your friend, especially for debugging.
You can use SPARK_JAVA_OPTS with spark-shell:
SPARK_JAVA_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,
address=5005"
then remote-debug your code.
For the driver, pass the value to --driver-java-options.
For executors, pass the value to spark.executor.extraJavaOptions (SparkConf).
• As long as your IDE has the sources of the code under examination (it could
be only Spark, for example), you can attach and debug that code only.
12
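The remote-debugging setup above can be sketched as concrete commands. This is a sketch, not fixed values: the port 5005 is just a common choice, and the exact spark-shell/spark-submit paths depend on your distro.

```shell
# Remote-debug the driver: the JVM listens on port 5005 and, with suspend=y,
# waits for the IDE to attach before running anything (port is an example).
./bin/spark-shell \
  --driver-java-options "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005"

# For executors, pass the same agent string through Spark configuration
# (suspend=n so executors do not block waiting for a debugger):
./bin/spark-shell \
  --conf spark.executor.extraJavaOptions="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005"
```

Then create a "Remote" debug configuration in the IDE pointing at the host and port, with the Spark sources attached.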
The project - A Software Engineering View
• Most active Apache project. Spark is big.
• Project size? A rough impression via… LOC (physical number of lines,
comment lines + SLOC), a weak metric, but...
• it gives you an idea when you first jump into the code
• area size
• you can derive comment density, which leads to some interesting
properties (Arafat, O.; Riehle, D.: The Comment Density of Open Source Software Code. IEEE ICSE 2009)
• of course you need to consider complexity, external libs, etc. when you
actually start reading the code…
13
The project - A Software Engineering View
LOC for Spark: 601396 (Scala/Java/Python)
LOC metrics for some major components:
spark/sql: 124898 (Scala), 132028 (Java)
spark/core: 114637 (Scala)
spark/mllib: 70278 (Scala)
spark/streaming: 25807 (Scala)
spark/graphx: 7508 (Scala)
14
The Project - Contributing...
Features:
• Spark Streaming backpressure for 1.5 (joint work with Databricks,
SPARK-7398)
• Add support for dynamic allocation in the Mesos coarse-grained
scheduler (SPARK-6287)
• Reactive Streams Receiver (SPARK-10420), ongoing work…
Integration Tests: Created missing integration tests for Mesos
deployments:
• https://github.com/typesafehub/mesos-spark-integration-tests
Other:
• Fixes
• PR reviews
• Voting (http://www.apache.org/foundation/voting.html)
15
Back-pressure
Spark Streaming - The Big Picture:
Receivers receive data streams and cut them into batches. Spark
processes the batches each batch interval and emits the output.
16
[Diagram: data streams → receivers → batches → Spark → output (Spark Streaming)]
Back-pressure
The problem:
“Spark Streaming ingests data through receivers at the rate of the producer (or a
user-configured rate limit). When the processing time for a batch is longer than the
batch interval, the system is unstable, data queues up, exhaust resources and fails
(with an OOM, for example).”
17
[Diagram: a data stream enters the receiver on an Executor; its block generator, rate-limited by spark.streaming.receiver.maxRate (default infinite, in records per second), cuts it into blocks; block ids go to the Receiver Tracker in the Spark Streaming driver, where on each clock tick the Job Generator produces a jobSet and the Job Scheduler calls runJob on the Spark Context (Spark driver).]
Back-pressure
Solution:
For each completed batch, estimate a new rate based on the previous batch's
processing time and scheduling delay. Propagate the estimated rate to the block
generator (via the ReceiverTracker), which has a RateLimiter (Guava 13.0).
Details:
• We need to listen for batch completion
• We need an algorithm to actually estimate the new limit.
RateEstimator algorithm used: PID control
https://en.wikipedia.org/wiki/PID_controller
18
Back-pressure - PID Controller
K{p,i,d} are the coefficients.
What to use for the error signal: ingestion speed - processing speed.
It can be shown that the scheduling delay is kept within a constant factor of the
integral term, assuming the processing rate did not change much between two
calculations.
Default coefficients: proportional 1.0, integral 0.2, derivative 0.0
19
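The estimation described above can be sketched in Scala. This is a simplified sketch along the lines of Spark's PIDRateEstimator, not the exact Spark source; the class and field names are illustrative, while the default coefficients match the slide.

```scala
// Sketch of PID-based rate estimation (simplified; names illustrative).
class PidRateEstimator(
    batchIntervalMillis: Long,
    proportional: Double = 1.0,
    integral: Double = 0.2,
    derivative: Double = 0.0) {

  private var firstRun = true
  private var latestTime = -1L
  private var latestRate = -1.0
  private var latestError = -1.0

  /** New rate in records/sec, or None when the inputs are unusable. */
  def compute(time: Long, elements: Long,
              processingDelayMs: Long, schedulingDelayMs: Long): Option[Double] = {
    if (time <= latestTime || elements == 0 || processingDelayMs == 0) return None

    val delaySinceUpdate = (time - latestTime).toDouble / 1000.0          // seconds
    val processingRate = elements.toDouble / processingDelayMs * 1000.0   // records/sec
    val error = latestRate - processingRate
    // The scheduling delay feeds the integral term: it measures records queued up.
    val historicalError = schedulingDelayMs.toDouble * processingRate / batchIntervalMillis
    val dError = (error - latestError) / delaySinceUpdate

    val newRate = (latestRate
      - proportional * error
      - integral * historicalError
      - derivative * dError).max(0.0)

    val result = if (firstRun) { firstRun = false; processingRate } else newRate
    latestTime = time; latestRate = result; latestError = error
    Some(result)
  }
}
```

For example, a batch of 100 records processed in 500 ms gives a processing rate of 200 records/sec; a nonzero scheduling delay then pulls the next estimate below that via the integral term.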
Back-pressure
Results:
• Backpressure prevents the receiver's buffer from overflowing.
• It allows building end-to-end reactive applications.
• Composability becomes possible.
20
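For reference, the feature is turned on purely through configuration. A sketch of typical settings: the backpressure flag is available from Spark 1.5, the maxRate value is just an illustrative static cap, and the application class and jar are hypothetical.

```shell
# Enable backpressure-driven rate estimation (Spark 1.5+); an optional static
# cap can still be set via maxRate (10000 records/sec is only an example).
./bin/spark-submit \
  --conf spark.streaming.backpressure.enabled=true \
  --conf spark.streaming.receiver.maxRate=10000 \
  --class com.example.StreamingApp \
  streaming-app.jar
```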
Dynamic Allocation
The problem: Auto-scaling of executors in Spark, already available on Yarn, was
missing for Mesos.
The general model for cluster managers such as Yarn and Mesos:
The application driver/scheduler uses the cluster to acquire resources and creates
executors to run its tasks.
Each executor runs tasks. How many executors do you need to run your tasks?
21
Dynamic Allocation
How does Spark (essentially the application side) request executors?
In coarse-grained mode, if the dynamic allocation flag
(the spark.dynamicAllocation.enabled property) is set, an instance of
ExecutorAllocationManager (a thread) is started from within SparkContext.
Every 100 ms it checks the executors assigned for the current task load
and adjusts the number of executors needed.
22
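A minimal configuration sketch for turning the feature on; the min/max bounds are illustrative values and the application class and jar are hypothetical. Note that dynamic allocation also requires the external shuffle service.

```shell
# Minimal flags to enable dynamic allocation on the coarse-grained scheduler.
# The external shuffle service is a prerequisite.
./bin/spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=10 \
  --class com.example.BatchApp \
  batch-app.jar
```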
Dynamic Allocation
The logic behind executor adjustment in ExecutorAllocationManager...
Calculate the max number of executors needed:
maxNeeded = (pending + running + tasksPerExecutor - 1) /
tasksPerExecutor
numExecutorsTarget = min(maxNeeded,
spark.dynamicAllocation.maxExecutors)
if (numExecutorsTarget < oldTarget) downscale
if (the scheduling delay timer expires) upscale
Also check executor expiry times to kill idle executors.
23
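The target calculation above can be sketched directly (a simplified sketch of what ExecutorAllocationManager computes; the object and method names here are illustrative, not Spark's):

```scala
// Sketch of the executor-target computation performed during dynamic
// allocation (simplified; names are illustrative).
object ExecutorTarget {
  /** Ceiling division: executors needed to cover all pending + running tasks. */
  def maxNeeded(pending: Int, running: Int, tasksPerExecutor: Int): Int =
    (pending + running + tasksPerExecutor - 1) / tasksPerExecutor

  /** Target bounded by the spark.dynamicAllocation.maxExecutors setting. */
  def target(pending: Int, running: Int, tasksPerExecutor: Int, maxExecutors: Int): Int =
    math.min(maxNeeded(pending, running, tasksPerExecutor), maxExecutors)
}
```

For example, 10 pending plus 3 running tasks at 4 tasks per executor need ceil(13/4) = 4 executors.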
Dynamic Allocation
Connecting to the cluster manager:
The executor-adjustment logic calls sc.requestTotalExecutors, which calls
the corresponding method in CoarseGrainedSchedulerBackend (the Yarn and
Mesos scheduler classes extend it), which does the actual executor
management.
• What we did is provide the appropriate methods in the Mesos
CoarseGrainedScheduler:
def doKillExecutors(executorIds: Seq[String])
def doRequestTotalExecutors(requestedTotal: Int)
24
Dynamic Allocation
In Yarn/Mesos you can call the following API to autoscale your app from
your SparkContext (supported only in coarse-grained mode):
sc.requestExecutors
sc.killExecutors
But… “the mesos coarse grain scheduler only supports scaling down
since it is already designed to run one executor per slave with the
configured amount of resources.“
“...can scale back up to the same amount of executors”
25
Dynamic Allocation
A smaller problem solved...
Dynamic allocation needs an external shuffle service
However, there is no reliable way for the shuffle service to clean up the
shuffle data when the driver exits, since the driver may crash before it
notifies the shuffle service, and the shuffle data would then be cached forever.
We need to implement a reliable way to detect driver termination and
clean up shuffle data accordingly.
SPARK-7820, SPARK-8873
26
Mesos Integration Tests
Why?
• This is joint work with Mesosphere.
• Good software engineering practice. Coverage (nice to have)...
• Prevent the Mesos-Spark integration from breaking.
• Faster releases for Spark on Mesos.
• Give Spark developers the option to create a local Mesos cluster to
test their PRs. Anyone can use it; check our repo.
27
Mesos Integration Tests
• It is easy… just build your Spark distro, check out our repository
… and execute ./run_tests.sh distro.tgz
• Optimization of the dev lifecycle is needed (still under development).
• Consists of two parts:
• Scripts to create the cluster
• A test runner that runs the test suite against the cluster.
28
Mesos Integration Tests
• Docker is the technology used to launch the cluster.
• Supports DCOS and local mode.
• Challenges we faced:
• Docker in bridge mode (not supported: SPARK-11638)
• Writing meaningful tests with good assertions.
• Currently the cluster integrates HDFS. We will integrate Zookeeper and
Apache Hive as well.
29
More on Support
Customer Support
• We provide SLAs for different needs, e.g. 24/7 production issues.
• We offer on-site training / online support.
• What customers want so far:
• Training
• On-site consulting / online support
• What do customers ask in support cases?
• Customers usually face problems learning the technology, e.g. how to start
with Spark, but there are also more mature issues, e.g. large-scale
deployment problems.
31
Next Steps
RoadMap
• What is coming...
• Introduce Kerberos security - the challenge here is to deliver the whole
stack: authentication, authorization, encryption.
• Work with Mesosphere on the Typesafe Spark distro and the Mesos Spark
code area.
• Evaluate Tachyon.
• Officially offer support for other Spark libs (GraphX, MLlib)
• ConductR integration
• Spark notebook
33
©Typesafe 2015 – All Rights Reserved

More Related Content

What's hot

What's hot (20)

Real-time streams and logs with Storm and Kafka
Real-time streams and logs with Storm and KafkaReal-time streams and logs with Storm and Kafka
Real-time streams and logs with Storm and Kafka
 
Fast and Reliable Apache Spark SQL Releases
Fast and Reliable Apache Spark SQL ReleasesFast and Reliable Apache Spark SQL Releases
Fast and Reliable Apache Spark SQL Releases
 
Inferno Scalable Deep Learning on Spark
Inferno Scalable Deep Learning on SparkInferno Scalable Deep Learning on Spark
Inferno Scalable Deep Learning on Spark
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache SparkLow Latency Execution For Apache Spark
Low Latency Execution For Apache Spark
 
Spark streaming + kafka 0.10
Spark streaming + kafka 0.10Spark streaming + kafka 0.10
Spark streaming + kafka 0.10
 
Developing Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache KafkaDeveloping Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache Kafka
 
DevoxxUK: Optimizating Application Performance on Kubernetes
DevoxxUK: Optimizating Application Performance on KubernetesDevoxxUK: Optimizating Application Performance on Kubernetes
DevoxxUK: Optimizating Application Performance on Kubernetes
 
Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingSpark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream Processing
 
Reactive Streams, Linking Reactive Application To Spark Streaming
Reactive Streams, Linking Reactive Application To Spark StreamingReactive Streams, Linking Reactive Application To Spark Streaming
Reactive Streams, Linking Reactive Application To Spark Streaming
 
Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...
Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...
Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...
 
Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014
 
A fun cup of joe with open liberty
A fun cup of joe with open libertyA fun cup of joe with open liberty
A fun cup of joe with open liberty
 
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration story
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story[Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration story
 
Resource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache StormResource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache Storm
 
PaaSTA: Autoscaling at Yelp
PaaSTA: Autoscaling at YelpPaaSTA: Autoscaling at Yelp
PaaSTA: Autoscaling at Yelp
 
Zoo keeper in the wild
Zoo keeper in the wildZoo keeper in the wild
Zoo keeper in the wild
 
Have your cake and eat it too
Have your cake and eat it tooHave your cake and eat it too
Have your cake and eat it too
 
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
 
Real-Time Analytics with Kafka, Cassandra and Storm
Real-Time Analytics with Kafka, Cassandra and StormReal-Time Analytics with Kafka, Cassandra and Storm
Real-Time Analytics with Kafka, Cassandra and Storm
 
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...
 

Similar to Typesafe spark- Zalando meetup

Energy efficient AI workload partitioning on multi-core systems
Energy efficient AI workload partitioning on multi-core systemsEnergy efficient AI workload partitioning on multi-core systems
Energy efficient AI workload partitioning on multi-core systems
Deepak Shankar
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Databricks
 
Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application  Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application
Apache Apex
 
Extending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingExtending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event Processing
Oh Chan Kwon
 
Intro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with sparkIntro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with spark
Alex Zeltov
 

Similar to Typesafe spark- Zalando meetup (20)

Energy efficient AI workload partitioning on multi-core systems
Energy efficient AI workload partitioning on multi-core systemsEnergy efficient AI workload partitioning on multi-core systems
Energy efficient AI workload partitioning on multi-core systems
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 
Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析
 
Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application  Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
 
Spark cep
Spark cepSpark cep
Spark cep
 
Extending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingExtending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event Processing
 
Intro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with sparkIntro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with spark
 
Ml2
Ml2Ml2
Ml2
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
 
Distributed & Highly Available server applications in Java and Scala
Distributed & Highly Available server applications in Java and ScalaDistributed & Highly Available server applications in Java and Scala
Distributed & Highly Available server applications in Java and Scala
 
Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Monitor Apache Spark 3 on Kubernetes using Metrics and PluginsMonitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins
 
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
 
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
OS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of MLOS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of ML
 
IoT Austin CUG talk
IoT Austin CUG talkIoT Austin CUG talk
IoT Austin CUG talk
 
Building production spark streaming applications
Building production spark streaming applicationsBuilding production spark streaming applications
Building production spark streaming applications
 
Apache Big Data Europe 2015: Selected Talks
Apache Big Data Europe 2015: Selected TalksApache Big Data Europe 2015: Selected Talks
Apache Big Data Europe 2015: Selected Talks
 

More from Stavros Kontopoulos

ML At the Edge: Building Your Production Pipeline With Apache Spark and Tens...
ML At the Edge:  Building Your Production Pipeline With Apache Spark and Tens...ML At the Edge:  Building Your Production Pipeline With Apache Spark and Tens...
ML At the Edge: Building Your Production Pipeline With Apache Spark and Tens...
Stavros Kontopoulos
 

More from Stavros Kontopoulos (11)

Serverless Machine Learning Model Inference on Kubernetes with KServe.pdf
Serverless Machine Learning Model Inference on Kubernetes with KServe.pdfServerless Machine Learning Model Inference on Kubernetes with KServe.pdf
Serverless Machine Learning Model Inference on Kubernetes with KServe.pdf
 
Online machine learning in Streaming Applications
Online machine learning in Streaming ApplicationsOnline machine learning in Streaming Applications
Online machine learning in Streaming Applications
 
ML At the Edge: Building Your Production Pipeline With Apache Spark and Tens...
ML At the Edge:  Building Your Production Pipeline With Apache Spark and Tens...ML At the Edge:  Building Your Production Pipeline With Apache Spark and Tens...
ML At the Edge: Building Your Production Pipeline With Apache Spark and Tens...
 
Streaming analytics state of the art
Streaming analytics state of the artStreaming analytics state of the art
Streaming analytics state of the art
 
Machine learning at scale challenges and solutions
Machine learning at scale challenges and solutionsMachine learning at scale challenges and solutions
Machine learning at scale challenges and solutions
 
Apache Flink London Meetup - Let's Talk ML on Flink
Apache Flink London Meetup - Let's Talk ML on FlinkApache Flink London Meetup - Let's Talk ML on Flink
Apache Flink London Meetup - Let's Talk ML on Flink
 
Spark Summit EU Supporting Spark (Brussels 2016)
Spark Summit EU Supporting Spark (Brussels 2016)Spark Summit EU Supporting Spark (Brussels 2016)
Spark Summit EU Supporting Spark (Brussels 2016)
 
Voxxed days thessaloniki 21/10/2016 - Streaming Engines for Big Data
Voxxed days thessaloniki 21/10/2016 - Streaming Engines for Big DataVoxxed days thessaloniki 21/10/2016 - Streaming Engines for Big Data
Voxxed days thessaloniki 21/10/2016 - Streaming Engines for Big Data
 
Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016
 
Trivento summercamp fast data 9/9/2016
Trivento summercamp fast data 9/9/2016Trivento summercamp fast data 9/9/2016
Trivento summercamp fast data 9/9/2016
 
Cassandra at Pollfish
Cassandra at PollfishCassandra at Pollfish
Cassandra at Pollfish
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Recently uploaded (20)

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 

Typesafe spark- Zalando meetup

  • 1. The Typesafe Reactive Platform and Apache Spark: Experiences, Challenges and Roadmaps Stavros Kontopoulos, MSc
  • 2. Fast Data and Typesafe’s Reactive Platform
  • 3. Fast Data for Reactive Applications Typesafe’s Fast Data Strategy • Reactive Platform, Fast Data Architecture This strategy aims different market needs . • Microservice architecture with an analytics extension • Analytics oriented based setup where core infrastructure can be managed by mesos-like tools and where Kafka, HDFS and several DBs like Riak, Cassandra are first class citizens. 3
  • 4. Fast Data for Reactive Applications Reactive Platform (RP): • Core elements: Play, Akka, Spark. Scala is the common language. • ConductR is the glue for managing these elements. Fast Data Architecture utilizes RP and is meant to provide end-to-end solutions for highly scalable web apps, IoT and other use cases / requirements. 4
  • 5. Fast Data for Reactive Applications 5 Reactive Application traits
  • 7. Fast Data Partnerships • Databricks •Scala insights, backpressure feature • IBM •Datapalooza, Big data university (check http://www.spark.tc/) • Mesosphere • We deliver production-grade distro of spark on Mesos and DCOS Reactive Applications 7 “If I have seen further it is by standing on the shoulders of giants” Isaac Newton
  • 9. The Team A dedicated team which • Contributes to the Spark project: add features, reviews PRs, test releases etc. • Supports customers deploying spark with online support, on-site trainings. • Promotes spark technology and/or our RP through talks and other activities. • Educates community with high quality courses. 9
  • 11. The Project - Contributing • Where to start? https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spar k Describes the steps to create a PR. • Tip: Bug fixes and specifically short fixes can be easier to contribute. Documentation update etc. • Things you need to understand as usual: • local development/test/debugging lifecycle • How about code style: https://github.com/databricks/scala-style-guide 11
  • 12. The Project - Contributing... Tips about debugging: • Your IDE is your friend, especially with debugging. You could use SPARK_JAVA_OPTS with spark-shell SPARK_JAVA_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y, address=5005" then remote debug your code. For driver pass the value to: --driver-java-options For executors pass the value to: spark.executor.extraJavaOptions (SparkConfig) • As long your IDE has the sources of the code under examination it could be only spark for example, then you can attach and debug that code only. 12
  • 13. The project - A Software Engineering View •Most active Apache project. Spark is big. •Project size? A draft impression via… LOC (physical number of lines, CL + SLOC) weak metric but... •gives you an idea when you first jump into code •area size •you can derive comment density which leads to some interesting properties (Arafat, O.; Riehle, D.: The Comment Density of Open Source Software Code. IEEE ICSE 2009) •of course you need to consider complexity, ext libs etc when you actually start reading the code… 13
  • 14. The project - A Software Engineering View Loc Spark: 601396 (scala/ java/ python) Loc metrics for some major components: spark/sql: 124898(Scala), 132028 (Java) spark/core: 114637 (Scala) spark/mllib: 70278 (Scala) spark/streaming: 25807 (Scala) spark/graphx: 7508 (Scala) 14
  • 15. The Project - Contributing... Features: • Spark streaming backpressure for 1.5 (joint work with Databricks, SPARK-7398) • Add support for dynamic allocation in the Mesos coarse-grained scheduler (SPARK-6287) • Reactive Streams Receiver (SPARK-10420) on going work… Integration Tests: Created missing integration tests for mesos deployments: • https://github.com/typesafehub/mesos-spark-integration-tests Other: • Fixes • PR reviews • Voting (http://www.apache.org/foundation/voting.html) 15
• 16. Back-pressure Spark Streaming - The Big Picture: Receivers receive data streams and cut them into batches. Spark processes the batches every batch interval and emits the output. [Diagram: data streams → receivers → batches → Spark → output] 16
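The model above (a receiver cutting a stream into fixed-interval batches that are then processed) can be mimicked in plain Scala, without Spark, to make the idea concrete; `Record` and `cutIntoBatches` are our own toy names, not Spark API:

```scala
// Toy model of the big picture above (no Spark involved): records carry a
// timestamp, and the "receiver" groups them into batches of one batch
// interval each; a processing function would then run once per batch.
case class Record(timestampMs: Long, payload: String)

def cutIntoBatches(stream: Seq[Record], batchIntervalMs: Long): Seq[Seq[Record]] =
  stream
    .groupBy(r => r.timestampMs / batchIntervalMs) // which interval a record falls into
    .toSeq.sortBy(_._1)                            // emit batches in time order
    .map(_._2)
```

In real Spark Streaming the batch interval is fixed when the StreamingContext is created, and each batch becomes an RDD that regular Spark jobs process.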
• 17. Back-pressure The problem: “Spark Streaming ingests data through receivers at the rate of the producer (or a user-configured rate limit). When the processing time for a batch is longer than the batch interval, the system is unstable, data queues up, exhausts resources and fails (with an OOM, for example).” [Diagram: data stream → receiver on an executor (block generator, limited by spark.streaming.receiver.maxRate, default infinite, in records per second) → blocks; block ids → ReceiverTracker; on the Spark driver, the Spark Streaming driver (JobGenerator, JobScheduler) submits a jobSet via SparkContext.runJob on each clock tick] 17
• 18. Back-pressure Solution: for each completed batch, estimate a new rate based on the previous batch's processing time and scheduling delay. Propagate the estimated rate to the block generator (via the ReceiverTracker), which has a RateLimiter (Guava 13.0). Details: • We need to listen for batch completion. • We need an algorithm to actually estimate the new limit. RateEstimator algorithm used: PID control https://en.wikipedia.org/wiki/PID_controller 18
• 19. Back-pressure - PID Controller K{p,i,d} are the coefficients. What to use for the error signal: ingestion speed - processing speed. It can be shown that the scheduling delay is kept within a constant factor of the integral term, assuming the processing rate did not change much between two calculations. Default coefficients: proportional 1.0, integral 0.2, derivative 0.0 19
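A minimal sketch of the PID estimation described on these slides, in plain Scala. The class name, field names and simplifications are ours; Spark's actual implementation lives in `org.apache.spark.streaming.scheduler.rate.PIDRateEstimator` and differs in detail:

```scala
// Sketch of a PID-based rate estimator. The error is ingestion speed minus
// processing speed; the scheduling delay stands in for the accumulated
// backlog (the integral-like term), as described on the slide.
class PidRateEstimator(
    batchIntervalMillis: Long,
    proportional: Double = 1.0,   // default K_p
    integral: Double = 0.2,       // default K_i
    derivative: Double = 0.0) {   // default K_d

  private var latestTime = -1L
  private var latestRate = -1.0
  private var latestError = 0.0

  /** New rate limit in records/sec, or None if the inputs are unusable. */
  def compute(time: Long, numElements: Long,
              processingDelayMs: Long, schedulingDelayMs: Long): Option[Double] = {
    if (time <= latestTime || numElements == 0 || processingDelayMs == 0) {
      None
    } else {
      // Rate the system actually achieved for the last batch (records/sec).
      val processingRate = numElements.toDouble / processingDelayMs * 1000
      val result =
        if (latestRate < 0) {
          processingRate // first measurement: adopt the observed rate
        } else {
          val delaySeconds = (time - latestTime).toDouble / 1000
          val error = latestRate - processingRate // ingestion minus processing speed
          // Backlog implied by the scheduling delay, drained over one interval.
          val historicalError = schedulingDelayMs * processingRate / batchIntervalMillis
          val dError = (error - latestError) / delaySeconds
          latestError = error
          math.max(0.0,
            latestRate - proportional * error - integral * historicalError - derivative * dError)
        }
      latestTime = time
      latestRate = result
      Some(result)
    }
  }
}
```

Each new rate is pushed down to the block generator's RateLimiter, closing the feedback loop.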
• 20. Back-pressure Results: • Backpressure prevents the receiver’s buffer from overflowing. • It allows building end-to-end reactive applications. • Composability becomes possible. 20
• 21. Dynamic Allocation The problem: auto-scaling executors in Spark, already available on YARN, was missing for Mesos. The general model for cluster managers such as YARN and Mesos: the application driver/scheduler uses the cluster to acquire resources and create executors to run its tasks. Each executor runs tasks. How many executors do you need to run your tasks? 21
• 22. Dynamic Allocation How does Spark (essentially the application side) request executors? In coarse-grained mode, if the dynamic allocation flag is enabled (the spark.dynamicAllocation.enabled property), an instance of ExecutorAllocationManager (a thread) is started from within the SparkContext. Every 100 milliseconds it checks the executors assigned against the current task load and adjusts the number of executors needed. 22
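Dynamic allocation is switched on purely through configuration; a minimal spark-defaults.conf fragment might look like this (the min/max bounds are illustrative values, and the external shuffle service, discussed on a later slide, is required):

```
spark.dynamicAllocation.enabled        true
spark.shuffle.service.enabled          true
spark.dynamicAllocation.minExecutors   1
spark.dynamicAllocation.maxExecutors   10
```

The same properties can be set programmatically on a SparkConf before the SparkContext is created.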
• 23. Dynamic Allocation The logic behind executor adjustment in ExecutorAllocationManager... Calculate the max number of executors needed: maxNeeded = (pending + running + tasksPerExecutor - 1) / tasksPerExecutor numExecutorsTarget = min(maxNeeded, spark.dynamicAllocation.maxExecutors) if (numExecutorsTarget < oldTarget) downscale if (the scheduling delay timer expires) upscale Executor idle timeouts are also checked, so that idle executors can be killed. 23
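The adjustment arithmetic above can be sketched as plain Scala (the function name is ours; in Spark, tasksPerExecutor is derived from spark.executor.cores / spark.task.cpus):

```scala
// Sketch of the executor-count arithmetic from the slide above.
def maxExecutorsNeeded(pendingTasks: Int, runningTasks: Int,
                       tasksPerExecutor: Int, maxExecutors: Int): Int = {
  // Ceiling division: enough executors to cover every pending + running task.
  val maxNeeded = (pendingTasks + runningTasks + tasksPerExecutor - 1) / tasksPerExecutor
  // Never exceed the configured upper bound (spark.dynamicAllocation.maxExecutors).
  math.min(maxNeeded, maxExecutors)
}
```

If the result drops below the previous target, the manager downscales; upscaling happens when the scheduling-delay timer expires.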
• 24. Dynamic Allocation Connecting to the cluster manager: the executor-adjustment logic calls sc.requestTotalExecutors, which calls the corresponding method in CoarseGrainedSchedulerBackend (the YARN and Mesos scheduler classes extend it), which does the actual executor management. • What we did was provide the appropriate methods in the Mesos coarse-grained scheduler: def doKillExecutors(executorIds: Seq[String]) def doRequestTotalExecutors(requestedTotal: Int) 24
• 25. Dynamic Allocation On YARN/Mesos you can call the following API from your SparkContext to autoscale your app (supported only in coarse-grained mode): sc.requestExecutors sc.killExecutors But… “the mesos coarse grain scheduler only supports scaling down since it is already designed to run one executor per slave with the configured amount of resources.“ “...can scale back up to the same amount of executors” 25
• 26. Dynamic Allocation A smaller problem solved... Dynamic allocation needs an external shuffle service. However, there was no reliable way to let the shuffle service clean up the shuffle data when the driver exits, since the driver may crash before it notifies the shuffle service, and the shuffle data would then be cached forever. We needed a reliable way to detect driver termination and clean up shuffle data accordingly. SPARK-7820, SPARK-8873 26
• 27. Mesos Integration Tests Why? • This is joint work with Mesosphere. • Good software engineering practice. Coverage (nice to have)... • Prevent the Mesos-Spark integration from being broken. • Faster releases for Spark on Mesos. • Give Spark developers the option to create a local Mesos cluster to test their PRs. Anyone can use it; check our repo. 27
• 28. Mesos Integration Tests • It is easy… just build your Spark distro, check out our repository… and execute ./run_tests.sh distro.tgz • Optimization of the dev lifecycle is needed (still under development). • Consists of two parts: • Scripts to create the cluster • A test runner which runs the test suite against the cluster. 28
• 29. Mesos Integration Tests • Docker is the technology used to launch the cluster. • Supports DCOS and local mode. • Challenges we faced: • Docker in bridge mode (not supported: SPARK-11638) • Writing meaningful tests with good assertions. • Currently the cluster integrates HDFS. We will integrate Zookeeper and Apache Hive as well. 29
• 31. Customer Support • We provide SLAs for different needs, e.g. 24/7 production issues. • We offer on-site training / on-line support. • What customers want so far: • Training • On-site consulting / on-line support • What do customers ask about in support cases? • Customers usually face problems learning the technology, e.g. how to start with Spark, but there are also more mature issues, e.g. large-scale deployment problems. 31
• 33. Roadmap • What is coming... • Introduce Kerberos security - the challenge here is to deliver the whole thing: authentication, authorization, encryption. • Work with Mesosphere on the Typesafe Spark distro and the Mesos Spark code area. • Evaluate Tachyon. • Officially offer support for other Spark libs (GraphX, MLlib) • ConductR integration • Spark notebook 33
  • 34. ©Typesafe 2015 – All Rights Reserved