Bartosz.Jankiewicz@gmail.com, Scalapolis 2016
Make yourself a scalable
pipeline with Apache Spark
Google Data Flow, 2014
The future of data processing is unbounded data.
Though bounded data will always have an
important and useful place, it is semantically
subsumed by its unbounded counterpart.
Jaikumar Vijayan, eWeek 2015
Analyst firms like Forrester expect demand for streaming
analytics services and technologies to grow in the next
few years as more organisations try to extract value from
the huge volumes of data being generated these days
from transactions, Web clickstreams, mobile applications
and cloud services.
❖ Integrate user activity information
❖ Enable nearly real-time analytics
❖ Scale to millions of visits per day
❖ Respond to rapidly emerging requirements
❖ Enable data-science techniques on top of collected data
❖ Do the above with reasonable cost
Canonical architecture
[Diagram: Sources (web, sensor, audit-event, micro-service) and Ingestion]
Apache Spark
❖ Started in 2009
❖ Developed in Scala with Akka
❖ Polyglot: Currently supports Scala, Java, Python and R
❖ The largest BigData community as of 2015
Spark use-cases
❖ Data integration and ETL
❖ Interactive analytics
❖ Machine learning and advanced analytics
Apache Spark
❖ Scalable
❖ Fast
❖ Elegant programming
model
❖ Fault tolerant
Scalable
• Scalable by design
• Scales to hundreds of nodes
• Proven in production by
many companies
Fast
• You can optimise both for
latency and throughput
• Reduced hardware appetite
due to various optimisations
• Further improvements
added with Structured
Streaming in Spark 2.0
Programming model
• Functional paradigm
• Easy to run, easy to test
• Polyglot (R, Scala, Python,
Java)
• Batch and streaming APIs
are very similar
• REPL - a.k.a. Spark shell
Fault tolerance
• Data is distributed and
replicated
• Seamlessly recovers from
node failure
• Zero data loss guarantees
due to the write-ahead log
Runtime model
[Diagram: the Driver Program (your code and the SparkContext) with Executors #1 to #4 and partitions p1 to p6]
RDD - Resilient Distributed Dataset
val textFile = sc.textFile("hdfs://…")
[Diagram: the Driver Program, Executors #1 to #4 and Data nodes #1 to #4]
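import org.apache.spark.rdd.RDD

val rdd: RDD[String] =
  sc.textFile(…)
val wordsRDD = rdd
  .flatMap(line => line.split(" "))
// Action: groups the words by length and returns the result to the driver
val lengthHistogram = wordsRDD
  .groupBy(word => word.length)
  .collect
// Action: saveAsTextFile returns Unit, so there is nothing to assign
// (saveAsHadoopFile is only available on key-value RDDs)
wordsRDD
  .filter(word =>
    word.startsWith("a"))
  .saveAsTextFile("hdfs://…")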
Meet DAG
[Diagram: directed acyclic graph with nodes A, B, C, D, E and F]
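The lineage that Spark turns into this DAG can be inspected from code with `toDebugString`. A minimal sketch, assuming a local master and a small in-memory collection standing in for the HDFS file (both are illustrative choices, not taken from the slides):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageSketch {
  // Pure helper: the same split used in the word examples.
  def words(line: String): Array[String] = line.split(" ")

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with two threads, for experimentation only.
    val conf = new SparkConf().setAppName("lineage-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical in-memory input standing in for sc.textFile(…).
      val lines = sc.parallelize(Seq("a scalable pipeline", "with apache spark"))
      // Transformations are lazy: only the lineage is recorded here.
      val byLength = lines.flatMap(l => words(l)).groupBy(_.length)
      // Prints the chain of RDDs; indentation marks a shuffle (stage) boundary.
      println(byLength.toDebugString)
      // An action such as collect triggers execution of the DAG.
      byLength.collect().foreach { case (len, ws) =>
        println(s"$len -> ${ws.mkString(",")}")
      }
    } finally sc.stop()
  }
}
```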
DStream
❖ Series of small and deterministic batch jobs
❖ Spark chops live stream into batches
❖ Each micro-batch processing produces a result
time [s]: 1 2 3 4 5 6
RDD1 RDD2 RDD3 RDD4 RDD5 RDD6
Streaming program
val dstream: DStream[String] = …
val wordCounts = dstream
  .flatMap(line => line.split(" "))
  .transform(_.map(_.toUpperCase))
  .countByValue()
// print() is an output operation and returns Unit
wordCounts.print()
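The streaming snippet leaves out how the context is created and started. A minimal end-to-end sketch, assuming a socket source on localhost:9999, a local master and a 10-second batch interval (all three are illustrative assumptions, not from the slides):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  // Pure helper so the per-line logic can be tested without a cluster.
  def toWords(line: String): Seq[String] =
    line.split(" ").filter(_.nonEmpty).map(_.toUpperCase).toSeq

  def main(args: Array[String]): Unit = {
    // At least two local threads: one for the receiver, one for processing.
    val conf = new SparkConf().setAppName("streaming-word-count").setMaster("local[2]")
    // Assumption: every 10 seconds of input becomes one micro-batch (one RDD).
    val ssc = new StreamingContext(conf, Seconds(10))
    // Hypothetical source: text lines read from a socket.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(l => toWords(l)).countByValue()
    // Output operation: prints the first elements of every batch.
    counts.print()
    // Nothing runs until the context is started.
    ssc.start()
    ssc.awaitTermination()
  }
}
```

For a quick local trial, something like `nc -lk 9999` can feed the socket.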
It’s not a free lunch
❖ The abstractions are leaky
❖ You need to control level of parallelism
❖ You need to understand impact of transformations
❖ Don’t materialise partitions in the foreachPartition
operation
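The last point can be sketched as follows: inside `foreachPartition`, consume the iterator record by record instead of copying it into a collection. The `Sink` trait and the `openSink` factory are hypothetical stand-ins for a real database or HTTP client; the factory must be serializable because it is shipped to the executors.

```scala
import org.apache.spark.streaming.dstream.DStream

object PartitionSink {
  // Hypothetical sink, standing in for a database or HTTP client.
  trait Sink extends AutoCloseable { def write(record: String): Unit }

  // Streams the iterator through the sink one record at a time.
  def drain(records: Iterator[String], sink: Sink): Unit =
    try records.foreach(sink.write)
    finally sink.close()

  def save(stream: DStream[String], openSink: () => Sink): Unit =
    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // One sink per partition, created on the executor.
        // Calling records.toList or records.toArray here would load
        // the whole partition into memory at once.
        drain(records, openSink())
      }
    }
}
```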
Performance factors
• Network operations
• Data locality
• Total number of cores
• How much you can chunk
your work
• Memory usage and GC
• Serialization
Level of parallelism
❖ Number of receivers aligned with number of executors
❖ Number of threads aligned with number of cores and
nature of operations - blocking or non-blocking
❖ Your data needs to be chunked to make use of your
hardware
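One way to act on these points is to run several receivers, union their streams and repartition each batch. A sketch under assumptions: the socket source, the host, the ports and the partition count are hypothetical and would need tuning against the real cluster.

```scala
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

object ParallelIngest {
  // One hypothetical port per receiver, counted up from basePort.
  def receiverPorts(basePort: Int, numReceivers: Int): Seq[Int] =
    (0 until numReceivers).map(basePort + _)

  def parallelLines(ssc: StreamingContext, host: String, basePort: Int,
                    numReceivers: Int, numPartitions: Int): DStream[String] = {
    // Each receiver permanently occupies one core, so the cluster needs
    // more cores than receivers to have capacity left for processing.
    val streams = receiverPorts(basePort, numReceivers)
      .map(port => ssc.socketTextStream(host, port))
    // union merges the receiver streams into one DStream;
    // repartition spreads every batch across numPartitions tasks.
    ssc.union(streams).repartition(numPartitions)
  }
}
```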
Stateful transformations
❖ Stateful transformation example
❖ Stateful DStream operators can have infinite lineages
❖ That leads to high failure-recovery time
❖ Spark solves that problem with checkpointing
val actions: DStream[(String, UserAction)] = …
val hotCategories =
  actions.mapWithState(StateSpec.function(stateFunction))
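The slides do not show `stateFunction` or `UserAction`. A possible shape, assuming the key is a category name, the state is a running count per category, and `UserAction` is a simple case class (all three are assumptions made for illustration):

```scala
import org.apache.spark.streaming.{State, StateSpec, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

object HotCategories {
  // Hypothetical event type; the real UserAction is not shown in the slides.
  case class UserAction(userId: String, kind: String)

  // Pure update step: the previous count, plus one if an action arrived.
  def nextCount(previous: Option[Long], action: Option[UserAction]): Long =
    previous.getOrElse(0L) + (if (action.isDefined) 1L else 0L)

  // Shape expected by StateSpec.function: (key, new value, state) => emitted record.
  val stateFunction: (String, Option[UserAction], State[Long]) => (String, Long) =
    (category, action, state) => {
      val count = nextCount(state.getOption(), action)
      state.update(count)
      (category, count)
    }

  def hotCategories(ssc: StreamingContext,
                    actions: DStream[(String, UserAction)]): DStream[(String, Long)] = {
    // Stateful operators need a checkpoint directory; this path is a placeholder.
    ssc.checkpoint("/tmp/spark-checkpoint")
    actions.mapWithState(StateSpec.function(stateFunction))
  }
}
```

Checkpointing is what truncates the otherwise ever-growing lineage of the state stream, so recovery does not have to replay it from the beginning.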
Monitoring
❖ Spark Web UI
❖ Metrics:
❖ Console
❖ Ganglia Sink
❖ Graphite Sink (works great with Grafana)
❖ JMX
❖ REST API
Types of sources
❖ Basic sources:
❖ Sockets, HDFS, Akka actors
❖ Advanced sources:
❖ Kafka, Kinesis, Flume, MQTT
❖ Custom sources:
❖ Receiver interface
Apache Kafka
Greasing the wheels for big data
❖ Incredibly fast message bus
❖ Distributed and fault tolerant
❖ Highly scalable
❖ Strong order guarantees
❖ Easy to replicate across multiple regions
Broker 1
Producer
Broker 2
Consumer
Spark 💕 Kafka
❖ Native integration through
direct-stream API
❖ Offset information is stored
in write-ahead logs
❖ A restart of the Spark driver reloads
offsets which weren't
processed
❖ Needs to be explicitly enabled
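A sketch of the direct-stream API, assuming the `spark-streaming-kafka-0-10` integration module (the deck predates it and may have used the older 0.8 variant). The broker address, group id and topic are hypothetical.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object KafkaSource {
  // Consumer settings; brokers and groupId are supplied by the caller.
  def kafkaParams(brokers: String, groupId: String): Map[String, Object] = Map(
    "bootstrap.servers" -> brokers,
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> groupId,
    "auto.offset.reset" -> "latest",
    // Let the Spark job, not the Kafka consumer, decide when offsets count as processed.
    "enable.auto.commit" -> (false: java.lang.Boolean)
  )

  // Returns only the message values of the given topic.
  def values(ssc: StreamingContext, topic: String): DStream[String] =
    KafkaUtils
      .createDirectStream[String, String](
        ssc,
        PreferConsistent,
        // Hypothetical broker and consumer group.
        Subscribe[String, String](Seq(topic), kafkaParams("localhost:9092", "pipeline-sketch")))
      .map(_.value)
}
```

Recovery after a driver restart additionally depends on a checkpoint directory being set on the context, which is not on by default.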
Storage considerations
❖ HDFS works well for large, batch workloads
❖ HBase works well for random reads and writes
❖ HDFS is well suited for analytical queries
❖ HBase is well suited for interaction with web pages and
certain types of range queries
❖ It pays off to persist all data in raw format
Lessons learnt
Architecture
Final thoughts
❖ Start with a reasonably large batch duration (~10 seconds)
❖ Adapt your level of parallelism
❖ Use Kryo for faster serialisation
❖ Don’t even start without good monitoring
❖ Find bottlenecks using Spark UI and monitoring
❖ The issues are usually in the environment surrounding Spark
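The first and third points translate into a few lines of configuration. A sketch, assuming a hypothetical `UserAction` record type and leaving the master to be supplied by `spark-submit`:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object PipelineConf {
  // Hypothetical record type, registered so Kryo can avoid writing full class names.
  case class UserAction(userId: String, kind: String)

  def sparkConf(appName: String): SparkConf =
    new SparkConf()
      .setAppName(appName)
      // Replace the default Java serialisation with Kryo.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[UserAction]))

  // Start with a generous 10-second batch interval and tune downwards
  // once the Spark UI shows batches finishing comfortably within it.
  def streamingContext(conf: SparkConf): StreamingContext =
    new StreamingContext(conf, Seconds(10))
}
```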
?
The End
Bartosz Jankiewicz
@oborygen
bartosz.jankiewicz@gmail.com
References
❖ http://spark.apache.org/docs/latest/streaming-programming-guide.html
❖ https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark/details
❖ http://milinda.pathirage.org/kappa-architecture.com/
❖ http://lambda-architecture.net
❖ http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/