www.clairvoyantsoft.com
Productionalizing Spark Streaming
Applications
By: Robert Sanders
Quick Poll
| 2
| 3
Robert Sanders
Big Data Manager and Engineer
Robert Sanders is an Engineering Manager at
Clairvoyant. In his day job, Robert wears multiple hats
and goes back and forth between Architecting and
Engineering large scale Data platforms. Robert has a deep
background in enterprise systems, initially working on
fullstack implementations and then focusing on building
Data Management Platforms.
| 4
About
Background Awards & Recognition
Boutique consulting firm centered on building data solutions and
products
All things Web and Data Engineering, Analytics, ML and User
Experience to bring it all together
Support core Hadoop platform, data engineering pipelines and provide
administrative and devops expertise focused on Hadoop
| 5
● What is Spark Streaming and Kafka?
● Steps to Production
○ Managing the Streaming Application (Starting and Stopping)
○ Monitoring
○ Prevent Data Loss
■ Checkpointing
■ Implementing Kafka Delivery Semantics
● Stability
● Summary
Agenda
| 6
● Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant
stream processing of live data streams.
What is Spark Streaming?
Spark Streaming - https://spark.apache.org/docs/latest/img/streaming-arch.png
| 7
● Spark Streaming divides the live input into Micro Batches, which are processed by the Spark Engine
Processing in Spark Streaming
Spark Streaming Processing - https://spark.apache.org/docs/latest/img/streaming-flow.png
| 8
● Apache Kafka® is a Distributed Streaming Platform
● Kafka is a Circular Buffer
○ Data gets written to disk
○ As the retention limit is reached, old log files are removed
What is Kafka?
Kafka - https://kafka.apache.org/images/kafka_diagram.png
| 9
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
The Starting Point
| 10
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
ConsumerConfig.GROUP_ID_CONFIG -> groupId,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer])
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String]
(topicsSet, kafkaParams))
The Starting Point
| 11
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
ConsumerConfig.GROUP_ID_CONFIG -> groupId,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer])
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String]
(topicsSet, kafkaParams))
// Get the lines, split them into words, count the words and print
val lines = messages.map(_.value)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
wordCounts.print()
The Starting Point
| 12
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
ConsumerConfig.GROUP_ID_CONFIG -> groupId,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer])
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String]
(topicsSet, kafkaParams))
// Get the lines, split them into words, count the words and print
val lines = messages.map(_.value)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
wordCounts.print()
// Start the computation
ssc.start()
The Starting Point
| 13
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
ConsumerConfig.GROUP_ID_CONFIG -> groupId,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer])
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String]
(topicsSet, kafkaParams))
// Get the lines, split them into words, count the words and print
val lines = messages.map(_.value)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
wordCounts.print()
// Start the computation
ssc.start()
ssc.awaitTermination()
The Starting Point
| 14
1. Build your JAR (or Python File)
2. Execute the spark-submit command:
$ spark-submit
--class "org.apache.spark.testSimpleApp"
--master local[4]
/path/to/jar/simple-project_2.11-1.0.jar
Starting the Spark Streaming Application
| 15
1. Build your JAR (or Python File)
2. Execute the spark-submit command:
$ spark-submit
--class "org.apache.spark.testSimpleApp"
--master local[4]
/path/to/jar/simple-project_2.11-1.0.jar
Starting the Spark Streaming Application
What’s that?
| 16
● Local
○ --master local
○ --master local[2]
○ --master local[*]
● Spark Standalone
○ --master spark://{HOST}:{PORT}/
● Yarn
○ --master yarn
● Mesos
○ --master mesos://{HOST}:{PORT}
● Kubernetes
○ --master k8s://{HOST}:{PORT}
Spark Masters
| 17
● Local
○ --master local
○ --master local[2]
○ --master local[*]
● Spark Standalone
○ --master spark://{HOST}:{PORT}/
● Yarn
○ --master yarn
● Mesos
○ --master mesos://{HOST}:{PORT}
● Kubernetes
○ --master k8s://{HOST}:{PORT}
Spark Masters
| 18
● Spark Version <= 1.6.3
○ Yarn Client Mode:
■ --master yarn-client
○ YARN Cluster Mode:
■ --master yarn-cluster
● Spark Version >= 2.0
○ YARN Client Mode:
■ --master yarn --deploy-mode client
○ YARN Cluster Mode:
■ --master yarn --deploy-mode cluster
Spark-YARN Integration
| 19
Spark Architecture
Spark Architecture - http://blog.cloudera.com/wp-content/uploads/2014/05/spark-yarn-f1.png
| 20
YARN Client Mode
YARN Client Mode - https://4.bp.blogspot.com/-lFcEx4GDEg4/WMgZQRjDRrI/AAAAAAAADt0/SA1v6gtRGGknkmTINUWbCg5ufEM7rVb9gCLcB/s1600/SparkYanrClusterMode.jpg
| 21
YARN Cluster Mode
YARN Cluster Mode - https://4.bp.blogspot.com/-lFcEx4GDEg4/WMgZQRjDRrI/AAAAAAAADt0/SA1v6gtRGGknkmTINUWbCg5ufEM7rVb9gCLcB/s1600/SparkYanrClusterMode.jpg
Use YARN Cluster Mode
| 22
| 23
● spark-default.conf
○ spark.yarn.maxAppAttempts=2
○ spark.yarn.am.attemptFailuresValidityInterval=1h
YARN Cluster Mode Configurations
| 24
● spark-default.conf
○ spark.yarn.maxAppAttempts=2
○ spark.yarn.am.attemptFailuresValidityInterval=1h
YARN will make up to 2 attempts to start the application; failed attempts older than 1 hour no longer count against that limit.
YARN Cluster Mode Configurations
| 25
$ spark-submit
--class "org.apache.testSimpleApp"
--master yarn
--deploy-mode cluster
--conf spark.yarn.maxAppAttempts=2
--conf spark.yarn.am.attemptFailuresValidityInterval=1h
/path/to/jar/simple-project_2.11-1.0.jar
YARN Cluster Mode Configurations
| 26
val sparkConf = new SparkConf()
.setAppName("App")
.set("spark.yarn.maxAppAttempts", "1")
.set("spark.yarn.am.attemptFailuresValidityInterval", "2h")
val ssc = new StreamingContext(sparkConf, Seconds(2))
YARN Cluster Mode Configurations
| 27
● yarn application -kill {ApplicationID}
Shutting Down the Spark Streaming Application
What if a Micro Batch is
processing when we kill the
application?!
| 28
Shut the Streaming
Application down Gracefully
| 29
| 30
1. On Spark Streaming Startup
a. Create a touch file in HDFS
2. Within the Spark Code
a. Periodically check if the touch file still exists
b. If the touch file doesn’t exist, start the Graceful Shutdown process
3. To Stop
a. Delete the touch file and wait for the Graceful Shutdown process to complete
Tip: Build a shell script to do these start and stop operations
Graceful Shutdown
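A minimal sketch of step 1a, creating the touch file from the driver at startup. The helper name and its arguments are illustrative and mirror the checkShutdownMarker helper shown a couple of slides later.
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.streaming.StreamingContext
// Sketch: create the empty "running" marker file in HDFS when the application starts
def createRunningMarker(ssc: StreamingContext, runningMarkerTouchFileLocation: String): Unit = {
  val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)
  fs.create(new Path(runningMarkerTouchFileLocation)).close()
}
The stop script then only needs to delete this file and wait for the application to finish its Graceful Shutdown.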
| 31
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
ConsumerConfig.GROUP_ID_CONFIG -> groupId,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer])
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String]
(topicsSet, kafkaParams))
// Get the lines, split them into words, count the words and print
val lines = messages.map(_.value)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
wordCounts.print()
// Start the computation
ssc.start()
ssc.awaitTermination()
The Starting Point
| 32
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
ConsumerConfig.GROUP_ID_CONFIG -> groupId,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer])
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String]
(topicsSet, kafkaParams))
// Get the lines, split them into words, count the words and print
val lines = messages.map(_.value)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
wordCounts.print()
// Start the computation
ssc.start()
ssc.awaitTermination()
The Starting Point - Step one to Graceful Shutdown
Replace
| 33
var TRIGGER_STOP = false
var ssc: StreamingContext = …
// Define Stream Creation, Transformations and Actions.
ssc.start()
var isStopped = false
while (!isStopped) {
isStopped = ssc.awaitTerminationOrTimeout(SPARK_SHUTDOWN_CHECK_MILLIS)
if (isStopped)
LOGGER.info("The Spark Streaming context is stopped. Exiting application...")
else
LOGGER.info("Streaming App is still running. Timeout...")
checkShutdownMarker(ssc, SPARK_SHUTDOWN_RUNNING_MARKER_TOUCH_FILE_LOCATION)
if (!isStopped && TRIGGER_STOP) {
LOGGER.info("Stopping the ssc Spark Streaming Context...")
ssc.stop(stopSparkContext = true, stopGracefully = true)
LOGGER.info("Spark Streaming Context is Stopped!")
}
}
Graceful Shutdown
| 34
def checkShutdownMarker(ssc: StreamingContext, runningMarkerTouchFileLocation: String): Unit = {
LOGGER.info("Checking if the running flag (" + runningMarkerTouchFileLocation + ") still exists...")
if (!TRIGGER_STOP) {
val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)
val fileExists = fs.exists(new Path(runningMarkerTouchFileLocation))
LOGGER.info("Running File Exists: " + fileExists)
TRIGGER_STOP = !fileExists
if (TRIGGER_STOP)
LOGGER.info("Running File does not exist. Triggering Stop...")
else
LOGGER.info("Running File exists. NOT triggering shutdown.")
} else {
LOGGER.info("Skipping as the Stop Trigger has already been set")
}
}
Graceful Shutdown (cont.)
| 35
● Operational monitoring - Ganglia, Graphite
○ http://spark.apache.org/docs/latest/monitoring#metrics
● StreamingListener (Spark >=2.1)
○ onBatchSubmitted
○ onBatchStarted
○ onBatchCompleted
○ onReceiverStarted
○ onReceiverStopped
○ onReceiverError
Monitoring
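A sketch of how those callbacks get wired in, assuming the same LOGGER used in the shutdown code; the listener class name and the fields logged are illustrative only.
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}
// Sketch: log basic health numbers for every completed micro batch
class BatchStatsListener extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    LOGGER.info("Batch with " + info.numRecords + " records completed. " +
      "Scheduling delay: " + info.schedulingDelay.getOrElse(-1L) + " ms, " +
      "processing time: " + info.processingDelay.getOrElse(-1L) + " ms")
  }
}
ssc.addStreamingListener(new BatchStatsListener)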
| 36
Monitoring - Spark UI - Streaming Tab
| 37
● Metadata checkpointing
○ Configuration
○ DStream Operations
○ Incomplete Batches
● Data checkpointing
○ Saves the RDDs in each micro batch to reliable storage
Preventing Data Loss - Checkpointing
| 38
● Required if using stateful transformations (updateStateByKey or
reduceByKeyAndWindow)
● Used to recover from Driver failures
Preventing Data Loss - Checkpointing
| 39
val checkpointDirectory = "hdfs://..." // define checkpoint directory
// Function to create and setup a new StreamingContext
def functionToCreateContext(): StreamingContext = {
val ssc = new StreamingContext(...) // new context
val lines = ssc.socketTextStream(...) // create DStreams
...
ssc.checkpoint(checkpointDirectory) // set checkpoint directory
ssc // Return the StreamingContext
}
// Get StreamingContext from checkpoint data or create a new one
val ssc = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext)
ssc.start() // Start the context
Checkpointing
| 40
● Checkpoints can’t survive Spark Version Upgrades
● Clear the checkpoint directory between Code Upgrades
Checkpointing Problems
| 41
Receiver Based Streaming
Spark Streaming Receiver Based Streaming - https://databricks.com/wp-content/uploads/2015/03/Screen-Shot-2015-03-29-at-10.11.42-PM.png
| 42
● Data in the Receiver is stored in the Executors’ memory
● If we don’t have a WAL, on Executor failure, the data will be lost
● Once the data is written to the WAL, acknowledgement is passed to Kafka
Why have a WAL?
| 43
● Enable Checkpointing
○ Logs will be written to the Checkpoint Directory
● Enable WAL in Spark Configuration
○ spark.streaming.receiver.writeAheadLog.enable=true
● When using the WAL, the data is already persisted to HDFS. Disable in-memory replication.
○ Use StorageLevel.MEMORY_AND_DISK_SER
Recovering data with the WAL
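A minimal sketch of the settings above applied to a receiver-based stream (the spark-streaming-kafka-0-8 API, since the WAL only matters for receivers); the ZooKeeper quorum, group id, topic map and checkpoint path are placeholders.
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
val sparkConf = new SparkConf()
  .setAppName("ReceiverBasedWordCount")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.checkpoint("hdfs:///path/to/checkpoint") // the WAL is written under the checkpoint directory
// A single in-memory copy is enough because the WAL already persists the data to HDFS
val messages = KafkaUtils.createStream(
  ssc, "zkhost1:2181,zkhost2:2181", "my-consumer-group",
  Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)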
Thought: Kafka already
stores replicated copies of
the data in a circular buffer.
Why do I need a WAL?
| 44
| 45
Direct Stream
Spark Streaming Direct Stream - https://databricks.com/wp-content/uploads/2015/03/Screen-Shot-2015-03-29-at-10.14.11-PM.png
Use the Direct Stream
| 46
| 47
● Creating a Topic:
kafka-topics --zookeeper <host>:2181 --create --topic <topic-name> --partitions <number-of-partitions> --replication-factor <number-of-replicas>
Kafka Topics
Kafka Writes - https://www.analyticshut.com/wp-content/uploads/2018/04/topic.png
| 48
Consuming from a Kafka Topic
Kafka Reads - https://fizalihsan.github.io/technology/kafka-partition-consumer.png
When setting up your Kafka
Topics, setup multiple
Partitions
| 49
| 50
● You need to track your own Kafka Offsets
○ Use ZooKeeper, HDFS, HBase, Kudu, DB, etc
● Checkpoints are not recoverable across code or cluster upgrades
● For Exactly-Once Delivery Semantics
○ store offsets after an idempotent output OR
○ store offsets in an atomic transaction alongside output
Direct Stream Gotchas
| 51
Managing Kafka Offsets
Managing Kafka Offsets - http://blog.cloudera.com/wp-content/uploads/2017/06/Spark-Streaming-flow-for-offsets.png
| 52
val storedOffsets: Option[mutable.Map[TopicPartition, Long]] = loadOffsets(spark, kuduContext)
val kafkaDStream = storedOffsets match {
case None =>
LOGGER.info("storedOffsets was None")
kafkaParams += ("auto.offset.reset" -> "latest")
KafkaUtils.createDirectStream[String, Array[Byte]]
(ssc, PreferConsistent, ConsumerStrategies.Subscribe[String, Array[Byte]]
(topicsSet, kafkaParams)
)
case Some(fromOffsets) =>
LOGGER.info("storedOffsets was Some(" + fromOffsets + ")")
kafkaParams += ("auto.offset.reset" -> "none")
KafkaUtils.createDirectStream[String, Array[Byte]]
(ssc, PreferConsistent, ConsumerStrategies.Assign[String, Array[Byte]]
(fromOffsets.keys.toList, kafkaParams, fromOffsets)
)
}
Managing Offsets
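The code above covers loading offsets on startup; here is a minimal sketch of the saving side, run at the end of every micro batch. saveOffsets is a placeholder for whatever store holds the offsets (Kudu in this flow, but HBase, ZooKeeper or a database work the same way).
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}
kafkaDStream.foreachRDD { rdd =>
  // Offset ranges are only available on the RDDs produced directly by the direct stream
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... write the batch output here (idempotently, or in the same transaction as the offsets) ...
  // Placeholder: persist topic, partition and untilOffset so the next run can resume from here
  saveOffsets(offsetRanges)
}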
| 53
Average Batch Processing Time < Batch Interval
Stability
| 54
Stability - Spark UI - Streaming Tab
| 55
● Optimize reads, transformations and writes
● Caching
● Increase Parallelism
○ More partitions in Kafka
○ More Executors and Cores
● Repartition the data after receiving it
○ dstream.repartition(100)
● Increase Batch Duration
Improving Stability
| 56
● Use YARN Cluster Mode
● Gracefully Shutdown your application
● Monitor your job
● Use Checkpointing (but be careful)
● Setup Multiple Partitions in your Kafka Topics
● Use Direct Streams
● Save your Offsets
● Stabilize your Streaming Application
Summary
Thank You!
| 57
Questions?
hello@clairvoyantsoft.com
Contenu connexe

Tendances

Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerEvan Chan
 
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Spark Summit
 
Building large scale applications in yarn with apache twill
Building large scale applications in yarn with apache twillBuilding large scale applications in yarn with apache twill
Building large scale applications in yarn with apache twillHenry Saputra
 
Developing Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache KafkaDeveloping Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache KafkaJoe Stein
 
PSUG #52 Dataflow and simplified reactive programming with Akka-streams
PSUG #52 Dataflow and simplified reactive programming with Akka-streamsPSUG #52 Dataflow and simplified reactive programming with Akka-streams
PSUG #52 Dataflow and simplified reactive programming with Akka-streamsStephane Manciot
 
Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkTaras Matyashovsky
 
Tachyon and Apache Spark
Tachyon and Apache SparkTachyon and Apache Spark
Tachyon and Apache Sparkrhatr
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Data Con LA
 
The Hidden Life of Spark Jobs
The Hidden Life of Spark JobsThe Hidden Life of Spark Jobs
The Hidden Life of Spark JobsDataWorks Summit
 
Apache Beam (incubating)
Apache Beam (incubating)Apache Beam (incubating)
Apache Beam (incubating)Apache Apex
 
Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)tsliwowicz
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Lucidworks
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark StreamingKnoldus Inc.
 
Lambda architecture with Spark
Lambda architecture with SparkLambda architecture with Spark
Lambda architecture with SparkVincent GALOPIN
 
Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...
Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...
Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...Databricks
 
Apache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabApache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabAbhinav Singh
 

Tendances (20)

Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
 
Spark on yarn
Spark on yarnSpark on yarn
Spark on yarn
 
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
 
Building large scale applications in yarn with apache twill
Building large scale applications in yarn with apache twillBuilding large scale applications in yarn with apache twill
Building large scale applications in yarn with apache twill
 
Hadoop and Spark
Hadoop and SparkHadoop and Spark
Hadoop and Spark
 
Developing Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache KafkaDeveloping Real-Time Data Pipelines with Apache Kafka
Developing Real-Time Data Pipelines with Apache Kafka
 
PSUG #52 Dataflow and simplified reactive programming with Akka-streams
PSUG #52 Dataflow and simplified reactive programming with Akka-streamsPSUG #52 Dataflow and simplified reactive programming with Akka-streams
PSUG #52 Dataflow and simplified reactive programming with Akka-streams
 
spark-kafka_mod
spark-kafka_modspark-kafka_mod
spark-kafka_mod
 
Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache Spark
 
Tachyon and Apache Spark
Tachyon and Apache SparkTachyon and Apache Spark
Tachyon and Apache Spark
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
 
The Hidden Life of Spark Jobs
The Hidden Life of Spark JobsThe Hidden Life of Spark Jobs
The Hidden Life of Spark Jobs
 
Apache Beam (incubating)
Apache Beam (incubating)Apache Beam (incubating)
Apache Beam (incubating)
 
Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
Lambda architecture with Spark
Lambda architecture with SparkLambda architecture with Spark
Lambda architecture with Spark
 
Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...
Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...
Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an...
 
Apache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabApache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLab
 
Cassandra & Spark for IoT
Cassandra & Spark for IoTCassandra & Spark for IoT
Cassandra & Spark for IoT
 

Similaire à Productionalizing spark streaming applications

Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Spark Summit
 
KSQL - Stream Processing simplified!
KSQL - Stream Processing simplified!KSQL - Stream Processing simplified!
KSQL - Stream Processing simplified!Guido Schmutz
 
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...DataWorks Summit
 
Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingDatabricks
 
Building a Real-time Streaming ETL Framework Using ksqlDB and NoSQL
Building a Real-time Streaming ETL Framework Using ksqlDB and NoSQLBuilding a Real-time Streaming ETL Framework Using ksqlDB and NoSQL
Building a Real-time Streaming ETL Framework Using ksqlDB and NoSQLScyllaDB
 
Event streaming webinar feb 2020
Event streaming webinar feb 2020Event streaming webinar feb 2020
Event streaming webinar feb 2020Maheedhar Gunturu
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming JobsDatabricks
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionLightbend
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 
Spark streaming with kafka
Spark streaming with kafkaSpark streaming with kafka
Spark streaming with kafkaDori Waldman
 
Spark stream - Kafka
Spark stream - Kafka Spark stream - Kafka
Spark stream - Kafka Dori Waldman
 
Writing Blazing Fast, and Production-Ready Kafka Streams apps in less than 30...
Writing Blazing Fast, and Production-Ready Kafka Streams apps in less than 30...Writing Blazing Fast, and Production-Ready Kafka Streams apps in less than 30...
Writing Blazing Fast, and Production-Ready Kafka Streams apps in less than 30...HostedbyConfluent
 
KDD 2016 Streaming Analytics Tutorial
KDD 2016 Streaming Analytics TutorialKDD 2016 Streaming Analytics Tutorial
KDD 2016 Streaming Analytics TutorialNeera Agarwal
 
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一scalaconfjp
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Databricks
 
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...Codemotion
 
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...Codemotion
 
Kick your database_to_the_curb_reston_08_27_19
Kick your database_to_the_curb_reston_08_27_19Kick your database_to_the_curb_reston_08_27_19
Kick your database_to_the_curb_reston_08_27_19confluent
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 

Similaire à Productionalizing spark streaming applications (20)

Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
 
KSQL - Stream Processing simplified!
KSQL - Stream Processing simplified!KSQL - Stream Processing simplified!
KSQL - Stream Processing simplified!
 
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
 
Strata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark StreamingStrata NYC 2015: What's new in Spark Streaming
Strata NYC 2015: What's new in Spark Streaming
 
Building a Real-time Streaming ETL Framework Using ksqlDB and NoSQL
Building a Real-time Streaming ETL Framework Using ksqlDB and NoSQLBuilding a Real-time Streaming ETL Framework Using ksqlDB and NoSQL
Building a Real-time Streaming ETL Framework Using ksqlDB and NoSQL
 
Event streaming webinar feb 2020
Event streaming webinar feb 2020Event streaming webinar feb 2020
Event streaming webinar feb 2020
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
 
A Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In ProductionA Tale of Two APIs: Using Spark Streaming In Production
A Tale of Two APIs: Using Spark Streaming In Production
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Spark streaming with kafka
Spark streaming with kafkaSpark streaming with kafka
Spark streaming with kafka
 
Spark stream - Kafka
Spark stream - Kafka Spark stream - Kafka
Spark stream - Kafka
 
Writing Blazing Fast, and Production-Ready Kafka Streams apps in less than 30...
Writing Blazing Fast, and Production-Ready Kafka Streams apps in less than 30...Writing Blazing Fast, and Production-Ready Kafka Streams apps in less than 30...
Writing Blazing Fast, and Production-Ready Kafka Streams apps in less than 30...
 
KDD 2016 Streaming Analytics Tutorial
KDD 2016 Streaming Analytics TutorialKDD 2016 Streaming Analytics Tutorial
KDD 2016 Streaming Analytics Tutorial
 
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
Building a Unified Data Pipline in Spark / Apache Sparkを用いたBig Dataパイプラインの統一
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
 
Meetup spark structured streaming
Meetup spark structured streamingMeetup spark structured streaming
Meetup spark structured streaming
 
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
 
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
Jörg Schad - Hybrid Cloud (Kubernetes, Spark, HDFS, …)-as-a-Service - Codemot...
 
Kick your database_to_the_curb_reston_08_27_19
Kick your database_to_the_curb_reston_08_27_19Kick your database_to_the_curb_reston_08_27_19
Kick your database_to_the_curb_reston_08_27_19
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 

Plus de Robert Sanders

Migrating Big Data Workloads to the Cloud
Migrating Big Data Workloads to the CloudMigrating Big Data Workloads to the Cloud
Migrating Big Data Workloads to the CloudRobert Sanders
 
Delivering digital transformation and business impact with io t, machine lear...
Delivering digital transformation and business impact with io t, machine lear...Delivering digital transformation and business impact with io t, machine lear...
Delivering digital transformation and business impact with io t, machine lear...Robert Sanders
 
Apache Airflow in Production
Apache Airflow in ProductionApache Airflow in Production
Apache Airflow in ProductionRobert Sanders
 
Airflow Clustering and High Availability
Airflow Clustering and High AvailabilityAirflow Clustering and High Availability
Airflow Clustering and High AvailabilityRobert Sanders
 
Databricks Community Cloud Overview
Databricks Community Cloud OverviewDatabricks Community Cloud Overview
Databricks Community Cloud OverviewRobert Sanders
 

Plus de Robert Sanders (6)

Migrating Big Data Workloads to the Cloud
Migrating Big Data Workloads to the CloudMigrating Big Data Workloads to the Cloud
Migrating Big Data Workloads to the Cloud
 
Delivering digital transformation and business impact with io t, machine lear...
Delivering digital transformation and business impact with io t, machine lear...Delivering digital transformation and business impact with io t, machine lear...
Delivering digital transformation and business impact with io t, machine lear...
 
Apache Airflow in Production
Apache Airflow in ProductionApache Airflow in Production
Apache Airflow in Production
 
Airflow Clustering and High Availability
Airflow Clustering and High AvailabilityAirflow Clustering and High Availability
Airflow Clustering and High Availability
 
Databricks Community Cloud Overview
Databricks Community Cloud OverviewDatabricks Community Cloud Overview
Databricks Community Cloud Overview
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 

Dernier

Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 

Dernier (20)

Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 

Productionalizing spark streaming applications

  • 3. | 3 Robert Sanders Big Data Manager and Engineer Robert Sanders is an Engineering Manager at Clairvoyant. In his day job, Robert wears multiple hats and goes back and forth between Architecting and Engineering large scale Data platforms. Robert has deep background in enterprise systems, initially working on fullstack implementations and then focusing on building Data Management Platforms.
  • 4. | 4 About Background Awards & Recognition Boutique consulting firm centered on building data solutions and products All things Web and Data Engineering, Analytics, ML and User Experience to bring it all together Support core Hadoop platform, data engineering pipelines and provide administrative and devops expertise focused on Hadoop
  • 5. | 5 ● What is Spark Streaming and Kafka? ● Steps to Production ○ Managing the Streaming Application (Starting and Stopping) ○ Monitoring ○ Prevent Data Loss ■ Checkpointing ■ Implementing Kafka Delivery Semantics ● Stability ● Summary Agenda
  • 6. | 6 ● Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. What is Spark Streaming? Spark Streaming - https://spark.apache.org/docs/latest/img/streaming-arch.png
  • 7. | 7 ● Spark processes Micro Batches of data from the input on the Spark Engine Processing in Spark Streaming Spark Streaming Processing - https://spark.apache.org/docs/latest/img/streaming-flow.png
  • 8. | 8 ● Apache Kafka® is a Distributed Streaming Platform ● Kafka is a Circular Buffer ○ Data gets written to disk ○ As data gets filled up, old files are removed What is Kafka? Kafka - https://kafka.apache.org/images/kafka_diagram.png
  • 9. | 9 val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount") val ssc = new StreamingContext(sparkConf, Seconds(2)) The Starting Point
  • 10. | 10 val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount") val ssc = new StreamingContext(sparkConf, Seconds(2)) // Create direct kafka stream with brokers and topics val topicsSet = topics.split(",").toSet val kafkaParams = Map[String, Object]( ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers, ConsumerConfig.GROUP_ID_CONFIG -> groupId, ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer], ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer]) val messages = KafkaUtils.createDirectStream[String, String]( ssc, LocationStrategies.PreferConsistent, ConsumerStrategies.Subscribe[String, String] (topicsSet, kafkaParams)) The Starting Point
  • 11. | 11 val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount") val ssc = new StreamingContext(sparkConf, Seconds(2)) // Create direct kafka stream with brokers and topics val topicsSet = topics.split(",").toSet val kafkaParams = Map[String, Object]( ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers, ConsumerConfig.GROUP_ID_CONFIG -> groupId, ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer], ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer]) val messages = KafkaUtils.createDirectStream[String, String]( ssc, LocationStrategies.PreferConsistent, ConsumerStrategies.Subscribe[String, String] (topicsSet, kafkaParams)) // Get the lines, split them into words, count the words and print val lines = messages.map(_.value) val words = lines.flatMap(_.split(" ")) val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _) wordCounts.print() The Starting Point
  • 12. | 12 val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount") val ssc = new StreamingContext(sparkConf, Seconds(2)) // Create direct kafka stream with brokers and topics val topicsSet = topics.split(",").toSet val kafkaParams = Map[String, Object]( ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers, ConsumerConfig.GROUP_ID_CONFIG -> groupId, ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer], ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer]) val messages = KafkaUtils.createDirectStream[String, String]( ssc, LocationStrategies.PreferConsistent, ConsumerStrategies.Subscribe[String, String] (topicsSet, kafkaParams)) // Get the lines, split them into words, count the words and print val lines = messages.map(_.value) val words = lines.flatMap(_.split(" ")) val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _) wordCounts.print() // Start the computation ssc.start() The Starting Point
  • 13. | 13 val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount") val ssc = new StreamingContext(sparkConf, Seconds(2)) // Create direct kafka stream with brokers and topics val topicsSet = topics.split(",").toSet val kafkaParams = Map[String, Object]( ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers, ConsumerConfig.GROUP_ID_CONFIG -> groupId, ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer], ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer]) val messages = KafkaUtils.createDirectStream[String, String]( ssc, LocationStrategies.PreferConsistent, ConsumerStrategies.Subscribe[String, String] (topicsSet, kafkaParams)) // Get the lines, split them into words, count the words and print val lines = messages.map(_.value) val words = lines.flatMap(_.split(" ")) val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _) wordCounts.print() // Start the computation ssc.start() ssc.awaitTermination() The Starting Point
  • 14. | 14 1. Build your JAR (or Python File) 2. Execute the spark-submit command: $ spark-submit --class "org.apache.spark.testSimpleApp" --master local[4] /path/to/jar/simple-project_2.11-1.0.jar Starting the Spark Streaming Application
  • 15. | 15 1. Build your JAR (or Python File) 2. Execute the spark-submit command: $ spark-submit --class "org.apache.spark.testSimpleApp" --master local[4] /path/to/jar/simple-project_2.11-1.0.jar Starting the Spark Streaming Application What’s that?
  • 16. | 16 ● Local ○ --master local ○ --master local[2] ○ --master local[*] ● Spark Standalone ○ --master spark://{HOST}:{PORT}/ ● Yarn ○ --master yarn ● Mesos ○ --master mesos://{HOST}:{PORT} ● Kubernetes ○ --master k8s://{HOST}:{PORT} Spark Masters
  • 17. | 17 ● Local ○ --master local ○ --master local[2] ○ --master local[*] ● Spark Standalone ○ --master spark://{HOST}:{PORT}/ ● Yarn ○ --master yarn ● Mesos ○ --master mesos://{HOST}:{PORT} ● Kubernetes ○ --master k8s://{HOST}:{PORT} Spark Masters
  • 18. | 18 ● Spark Version <= 1.6.3 ○ Yarn Client Mode: ■ --master yarn-client ○ YARN Cluster Mode: ■ --master yarn-cluster ● Spark Version >= 2.0 ○ YARN Client Mode: ■ --master yarn --deploy-mode client ○ YARN Cluster Mode: ■ --master yarn --deploy-mode cluster Spark-YARN Integration
  • 19. | 19 Spark Architecture Spark Architecture - http://blog.cloudera.com/wp-content/uploads/2014/05/spark-yarn-f1.png
  • 20. | 20 YARN Client Mode YARN Client Mode - https://4.bp.blogspot.com/-lFcEx4GDEg4/WMgZQRjDRrI/AAAAAAAADt0/SA1v6gtRGGknkmTINUWbCg5ufEM7rVb9gCLcB/s1600/SparkYanrClusterMode.jpg
  • 21. | 21 YARN Cluster Mode YARN Cluster Mode - https://4.bp.blogspot.com/-lFcEx4GDEg4/WMgZQRjDRrI/AAAAAAAADt0/SA1v6gtRGGknkmTINUWbCg5ufEM7rVb9gCLcB/s1600/SparkYanrClusterMode.jpg
  • 22. Use YARN Cluster Mode | 22
  • 23. | 23 ● spark-default.conf ○ spark.yarn.maxAppAttempts=2 ○ spark.yarn.am.attemptFailuresValidityInterval=1h YARN Cluster Mode Configurations
  • 24. | 24 ● spark-default.conf ○ spark.yarn.maxAppAttempts=2 ○ spark.yarn.am.attemptFailuresValidityInterval=1h Every 1 hour it will attempt to start the App 2 times. YARN Cluster Mode Configurations
  • 25. | 25 $ spark-submit --class "org.apache.testSimpleApp" --master yarn --deploy-mode cluster --conf spark.yarn.maxAppAttempts=2 --conf spark.yarn.am.attemptFailuresValidityInterval=1h /path/to/jar/simple-project_2.11-1.0.jar YARN Cluster Mode Configurations
  • 26. | 26 val sparkConf = new SparkConf() .setAppName("App") .set("spark.yarn.maxAppAttempts", "1") .set("spark.yarn.am.attemptFailuresValidityInterval", "2h") val ssc = new StreamingContext(sparkConf, Seconds(2)) YARN Cluster Mode Configurations
  • 27. | 27 ● yarn application -kill {ApplicationID} Shutting Down the Spark Streaming Application
  • 28. What if a Micro Batch is processing when we kill the application?! | 28
  • 29. Shut the Streaming Application down Gracefully | 29
  • 30. | 30 1. On Spark Streaming Startup a. Create a touch file in HDFS 2. Within the Spark Code a. Periodically check if the touch file still exists b. If the touch file doesn’t exist, start the Graceful Shutdown process 3. To Stop a. Delete the touch file and wait for the Graceful Shutdown process to complete Tip: Build a shell script to do these start and stop operations Graceful Shutdown
  • 31. | 31 val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount") val ssc = new StreamingContext(sparkConf, Seconds(2)) // Create direct kafka stream with brokers and topics val topicsSet = topics.split(",").toSet val kafkaParams = Map[String, Object]( ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers, ConsumerConfig.GROUP_ID_CONFIG -> groupId, ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer], ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer]) val messages = KafkaUtils.createDirectStream[String, String]( ssc, LocationStrategies.PreferConsistent, ConsumerStrategies.Subscribe[String, String] (topicsSet, kafkaParams)) // Get the lines, split them into words, count the words and print val lines = messages.map(_.value) val words = lines.flatMap(_.split(" ")) val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _) wordCounts.print() // Start the computation ssc.start() ssc.awaitTermination() The Starting Point
  • 32. | 32 val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount") val ssc = new StreamingContext(sparkConf, Seconds(2)) // Create direct kafka stream with brokers and topics val topicsSet = topics.split(",").toSet val kafkaParams = Map[String, Object]( ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers, ConsumerConfig.GROUP_ID_CONFIG -> groupId, ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer], ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer]) val messages = KafkaUtils.createDirectStream[String, String]( ssc, LocationStrategies.PreferConsistent, ConsumerStrategies.Subscribe[String, String] (topicsSet, kafkaParams)) // Get the lines, split them into words, count the words and print val lines = messages.map(_.value) val words = lines.flatMap(_.split(" ")) val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _) wordCounts.print() // Start the computation ssc.start() ssc.awaitTermination() The Starting Point - Step one to Graceful Shutdown Replace
  • 33. | 33 var TRIGGER_STOP = false var ssc: StreamingContext = … // Define Stream Creation, Transformations and Actions. ssc.start() var isStopped = false while (!isStopped) { isStopped = ssc.awaitTerminationOrTimeout(SPARK_SHUTDOWN_CHECK_MILLIS) if (isStopped) LOGGER.info("The Spark Streaming context is stopped. Exiting application...") else LOGGER.info("Streaming App is still running. Timeout...") checkShutdownMarker(ssc, SPARK_SHUTDOWN_RUNNING_MARKER_TOUCH_FILE_LOCATION) if (!isStopped && TRIGGER_STOP) { LOGGER.info("Stopping the ssc Spark Streaming Context...") ssc.stop(stopSparkContext = true, stopGracefully = true) LOGGER.info("Spark Streaming Context is Stopped!") } } Graceful Shutdown
  • 34. | 34 def checkShutdownMarker(ssc: StreamingContext, runningMarkerTouchFileLocation: String): Unit = { LOGGER.info("Checking if the running flag (" + runningMarkerTouchFileLocation + ") still exists...") if (!TRIGGER_STOP) { val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration) val fileExists = fs.exists(new Path(runningMarkerTouchFileLocation)) LOGGER.info("Running File Exists: " + fileExists) TRIGGER_STOP = !fileExists if (TRIGGER_STOP) LOGGER.info("Running File does not exist. Triggering Stop...") else LOGGER.info("Running File exists. NOT triggering shutdown.") } else { LOGGER.info("Skipping as the Stop Trigger has already been set") } } Graceful Shutdown (cont.)
  • 35. | 35 ● Operational monitoring - Ganglia, Graphite ○ http://spark.apache.org/docs/latest/monitoring#metrics ● StreamingListener (Spark >=2.1) ○ onBatchSubmitted ○ onBatchStarted ○ onBatchCompleted ○ onReceiverStarted ○ onReceiverStopped ○ onReceiverError Monitoring
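As a minimal sketch (not part of the original deck), a custom StreamingListener can push these callbacks into your own logging or alerting; LOGGER is assumed to be the same logger used in the other snippets:

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted, StreamingListenerReceiverError}

class BatchStatsListener extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    // Processing time creeping up toward the batch interval is the early warning sign for instability.
    LOGGER.info("Batch of " + info.numRecords + " records processed in " +
      info.processingDelay.getOrElse(-1L) + " ms (scheduling delay " +
      info.schedulingDelay.getOrElse(-1L) + " ms)")
  }

  override def onReceiverError(receiverError: StreamingListenerReceiverError): Unit = {
    LOGGER.error("Receiver error: " + receiverError.receiverInfo.lastErrorMessage)
  }
}

// Register the listener before ssc.start()
ssc.addStreamingListener(new BatchStatsListener)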
  • 36. | 36 Monitoring - Spark UI - Streaming Tab
  • 37. | 37 ● Metadata checkpointing ○ Configuration ○ DStream Operations ○ Incomplete Batches ● Data checkpointing ○ Saves the generated RDDs of each micro batch to reliable storage Preventing Data Loss - Checkpointing
  • 38. | 38 ● Required if using stateful transformations (updateStateByKey or reduceByKeyAndWindow) ● Used to recover from Driver failures Preventing Data Loss - Checkpointing
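For illustration, a minimal stateful sketch that builds on the wordCounts DStream from the starting-point example; this is the kind of transformation that makes checkpointing mandatory, and it fails at runtime if ssc.checkpoint(...) has not been set:

// Running total per word, carried across micro batches (the state that forces checkpointing).
val updateRunningCount = (newCounts: Seq[Long], runningCount: Option[Long]) =>
  Some(newCounts.sum + runningCount.getOrElse(0L))

val runningWordCounts = wordCounts.updateStateByKey[Long](updateRunningCount)
runningWordCounts.print()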
  • 39. | 39 val checkpointDirectory = "hdfs://..." // define checkpoint directory // Function to create and setup a new StreamingContext def functionToCreateContext(): StreamingContext = { val ssc = new StreamingContext(...) // new context val lines = ssc.socketTextStream(...) // create DStreams ... ssc.checkpoint(checkpointDirectory) // set checkpoint directory ssc // Return the StreamingContext } // Get StreamingContext from checkpoint data or create a new one val ssc = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext) ssc.start() // Start the context Checkpointing
  • 40. | 40 ● Checkpoints can’t survive Spark Version Upgrades ● Clear the checkpoint directory between Code Upgrades Checkpointing Problems
  • 41. | 41 Receiver Based Streaming Spark Streaming Receiver Based Streaming - https://databricks.com/wp-content/uploads/2015/03/Screen-Shot-2015-03-29-at-10.11.42-PM.png
  • 42. | 42 ● Data in the Receiver is stored within the Executors Memory ● If we don’t have a WAL, on Executor failure, the data will be lost ● Once the data is written to the WAL, acknowledgement is passed to Kafka Why have a WAL?
  • 43. | 43 ● Enable Checkpointing ○ Logs will be written to the Checkpoint Directory ● Enable the WAL in the Spark Configuration ○ spark.streaming.receiver.writeAheadLog.enable=true ● When using the WAL, the data is already persisted to HDFS. Disable in-memory replication. ○ Use StorageLevel.MEMORY_AND_DISK_SER Recovering data with the WAL
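A minimal sketch of these three settings together; it uses a socket receiver for brevity rather than the receiver-based Kafka API (the same StorageLevel argument applies there), and the checkpoint path is hypothetical:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf()
  .setAppName("ReceiverWithWAL")
  // Received blocks are written to the write ahead log before being acknowledged
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.checkpoint("hdfs:///app/checkpoints/receiver-with-wal") // the WAL lives under the checkpoint directory

// The WAL already persists the data to HDFS, so skip the second in-memory replica.
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)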
  • 44. Thought: Kafka already stores replicated copies of the data in a circular buffer. Why do I need a WAL? | 44
  • 45. | 45 Direct Stream Spark Streaming Direct Stream - https://databricks.com/wp-content/uploads/2015/03/Screen-Shot-2015-03-29-at-10.14.11-PM.png
  • 46. Use the Direct Stream | 46
  • 47. | 47 ● Creating a Topic: kafka-topics --zookeeper <host>:2181 --create --topic <topic-name> --partitions <number-of-partitions> --replication-factor <number-of-replicas> Kafka Topics Kafka Writes - https://www.analyticshut.com/wp-content/uploads/2018/04/topic.png
  • 48. | 48 Consuming from a Kafka Topic Kafka Reads - https://fizalihsan.github.io/technology/kafka-partition-consumer.png
  • 49. When setting up your Kafka Topics, set up multiple Partitions | 49
  • 50. | 50 ● You need to track your own Kafka Offsets ○ Use ZooKeeper, HDFS, HBase, Kudu, a DB, etc. ● Checkpoints are not recoverable across code or cluster upgrades ● For Exactly-Once Delivery Semantics ○ store offsets after an idempotent output OR ○ store offsets in an atomic transaction alongside the output Direct Stream Gotchas
  • 51. | 51 Managing Kafka Offsets Managing Kafka Offsets - http://blog.cloudera.com/wp-content/uploads/2017/06/Spark-Streaming-flow-for-offsets.png
  • 52. | 52 val storedOffsets: Option[mutable.Map[TopicPartition, Long]] = loadOffsets(spark, kuduContext) val kafkaDStream = storedOffsets match { case None => LOGGER.info("storedOffsets was None") kafkaParams += ("auto.offset.reset" -> "latest") KafkaUtils.createDirectStream[String, Array[Byte]] (ssc, PreferConsistent, ConsumerStrategies.Subscribe[String, Array[Byte]] (topicsSet, kafkaParams) ) case Some(fromOffsets) => LOGGER.info("storedOffsets was Some(" + fromOffsets + ")") kafkaParams += ("auto.offset.reset" -> "none") KafkaUtils.createDirectStream[String, Array[Byte]] (ssc, PreferConsistent, ConsumerStrategies.Assign[String, Array[Byte]] (fromOffsets.keys.toList, kafkaParams, fromOffsets) ) } Managing Offsets
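The other half of the picture is persisting the offsets after each micro batch. A minimal sketch is below; saveOffsets is a hypothetical helper that writes topic, partition and untilOffset to the same store (Kudu, HBase, a DB, ...) that loadOffsets reads from:

import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

kafkaDStream.foreachRDD { rdd =>
  // The offsets covered by this micro batch, one range per topic partition.
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... write the batch output idempotently here ...

  // Persist offsets only after the output succeeds, so a failure replays the batch instead of skipping it.
  saveOffsets(offsetRanges)
}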
  • 53. | 53 Average Batch Processing Time < Batch Interval Stability
  • 54. | 54 Stability - Spark UI - Streaming Tab
  • 55. | 55 ● Optimize reads, transformations and writes ● Caching ● Increase Parallelism ○ More partitions in Kafka ○ More Executors and Cores ● Repartition the data after receiving it ○ dstream.repartition(100) ● Increase the Batch Duration Improving Stability
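A few of these tips as a minimal sketch, reusing sparkConf and messages from the starting-point example rather than standing alone:

// A longer batch interval amortizes per-batch overhead, at the cost of latency.
val ssc = new StreamingContext(sparkConf, Seconds(10))

// Fan the received records out across more tasks than there are Kafka partitions.
val lines = messages.map(_.value).repartition(100)

// Cache when the same DStream feeds multiple outputs, so each micro batch is only computed once.
lines.cache()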
  • 56. | 56 ● Use YARN Cluster Mode ● Gracefully Shutdown your application ● Monitor your job ● Use Checkpointing (but be careful) ● Setup Multiple Partitions in your Kafka Topics ● Use Direct Streams ● Save your Offsets ● Stabilize your Streaming Application Summary

Editor's notes

  1. Metadata checkpointing - Saving of the information defining the streaming computation to fault-tolerant storage like HDFS. This is used to recover from failure of the node running the driver of the streaming application (discussed in detail later). Metadata includes:
     - Configuration - The configuration that was used to create the streaming application.
     - DStream operations - The set of DStream operations that define the streaming application.
     - Incomplete batches - Batches whose jobs are queued but have not completed yet.
     Data checkpointing - Saving of the generated RDDs to reliable storage. This is necessary in some stateful transformations that combine data across multiple batches. In such transformations, the generated RDDs depend on RDDs of previous batches, which causes the length of the dependency chain to keep increasing with time. To avoid such unbounded increases in recovery time (proportional to the dependency chain), intermediate RDDs of stateful transformations are periodically checkpointed to reliable storage (e.g. HDFS) to cut off the dependency chains.
     Checkpointing must be enabled for applications with any of the following requirements:
     - Usage of stateful transformations - If either updateStateByKey or reduceByKeyAndWindow (with inverse function) is used in the application, then the checkpoint directory must be provided to allow for periodic RDD checkpointing.
     - Recovering from failures of the driver running the application - Metadata checkpoints are used to recover with progress information.
     Note that simple streaming applications without the aforementioned stateful transformations can be run without enabling checkpointing. The recovery from driver failures will also be partial in that case (some received but unprocessed data may be lost). This is often acceptable, and many run Spark Streaming applications in this way. Support for non-Hadoop environments is expected to improve in the future.
  2. (Same checkpointing note as above.)
  3. Since Kafka already stores replicated copies of the data, we won’t need a WAL.