This document discusses state management in Apache Spark Structured Streaming. It begins by introducing Structured Streaming and differentiating between stateless and stateful stream processing. It then explains the need for state stores to manage intermediate data in stateful processing. It describes how state was managed inefficiently in old Spark Streaming using RDDs and snapshots, and how Structured Streaming improved on this with its decoupled, asynchronous, and incremental state persistence approach. The document outlines Apache Spark's implementation of storing state to HDFS and the involved code entities. It closes by discussing potential issues with this approach and how embedded stores like RocksDB may help address them in production stream processing systems.
How many of you have some idea about streaming, have worked on any streaming system, or understand the term “state management”? ... This should be useful for every one of you.
State is information about past input that can be used to influence the processing of future input; we will see this in detail.
Feel free to ask questions at any point during the presentation.
Why would you want to listen to this? Although the talk is specific to Spark Structured Streaming, the design, architecture, concepts, and thought process behind why things are the way they are will give you a good understanding of any streaming technology. They are all like distant cousins of the same family, and you will see many overlaps between different streaming systems. Understanding one helps you understand the others. Many of them copy, or let's say are inspired by, each other.
It will give you the perspective of a streaming engine developer.
*Quick question: What do you infer from this picture?
*Pretty much sums up the difference between batch and stream processing
Batch is data at rest: you take a chunk of data each time you process. In streaming, data keeps arriving and you need to process it as and when it comes.
We will see a running version of this example in a Qubole Notebook after understanding state management.
START THE CLUSTER
The objective of showing this code example is to give you an idea of stateful processing, so that when we talk about state management you can relate to it and understand it easily.
Having given a rough idea of Structured Streaming, let's start with the actual topic that we want to discuss today.
By analogy to SQL, the select and where clauses of a query are usually stateless, but join, group by and aggregation functions like sum and count require state.
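To make the SQL analogy concrete, here is a minimal plain-Python sketch (not Spark code). The event data, `stateless_filter`, and `RunningSum` are made up for illustration: the filter behaves like a where clause and looks only at the current batch, while the running sum behaves like a group by with sum and has to remember totals between batches.

```python
from collections import defaultdict

# Two micro-batches of (user, amount) events arriving over time (made-up data).
batches = [
    [("alice", 5), ("bob", 12), ("alice", 20)],
    [("bob", 3), ("alice", 15)],
]

def stateless_filter(batch, min_amount=10):
    """Like WHERE amount >= 10: each batch is handled on its own."""
    return [(user, amount) for user, amount in batch if amount >= min_amount]

class RunningSum:
    """Like GROUP BY user with SUM(amount): needs state across batches."""

    def __init__(self):
        self.state = defaultdict(int)  # survives from one batch to the next

    def process(self, batch):
        for user, amount in batch:
            self.state[user] += amount
        return dict(self.state)  # copy of the totals seen so far

filtered = [stateless_filter(b) for b in batches]
agg = RunningSum()
totals = [agg.process(b) for b in batches]
print(filtered)
print(totals)
```

If the process restarts between the two batches, the filter loses nothing, but the running sum loses `alice` and `bob`'s earlier totals unless `state` was saved somewhere. That saved copy is what a state store provides.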
Intermediate information in stream processing
State of progress: offsets/commits
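A rough sketch of how progress tracking with offsets and commits can work, in plain Python. It assumes the usual pattern of recording the planned offset range before a micro-batch runs and a commit marker after it finishes. `ProgressLog` and its methods are hypothetical names for illustration, not Spark classes.

```python
class ProgressLog:
    """Toy stand-in for an offset log and a commit log in a checkpoint directory."""

    def __init__(self):
        self.offsets = {}     # batch_id -> (start, end), written BEFORE processing
        self.commits = set()  # batch_ids, written AFTER the batch has finished

    def plan_batch(self, batch_id, start, end):
        self.offsets[batch_id] = (start, end)

    def commit_batch(self, batch_id):
        self.commits.add(batch_id)

    def batch_to_run_on_restart(self):
        """Return (batch_id, offsets) to re-run, or None if nothing is pending."""
        if not self.offsets:
            return None
        last = max(self.offsets)
        if last in self.commits:
            return None
        return last, self.offsets[last]

log = ProgressLog()
log.plan_batch(0, 0, 100)
log.commit_batch(0)
log.plan_batch(1, 100, 250)  # imagine a crash here, before the commit
pending = log.batch_to_run_on_restart()
print(pending)
```

On restart, batch 1 is re-run over exactly the same offset range, which is what lets the engine combine replayable sources with versioned state to avoid double counting.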
Often it is easier to understand something when it is compared with its predecessor. Evolution is a constant process: something new comes along because of the limitations of the old.
Story about experience with stateless stream processing, maintaining offsets in ZooKeeper
This is the meat of the talk, the part I want to go into in detail.
I prepared this diagram from my understanding of the internal code and how it works in the upcoming Spark 2.4.
It is very important to note here that concepts like incremental checkpointing and asynchronous state management are not specific to Spark Streaming. You will find them in other streaming systems such as Flink as well, under different names.
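The idea of incremental checkpointing can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not Spark's or Flink's implementation: each commit persists only the keys that changed (a delta), a separate maintenance step folds deltas into a full snapshot, and recovery loads the latest snapshot and replays the deltas after it. `VersionedStore` and its method names are hypothetical.

```python
class VersionedStore:
    """Toy model of delta-plus-snapshot state persistence."""

    def __init__(self):
        self.deltas = {}          # version -> changed keys; None marks a removed key
        self.snapshots = {0: {}}  # version -> full copy of the state

    def commit(self, version, changes):
        """Persist only what changed in this batch (incremental checkpoint)."""
        self.deltas[version] = dict(changes)

    def maintenance(self, version):
        """Fold deltas into a full snapshot; meant to run off the processing path."""
        self.snapshots[version] = self.load(version)

    def load(self, version):
        """Rebuild state: latest snapshot at or below `version`, then replay deltas."""
        base = max(v for v in self.snapshots if v <= version)
        state = dict(self.snapshots[base])
        for v in range(base + 1, version + 1):
            for key, value in self.deltas[v].items():
                if value is None:
                    state.pop(key, None)
                else:
                    state[key] = value
        return state

store = VersionedStore()
store.commit(1, {"a": 1, "b": 1})
store.commit(2, {"a": 2})
store.maintenance(2)                   # snapshot taken at version 2
store.commit(3, {"b": None, "c": 1})   # "b" removed, "c" added
recovered = store.load(3)              # snapshot 2 + delta 3
print(recovered)
```

In a real engine the maintenance step would run on a background thread so that snapshotting never blocks batch processing; here it is an ordinary method call to keep the sketch deterministic.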
Slide for those interested in checking out the code themselves
Classes/interfaces/methods involved in state management
Won't go into detail; instead will show the code flow of state management in the next slide
The stateful operator is where the logic to interact with the state store resides.
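As a rough picture of that interaction, here is a plain-Python sketch of one stateful operator handling a micro-batch: open the store at the previous version, read and update keys, then commit to produce the next version, or abort on failure. The method names are loosely modeled on the get/put/commit/abort style of a state store interface, but `Provider`, `Store`, and `count_operator` are hypothetical and are not Spark's actual classes.

```python
class Store:
    """Hypothetical handle on the state for one batch, opened at a given version."""

    def __init__(self, provider, version):
        self.provider = provider
        self.version = version
        self.data = dict(provider.versions[version])  # working copy

    def get(self, key):
        return self.data.get(key)

    def put(self, key, value):
        self.data[key] = value

    def commit(self):
        new_version = self.version + 1
        self.provider.versions[new_version] = self.data
        return new_version

    def abort(self):
        self.data = None  # discard uncommitted changes

class Provider:
    """Hypothetical provider that hands out stores by version."""

    def __init__(self):
        self.versions = {0: {}}

    def get_store(self, version):
        return Store(self, version)

def count_operator(provider, version, words):
    """Stateful word count for one micro-batch; returns the new state version."""
    store = provider.get_store(version)
    try:
        for word in words:
            store.put(word, (store.get(word) or 0) + 1)
        return store.commit()
    except Exception:
        store.abort()
        raise

provider = Provider()
v1 = count_operator(provider, 0, ["spark", "flink", "spark"])
v2 = count_operator(provider, v1, ["spark"])
print(v2, provider.versions[v2])
```

Because every batch starts from an explicit version and writes a new one, re-running a failed batch from `v1` produces the same `v2` again instead of counting the words twice.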
Show code
Before I go forward, do you have any questions here?
Because now I have a question for you:
Do you see any possible issues with this architecture?
Honestly, I have not encountered any issues, but let's discuss what the possible issues with this approach could be.
Go back to architecture diagram
Had intentionally not talked about RocksDB at the start; now is the time. Really wanted to talk about this embedded storage, or local persistent store.
Why embedded storage? It became popular in the flash memory/SSD era, when writing to local disk became much faster than the client-server model of going over the network to a storage system.
Sequential read/write: analogy of an airport conveyor belt for spinning disks, where latency comes from the rotation and from the seek time to reach the right sector of the data
Hadoop was about moving processing closer to the data; RocksDB is about moving the database closer to the processing.
Improved LevelDB: multithreaded writes and compaction, support for Bloom filters on scans while reading data, improved compaction logic similar to HBase
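To give a feel for why this family of stores suits sequential I/O, here is a toy log-structured merge sketch in plain Python. It is only an illustration under simplifying assumptions, not how RocksDB is implemented: writes go to an in-memory table, a full table is flushed as one sorted immutable run, reads check the newest data first, and compaction merges runs. `TinyLSM` is a hypothetical name; the write-ahead log, Bloom filters, and levels are left out.

```python
import bisect

class TinyLSM:
    """Toy LSM-style key-value store: memtable plus sorted immutable runs."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.runs = []  # immutable sorted runs of (key, value), newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the memtable out as one sorted run (one sequential write)."""
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):  # newest run wins
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        """Merge all runs into one, keeping only the newest value per key."""
        merged = {}
        for run in self.runs:  # oldest first, so later runs overwrite
            merged.update(run)
        self.runs = [sorted(merged.items())] if merged else []

db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)    # memtable full, flushed as run 1
db.put("a", 10)
db.put("c", 3)    # memtable full, flushed as run 2
runs_before = len(db.runs)
latest_a = db.get("a")
db.compact()
print(runs_before, latest_a, db.runs)
```

Nothing is ever updated in place: old values of `a` simply sit in an older run until compaction drops them, which is why compaction speed matters so much for these stores.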
RocksDB is present in almost every recent streaming system that needs to keep unlimited state without the penalty of a network call
Storm: currently does not use local storage like RocksDB. It still relies on remote stores like Redis, HBase, and Cassandra.
Samza: at LinkedIn, features like the personalized feed sent to your wall are decided by using Samza to join a lot of information with the available feed.
Kafka and Samza were written by the same group of people at LinkedIn, some of whom later went on to found a company called Confluent, where they wrote Kafka Streams. So you will find many similarities.
As I said in the beginning, understanding one system helps us understand the others.
Understanding RocksDB is one example. Incremental checkpointing, snapshotting, and asynchronous state management are other such concepts.
The technologies and implementations might differ, but in the end they are all trying to solve similar problems of the distributed world, so the same challenges, limitations, and expectations, like fault tolerance and exactly-once processing, will be there everywhere.
Please keep a close watch on Qubole Engineering.
We write a lot of interesting posts on big data in the cloud: Spark, the open-sourced SparkLens, tuning, Hive, Presto, AWS.