1. Interactive Data Analysis in
Spark Streaming
Ways to interact with Spark Streaming applications
https://github.com/Shasidhar/dynamic-streaming
2. ● Shashidhar E S
● Big data consultant and trainer at
datamantra.io
● www.shashidhare.com
3. Agenda
● Big data applications
● Streaming applications
● Categories in streaming applications
● Interactive streaming applications
● Different strategies
● Zookeeper
● Apache curator
● Curator cache types
● Spark streaming context dynamic switch
4. Big data applications
● Typically applications in big data are divided depending on their work
loads
● Major divisions are
○ Batch applications
○ Streaming applications
● Most of the existing systems support both of these applications
● But there is a new category of applications are in raise, they are
known as interactive applications
5. Big data interactive applications
● Ability to manipulate data in interactive way
● Exploratory in nature
● Combines batch and streaming data
● For development
○ Zeppelin, Jupiter Notebook
● For production
○ Batch - Datameer, Tellius, Zoomdata
○ Streaming - Stratio Engine, WSO2
6. Streaming Engines
● Ability to process data in real time
● Streaming process includes
○ Collecting data
○ Processing data
● Types of streaming engines
○ Real time
○ Near real time - (Micro Batch)
● Spark allows near real time streaming processing
7. Streaming application types/categories
● Streaming ETL processes
● Decision engines
○ Rule based
○ Machine Learning based
■ Online learning
● Real time Dashboards
● Root cause analysis engines
○ Multiple Streams
○ Handling event times
8. Streaming applications in real world
● Static
○ Data scientist defines rules
○ Admin sets up dashboard
○ Not able to modify the behaviour of streaming application
● Dynamic
○ User can add/delete/modify rules , controls the decision
○ User can see some charts/ design charts
○ Ability to modify the behaviour of streaming application
9. Generic Interactive Application Architecture
Spark
Streaming
Streaming
data source
Streaming
Config source
Downstream
applications
How do we make the
configuration dynamic?
10. Spark streaming introduction
Spark Streaming is an extension of the core Spark API that enables scalable,
high-throughput, fault-tolerant stream processing of live data streams
11. Micro batch
● Spark streaming is a fast batch processing system
● Spark streaming collects stream data into small batch
and runs batch processing on it
● Batch can be as small as 1s to as big as multiple hours
● Spark job creation and execution overhead is so low it
can do all that under a sec
● These batches are called as DStreams
12. Spark Streaming application
Define Input streams
Define data
Processing
Define data sync
Micro
Batch
Start Streaming Context
● Options to change behaviour
○ Restart context
○ Without restarting context
■ Control configuration
data
Create Streaming Context
14. Using Kafka as configuration source
Spark
Streaming
Streaming
data source
Config source
(Kafka)
Downstream
applications
15. Using Kafka as Configuration Source
● Easy to adapt as Kafka is the defacto streaming store
● Streaming configuration Source
○ New stream to track the configuration changes
● Spark Streaming
○ Maintain configuration as state in memory and apply
○ State needs to be checkpointed
○ Failure recovery strategies need to be taken care of
● Drawbacks
○ Hard to handle deletes/updates in state
○ Tricky to handle state if configurations are complex
16. Using Database as configuration source
Spark
Streaming
Streaming
data source
Distributed
Database
Downstream
applications
Workers
17. Interactive streaming application Strategies contd.
● Easy to start with databases, as people are familiar with it
● Configuration Source
○ Distributed data source
● Spark Streaming
○ Read configuration from database and apply - Polling
○ Database need to be consistent and fast
○ Configurations can be kept in cache to avoid latencies
● Drawbacks
○ Achieving distributed cache consistency is tricky
○ May be an extra component if you have it only for this purpose
18. Using Zookeeper as configuration source
Spark
Streaming
Streaming
data source
Zookeeper
Downstream
applications
19. Interactive streaming application Strategies contd.
● It is readily available if Kafka is used in a system, no extra burden
● Configuration Source - Zookeeper
● Spark streaming
○ Ability to track the configuration change and take action - Async
Callbacks
○ Suitable to store any type of configuration
○ Allows to adapt listeners for configuration changes
○ Ensures cache consistency by default
● Drawbacks
○ Streaming context restart is not suitable for all systems
20. Apache Zookeeper
“Zookeeper allows distributed processes to coordinate with each other
through a shared hierarchical namespace of data registers”
● Distributed Coordination service
● Hierarchical file system
● Data is stored in ZNode
● Can be thought as a “distributed in-memory file system” with some
limitations like size of data, optimized of high reads and low writes
22. Zookeeper data model
● Follows hierarchical namespace
● Each node is called as ZNode
○ Data saved as bytes
○ Can have children
○ Only accessible through absolute paths
○ Data size limited to 1MB
● Follows global ordering
23. ZNode
● Types
○ Persistent Nodes
■ Exists till explicitly deleted
■ Can have children
○ Ephemeral nodes
■ Exists as long as session is active
■ Cannot have children
● Data can be secured at ZNode level with ACL
24. Data consistency
● Reads are served from local servers
● Writes are synchronised through leader
● Ensures Sequential Consistency
● Data is either read completely or fails
● All clients gets the same result irrespective of the server it is
connected
● Updates are persisted, unless overridden by any client
25. Zookeeper Watches
● Available watches in Zookeeper
○ Node Children Changed
○ Node Changed
○ Node Data Changed
○ Node Deleted
● Watchers are one time triggers
● Event is always received first rather than data
● Client can re register for watch if needed again
27. ZK API issues
● Making client code thread safe is tricky
● Hard for programmers
● Exception handling is bad
● Similar to MapReduce API
Solution is “Apache Curator”
28. Apache Curator
● A Zookeeper Keeper
● Main components
○ Client - Wrapper for ZK class, manages Zookeeper connection
○ Framework - High level API that encloses all ZK related
operations, handles all types of retries.
○ Recipes - Implementation of common Zookeeper “recipes” built of
top of curator framework
● User friendly API
30. Apache Curator caches
Three types of caches
● Node Cache
○ Monitors single node
● Path Cache
○ Monitors a ZNode and children
● Tree Cache
○ Monitors a ZK Path by caching data locally
32. Path Cache
● Monitor a ZNode
● Using archaius - a dynamic property library
● Use ConfigurationSource from archaius to track changes
● Pair Configuration source with UpdateListener
● See in action
Watched
DataSource
Update Listener
34. Spark streaming dynamic restart
● Use the same WatchedSource to track any changes in configuration
● Track changes on zookeeper with patch cache
● Control Streaming context restart on ZK data change
Watched
DataSource
Update Listener
(Restart context)
35. Hands on - Streaming Restart
Git branch : streaming-listener
36. Ways to control data loss
● Enable checkpointing
● Track Kafka topic offsets manually
● Better to use Direct kafka input streams
● Use Kafka monitoring tools to see the status of data processing
● Always create spark streaming context from checkpoint directory
Next Steps
● Try to add some meaningful configurations
● Implement the same idea with Akka actors