4. Checkpointing
• Metadata checkpointing
• Includes configuration, DStream operations and incomplete batches
• Used to recover from a failure of the node running the driver of the streaming application
• Data checkpointing
• Saves generated RDDs (Resilient Distributed Datasets) to reliable storage
5. Checkpointing Situations
• Usage of stateful transformations
• Stateful – uses data or intermediate results from previous batches to compute the results of the current batch
• Recovering from failures of the driver running the application
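The stateful idea above can be sketched in plain Python (illustrative code, not Spark's `updateStateByKey` API): state accumulated over earlier batches feeds the computation of the current batch, and that carried-over state is exactly what checkpointing must persist.

```python
# Plain-Python sketch of a stateful streaming computation
# (illustrative, not the Spark API): a running word count where
# each batch updates state carried over from earlier batches.

def update_state(state, batch):
    """Merge the counts of one batch into the accumulated state."""
    for word in batch:
        state[word] = state.get(word, 0) + 1
    return state

batches = [["spark", "kafka"], ["spark"], ["flume", "spark"]]

state = {}                       # would be restored from a checkpoint after failure
for batch in batches:
    state = update_state(state, batch)

print(state)  # {'spark': 3, 'kafka': 1, 'flume': 1}
```

If the driver fails mid-stream, recomputing the current batch alone is not enough; the accumulated `state` must come from a checkpoint.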
13. DAG
• Directed Acyclic Graph
• Vertices represent the RDDs and the edges represent the operations to be applied to the RDDs
• Advantages
• The lost RDD can be recovered
• To achieve fault tolerance
Source - https://data-flair.training/blogs/dag-in-apache-spark/
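The recovery advantage can be sketched in plain Python (the `Dataset` class is illustrative, not Spark's RDD): if each dataset records its parent and the operation that produced it, a lost dataset can be recomputed by replaying that edge of the DAG instead of being restored from a backup.

```python
# Minimal lineage sketch (illustrative, not Spark's RDD class):
# each dataset remembers its parent and the operation that made it,
# so lost data can be recomputed by replaying the DAG edge.

class Dataset:
    def __init__(self, data=None, parent=None, op=None):
        self.data = data          # None simulates a lost dataset
        self.parent = parent      # vertex the edge comes from
        self.op = op              # operation labelling the edge

    def map(self, fn):
        return Dataset(data=[fn(x) for x in self.data], parent=self, op=fn)

    def recover(self):
        """Recompute this dataset from its parent via the recorded op."""
        self.data = [self.op(x) for x in self.parent.data]
        return self.data

base = Dataset(data=[1, 2, 3])
derived = base.map(lambda x: x * 10)

derived.data = None               # simulate losing the derived RDD
print(derived.recover())          # [10, 20, 30]
```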
14. WAL (Write-Ahead Logs)
• Achieves zero data loss
• Saves all data received by the receivers to log files
• If the process fails, the data can be restored from the log files
• Enable WAL
• Set spark.streaming.receiver.writeAheadLog.enable to true
• If WAL is enabled, in-memory replication of the received data is not necessary
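The receiver-side mechanism can be sketched in plain Python (an illustrative append-only log, not Spark's implementation): every record is forced to durable storage before it counts as received, so after a crash the records can be replayed from the log.

```python
import os
import tempfile

# Illustrative write-ahead log (not Spark's implementation):
# append every received record to a durable file *before* processing it,
# so records survive a crash and can be replayed on restart.

log_path = os.path.join(tempfile.mkdtemp(), "receiver.wal")

def receive(record):
    with open(log_path, "a") as log:
        log.write(record + "\n")   # persist first ...
        log.flush()
        os.fsync(log.fileno())     # ... and force it to disk

for rec in ["event-1", "event-2", "event-3"]:
    receive(rec)

# After a simulated crash, restore the received records from the log.
with open(log_path) as log:
    recovered = [line.strip() for line in log]

print(recovered)  # ['event-1', 'event-2', 'event-3']
```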
21. Fault tolerance in Structured streaming
Built on the Spark SQL engine
• Expresses streaming computations the same way as batch computations
• Dataset/DataFrame API in Scala, Python, Java and R
Incrementally and continuously updates the final result
• Handles event time and late data
• Delivers end-to-end exactly-once fault-tolerance semantics
Supports stateless and stateful operations
22. Structured streaming
Checkpointing
• In case of a failure or intentional shutdown, recovers the previous progress and state of the query
• Continues where it left off
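The recovery behaviour can be sketched in plain Python (an illustrative offset checkpoint, not Structured Streaming's implementation): progress is persisted after each processed event, so a restarted query resumes from the saved offset rather than starting over.

```python
import json
import os
import tempfile

# Illustrative sketch of checkpoint-based recovery (not Structured
# Streaming's implementation): the last processed offset is persisted,
# so a restarted query continues where it left off.

chk = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
stream = ["e0", "e1", "e2", "e3", "e4"]

def load_offset():
    if os.path.exists(chk):
        with open(chk) as f:
            return json.load(f)["offset"]
    return 0

def run(n_events):
    """Process up to n_events events, checkpointing progress after each."""
    processed = []
    offset = load_offset()
    for event in stream[offset:offset + n_events]:
        processed.append(event)
        offset += 1
        with open(chk, "w") as f:
            json.dump({"offset": offset}, f)   # persist progress
    return processed

first_run = run(2)     # processes e0, e1, then "crashes"
second_run = run(3)    # restart: resumes from the checkpoint
print(first_run, second_run)  # ['e0', 'e1'] ['e2', 'e3', 'e4']
```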
25. Spark running on YARN
• YARN configuration to restart the Spark application if it fails
• spark.yarn.maxAppAttempts = attempts_no (default: 2)
• Interval after which the restart attempt counter is reset
• spark.yarn.am.attemptFailuresValidityInterval = 1h
• Executor failures – maximum number of executor failures before the application fails
• spark.yarn.max.executor.failures = {attempts * num_executors}
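Put together, these settings would appear in spark-defaults.conf (or as --conf flags) roughly like this; the concrete values are illustrative:

```properties
# Retry the whole application up to 4 times ...
spark.yarn.maxAppAttempts                      4
# ... but reset the attempt counter after 1h without a failure
spark.yarn.am.attemptFailuresValidityInterval  1h
# Maximum executor failures before the application fails
spark.yarn.max.executor.failures               8
```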
26. Fault tolerance in Kafka
• High Throughput
• Low latency
• Scalable
• Centralized
• Real time
27. Replication in Kafka
• For each topic Kafka cluster maintains a partitioned log
• Each partition is replicated across several servers
• One replica is designated as the leader
• Followers replicate the leader and take over if the leader dies
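The leader/follower scheme can be sketched in plain Python (illustrative classes, not Kafka's replication protocol): writes go to the leader, followers copy its log, and on leader failure a fully caught-up follower takes over.

```python
# Illustrative sketch of leader-based replication (not Kafka's protocol):
# the leader appends writes to its log, followers copy that log, and a
# follower takes over as leader if the current leader dies.

class Replica:
    def __init__(self, name):
        self.name = name
        self.log = []

leader = Replica("broker-1")
followers = [Replica("broker-2"), Replica("broker-3")]

def produce(record):
    leader.log.append(record)          # write goes to the leader
    for f in followers:                # followers replicate the leader's log
        f.log.append(record)

for msg in ["m1", "m2", "m3"]:
    produce(msg)

# Leader dies: promote a fully caught-up follower.
leader = followers.pop(0)
print(leader.name, leader.log)  # broker-2 ['m1', 'm2', 'm3']
```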
29. Fault-tolerance in Flume
• Handling agent failures
• Durable channels
• File channel – persists all events stored in it to disk, so even if the JVM is killed or the operating system crashes or reboots, events not yet delivered to the next agent in the pipeline are still there when the agent restarts
• Replicate flows across redundant topologies
30. Fault-tolerance in HDFS
• Achieves fault tolerance through replication
• First, a file is divided into blocks
• The blocks are distributed across different machines in the HDFS cluster
• If any machine in the cluster goes down or fails, the data can still be accessed from other machines
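The replication idea can be sketched in plain Python (machine names, block size, and the replication factor of 3 are illustrative, not the HDFS API): each block lives on several machines, so a failed machine's blocks can still be read from another replica.

```python
# Illustrative sketch of HDFS-style block replication (not the HDFS API):
# a file is split into blocks and each block is stored on several machines,
# so the data survives the failure of any single machine.

REPLICATION = 3
machines = ["node-1", "node-2", "node-3", "node-4"]
storage = {m: {} for m in machines}          # machine -> {block_id: block}

def write_file(data, block_size=4):
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    for block_id, block in enumerate(blocks):
        for k in range(REPLICATION):         # place each block on 3 machines
            m = machines[(block_id + k) % len(machines)]
            storage[m][block_id] = block
    return len(blocks)

def read_file(n_blocks, failed=()):
    data = ""
    for block_id in range(n_blocks):
        # read each block from any machine that is still up and holds it
        m = next(m for m in machines if m not in failed and block_id in storage[m])
        data += storage[m][block_id]
    return data

n = write_file("hello distributed world!")
print(read_file(n, failed={"node-1"}))  # hello distributed world!
```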
33. Lambda Architecture Cont....
1. All data entering the system is dispatched to both the batch layer
and the speed layer
2. The batch layer has two functions
I. Managing the master dataset
II. Pre-computing the batch views
3. The serving layer indexes the batch views
4. The speed layer accommodates all requests that are subject to low-latency requirements and deals with recent data only
5. Queries are answered by merging results from the batch views and the real-time views
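Step 5 can be sketched in plain Python (the word-count-style views and metric names are illustrative): a query merges the pre-computed batch view with the real-time view that covers data arriving since the last batch run.

```python
# Illustrative sketch of step 5: answer a query by merging the
# pre-computed batch view with the real-time (speed-layer) view.

batch_view = {"clicks": 1000, "views": 5000}   # pre-computed by the batch layer
realtime_view = {"clicks": 7, "signups": 2}    # recent data, speed layer only

def query(metric):
    return batch_view.get(metric, 0) + realtime_view.get(metric, 0)

print(query("clicks"))   # 1007
print(query("signups"))  # 2
```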