4. Checkpointing
• Metadata checkpointing
• Includes configuration, DStream operations and incomplete batches
• Used to recover from a failure of the node running the driver of the streaming application
• Data checkpointing
• Saves generated RDDs (Resilient Distributed Datasets) to reliable storage
5. Checkpointing Situations
• Usage of stateful transformations
• Stateful – uses data or intermediate results from previous batches to compute the results of the current batch
• Recovering from failures of the driver running the application
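The stateful idea above can be sketched in plain Python (illustrative code, not Spark's `updateStateByKey` API): state accumulated over earlier batches feeds the computation of the current batch, and that carried-over state is exactly what checkpointing must persist.

```python
# Plain-Python sketch of a stateful streaming computation
# (illustrative, not the Spark API): a running word count where
# each batch updates state carried over from earlier batches.

def update_state(state, batch):
    """Merge the counts of one batch into the accumulated state."""
    for word in batch:
        state[word] = state.get(word, 0) + 1
    return state

batches = [["spark", "kafka"], ["spark"], ["flume", "spark"]]

state = {}                       # would be restored from a checkpoint after failure
for batch in batches:
    state = update_state(state, batch)

print(state)  # {'spark': 3, 'kafka': 1, 'flume': 1}
```

If the driver fails mid-stream, recomputing the current batch alone is not enough; the accumulated `state` must come from a checkpoint.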
13. DAG
• Directed Acyclic Graph
• Vertices represent the RDDs and the edges represent the operations to be applied to the RDDs
• Advantages
• The lost RDD can be recovered
• To achieve fault tolerance
Source - https://data-flair.training/blogs/dag-in-apache-spark/
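The recovery advantage can be sketched in plain Python (the `Dataset` class is illustrative, not Spark's RDD): if each dataset records its parent and the operation that produced it, a lost dataset can be recomputed by replaying that edge of the DAG instead of being restored from a backup.

```python
# Minimal lineage sketch (illustrative, not Spark's RDD class):
# each dataset remembers its parent and the operation that made it,
# so lost data can be recomputed by replaying the DAG edge.

class Dataset:
    def __init__(self, data=None, parent=None, op=None):
        self.data = data          # None simulates a lost dataset
        self.parent = parent      # vertex the edge comes from
        self.op = op              # operation labelling the edge

    def map(self, fn):
        return Dataset(data=[fn(x) for x in self.data], parent=self, op=fn)

    def recover(self):
        """Recompute this dataset from its parent via the recorded op."""
        self.data = [self.op(x) for x in self.parent.data]
        return self.data

base = Dataset(data=[1, 2, 3])
derived = base.map(lambda x: x * 10)

derived.data = None               # simulate losing the derived RDD
print(derived.recover())          # [10, 20, 30]
```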
14. WAL (Write-Ahead Logs)
• Achieves zero data loss
• Saves all data received by the receivers to log files
• If the process fails, the data can be restored from the log files
• Enable WAL
• Set spark.streaming.receiver.writeAheadLog.enable to true
• If WAL is enabled, in-memory replication of the received data is not necessary
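The receiver-side mechanism can be sketched in plain Python (an illustrative append-only log, not Spark's implementation): every record is forced to durable storage before it counts as received, so after a crash the records can be replayed from the log.

```python
import os
import tempfile

# Illustrative write-ahead log (not Spark's implementation):
# append every received record to a durable file *before* processing it,
# so records survive a crash and can be replayed on restart.

log_path = os.path.join(tempfile.mkdtemp(), "receiver.wal")

def receive(record):
    with open(log_path, "a") as log:
        log.write(record + "\n")   # persist first ...
        log.flush()
        os.fsync(log.fileno())     # ... and force it to disk

for rec in ["event-1", "event-2", "event-3"]:
    receive(rec)

# After a simulated crash, restore the received records from the log.
with open(log_path) as log:
    recovered = [line.strip() for line in log]

print(recovered)  # ['event-1', 'event-2', 'event-3']
```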
21. Fault tolerance in Structured streaming
Built on the Spark SQL engine
• Expresses streaming computations the same way as batch computations
• Dataset/DataFrame API in Scala, Python, Java and R
Incrementally and continuously updates the final result
• Handles event time and late data
• Delivers end-to-end exactly-once fault-tolerance semantics
Supports stateless and stateful operations
22. Structured streaming
Checkpointing
• In case of a failure or intentional shutdown, recovers the previous progress and state of the query
• Continues where it left off
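The recovery behaviour can be sketched in plain Python (an illustrative offset checkpoint, not Structured Streaming's implementation): progress is persisted after each processed event, so a restarted query resumes from the saved offset rather than starting over.

```python
import json
import os
import tempfile

# Illustrative sketch of checkpoint-based recovery (not Structured
# Streaming's implementation): the last processed offset is persisted,
# so a restarted query continues where it left off.

chk = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
stream = ["e0", "e1", "e2", "e3", "e4"]

def load_offset():
    if os.path.exists(chk):
        with open(chk) as f:
            return json.load(f)["offset"]
    return 0

def run(n_events):
    """Process up to n_events events, checkpointing progress after each."""
    processed = []
    offset = load_offset()
    for event in stream[offset:offset + n_events]:
        processed.append(event)
        offset += 1
        with open(chk, "w") as f:
            json.dump({"offset": offset}, f)   # persist progress
    return processed

first_run = run(2)     # processes e0, e1, then "crashes"
second_run = run(3)    # restart: resumes from the checkpoint
print(first_run, second_run)  # ['e0', 'e1'] ['e2', 'e3', 'e4']
```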
25. Spark running on YARN
• YARN configuration to restart the Spark application if it fails
• spark.yarn.maxAppAttempts = attempts_no (default: 2)
• Interval after which the restart attempt counter is reset
• spark.yarn.am.attemptFailuresValidityInterval = 1h
• Executor failures – maximum number of executor failures before the application fails
• spark.yarn.max.executor.failures = {attempts * num_executors}
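Put together, these settings would appear in spark-defaults.conf (or as --conf flags) roughly like this; the concrete values are illustrative:

```properties
# Retry the whole application up to 4 times ...
spark.yarn.maxAppAttempts                      4
# ... but reset the attempt counter after 1h without a failure
spark.yarn.am.attemptFailuresValidityInterval  1h
# Maximum executor failures before the application fails
spark.yarn.max.executor.failures               8
```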
26. Fault tolerance in Kafka
• High Throughput
• Low latency
• Scalable
• Centralized
• Real time
27. Replication in Kafka
• For each topic Kafka cluster maintains a partitioned log
• Each partition is replicated across several servers
• One replica is designated as the leader
• Followers replicate the leader and take over if the leader dies
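The leader/follower scheme can be sketched in plain Python (illustrative classes, not Kafka's replication protocol): writes go to the leader, followers copy its log, and on leader failure a fully caught-up follower takes over.

```python
# Illustrative sketch of leader-based replication (not Kafka's protocol):
# the leader appends writes to its log, followers copy that log, and a
# follower takes over as leader if the current leader dies.

class Replica:
    def __init__(self, name):
        self.name = name
        self.log = []

leader = Replica("broker-1")
followers = [Replica("broker-2"), Replica("broker-3")]

def produce(record):
    leader.log.append(record)          # write goes to the leader
    for f in followers:                # followers replicate the leader's log
        f.log.append(record)

for msg in ["m1", "m2", "m3"]:
    produce(msg)

# Leader dies: promote a fully caught-up follower.
leader = followers.pop(0)
print(leader.name, leader.log)  # broker-2 ['m1', 'm2', 'm3']
```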
29. Fault-tolerance in Flume
• Handling agent failures
• Durable channels
• File channel – persists all events stored in it to disk, so even if the JVM is killed or the operating system crashes or reboots, events not yet delivered to the next agent in the pipeline are still there when the agent restarts
• Replicate flows across redundant topologies
30. Fault-tolerance in HDFS
• Achieves fault tolerance through replication
• First, a file is divided into blocks
• The blocks are distributed across different machines in the HDFS cluster
• If any machine in the cluster goes down or fails, the data can still be accessed from other machines
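The replication idea can be sketched in plain Python (machine names, block size, and the replication factor of 3 are illustrative, not the HDFS API): each block lives on several machines, so a failed machine's blocks can still be read from another replica.

```python
# Illustrative sketch of HDFS-style block replication (not the HDFS API):
# a file is split into blocks and each block is stored on several machines,
# so the data survives the failure of any single machine.

REPLICATION = 3
machines = ["node-1", "node-2", "node-3", "node-4"]
storage = {m: {} for m in machines}          # machine -> {block_id: block}

def write_file(data, block_size=4):
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    for block_id, block in enumerate(blocks):
        for k in range(REPLICATION):         # place each block on 3 machines
            m = machines[(block_id + k) % len(machines)]
            storage[m][block_id] = block
    return len(blocks)

def read_file(n_blocks, failed=()):
    data = ""
    for block_id in range(n_blocks):
        # read each block from any machine that is still up and holds it
        m = next(m for m in machines if m not in failed and block_id in storage[m])
        data += storage[m][block_id]
    return data

n = write_file("hello distributed world!")
print(read_file(n, failed={"node-1"}))  # hello distributed world!
```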
33. Lambda Architecture Cont....
1. All data entering the system is dispatched to both the batch layer
and the speed layer
2. The batch layer has two functions
I. Managing the master dataset
II. Pre-computing the batch views
3. The serving layer indexes the batch views
4. The speed layer accommodates all requests that are subject to low-latency requirements and deals with recent data only
5. Queries are answered by merging results from the batch views and the real-time views
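Step 5 can be sketched in plain Python (the word-count-style views and metric names are illustrative): a query merges the pre-computed batch view with the real-time view that covers data arriving since the last batch run.

```python
# Illustrative sketch of step 5: answer a query by merging the
# pre-computed batch view with the real-time (speed-layer) view.

batch_view = {"clicks": 1000, "views": 5000}   # pre-computed by the batch layer
realtime_view = {"clicks": 7, "signups": 2}    # recent data, speed layer only

def query(metric):
    return batch_view.get(metric, 0) + realtime_view.get(metric, 0)

print(query("clicks"))   # 1007
print(query("signups"))  # 2
```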