2. Apache Spark
• Apache Spark is a lightning-fast cluster computing technology, designed for fast
computation
• It was built on top of Hadoop MapReduce and extends the MapReduce
model to efficiently support more types of computation, including
interactive queries and stream processing
• Spark is not a modified version of Hadoop and does not really depend
on Hadoop, because it has its own cluster management
• Spark can use Hadoop in two ways – one is storage and the second is
processing. Since Spark has its own cluster-management
computation, it uses Hadoop for storage purposes only
3. Apache Spark
• The main feature of Spark is its in-memory cluster computing that
increases the processing speed of an application
• Spark is designed to cover a wide range of workloads such as batch
applications, iterative algorithms, interactive queries and streaming
• Apart from supporting all these workloads in a single system, it
reduces the management burden of maintaining separate tools
4. Features of Apache Spark
• Speed − Spark helps to run an application in a Hadoop cluster up to
100 times faster in memory, and 10 times faster when running on
disk. This is possible by reducing the number of read/write operations
to disk: it stores the intermediate processing data in memory
• Supports multiple languages − Spark provides built-in APIs in Java,
Scala, and Python. Therefore, you can write applications in different
languages
• Advanced Analytics − Spark supports not only ‘map’ and ‘reduce’ but
also SQL queries, streaming data, machine learning (ML), and graph
algorithms
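The ‘map’ and ‘reduce’ style of computation mentioned above can be sketched in plain Python (this is an illustrative sketch, not actual Spark code; the names and data are hypothetical):

```python
# Plain-Python sketch of the classic map/reduce word count; in Spark the
# same shape appears as rdd.flatMap(...).map(...).reduceByKey(...).
lines = ["spark is fast", "spark is general"]

# "flatMap": split every line into words
words = [w for line in lines for w in line.split()]

# "map": pair each word with a count of 1
pairs = [(w, 1) for w in words]

# "reduceByKey": sum the counts per word
def reduce_by_key(pairs):
    counts = {}
    for key, value in pairs:
        counts[key] = counts.get(key, 0) + value
    return counts

word_counts = reduce_by_key(pairs)
```

The same pipeline is what Spark distributes across a cluster, keeping the intermediate pairs in memory rather than writing them to disk between steps.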
5. Components of Spark
• The following illustration depicts the different components of Spark
Apache Spark Core
• Spark Core is the underlying general execution engine for the Spark platform that all
other functionality is built upon. It provides in-memory computing and the ability to
reference datasets in external storage systems
6. Components of Spark
Spark SQL
• Spark SQL is a component on top of Spark Core that introduces a new data abstraction called
SchemaRDD, which provides support for structured and semi-structured data
Spark Streaming
• Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics.
It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets) transformations
on those mini-batches of data
MLlib (Machine Learning Library)
• MLlib is a distributed machine-learning framework on top of Spark, made possible by the
distributed memory-based Spark architecture. Spark MLlib is about nine times as fast as the
Hadoop disk-based version of Apache Mahout
GraphX
• GraphX is a distributed graph-processing framework on top of Spark. It provides an API for
expressing graph computation that can model user-defined graphs using the Pregel
abstraction API
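The vertex-centric Pregel model that GraphX builds on can be sketched in plain Python (an illustrative sketch, not the GraphX API; the graph, `neighbours`, and `labels` names are hypothetical). Each vertex repeatedly receives messages from its neighbours, updates its value (here, the smallest vertex id it has seen), and propagates it onward, which computes connected components:

```python
# Pregel-style sketch: min-label propagation for connected components.
edges = [(1, 2), (2, 3), (4, 5)]           # two connected components
neighbours = {}
for a, b in edges:
    neighbours.setdefault(a, []).append(b)
    neighbours.setdefault(b, []).append(a)

labels = {v: v for v in neighbours}        # initial vertex values

changed = True
while changed:                             # one loop = one "superstep"
    changed = False
    for v in neighbours:
        incoming = [labels[n] for n in neighbours[v]]   # "messages"
        best = min([labels[v]] + incoming)
        if best < labels[v]:               # vertex program: keep the minimum
            labels[v] = best
            changed = True
```

Vertices 1–3 converge to label 1 and vertices 4–5 to label 4, identifying the two components.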
7. Spark Architecture
Spark Architecture includes the following three main components:
• Data Storage
• API
• Resource Management
Data Storage:
• Spark uses the HDFS file system for data storage purposes. It works
with any Hadoop-compatible data source, including HDFS, HBase,
Cassandra, etc.
8. Spark Architecture
API:
• The API enables application developers to create Spark-based
applications using a standard API interface. Spark provides APIs for
the Scala, Java, and Python programming languages
Resource Management:
• Spark can be deployed as a stand-alone server, or it can run on a
distributed computing framework like Mesos or YARN
9. Resilient Distributed Datasets
• Resilient Distributed Datasets (RDDs) are the core concept in the Spark framework
• Spark stores data in RDDs on different partitions
• They help with rearranging the computations and optimizing the data
processing
• They are also fault tolerant, because an RDD knows how to recreate
and recompute its datasets
• RDDs are immutable. You can modify an RDD with a transformation,
but the transformation returns you a new RDD whereas the original
RDD remains the same
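The immutability property can be sketched in plain Python (illustrative only; `FakeRDD` is a hypothetical stand-in, not Spark code): a transformation builds and returns a new dataset, leaving the original untouched.

```python
# Sketch of RDD immutability: map() returns a NEW dataset.
class FakeRDD:
    def __init__(self, data):
        self._data = tuple(data)      # stored immutably
    def map(self, fn):
        # transformation: produce a new FakeRDD, never mutate this one
        return FakeRDD(fn(x) for x in self._data)
    def collect(self):
        return list(self._data)

original = FakeRDD([1, 2, 3])
doubled = original.map(lambda x: x * 2)
# `original` still holds [1, 2, 3]; `doubled` is a separate dataset
```

This is what makes recomputation safe: since an RDD never changes, replaying its transformations always reproduces the same data.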
10. Resilient Distributed Datasets
• It provides an API for various transformations and materializations of
data, as well as for control over caching and partitioning of elements
to optimize data placement
• An RDD can be created either from external storage or from another RDD,
and it stores information about its parents so that it can recompute a
partition in case of failure
11. Resilient Distributed Datasets
RDD supports two types of operations:
• Transformation: Transformations don't return a single value; they return a
new RDD. Nothing gets evaluated when you call a transformation function;
it just takes an RDD and returns a new RDD
• Some of the Transformation functions are map, filter, flatMap, groupByKey,
reduceByKey, aggregateByKey, pipe, and coalesce
• Action: An action operation evaluates and returns a new value. When an
action function is called on an RDD object, all the data processing queries
are computed at that time and the result value is returned
• Some of the Action operations are reduce, collect, count, first, take,
countByKey, and foreach
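The lazy-transformation / eager-action split above can be sketched in plain Python (illustrative only; `LazyRDD` is a hypothetical stand-in, not Spark's implementation): transformations merely record what to do, and only an action runs the pipeline.

```python
# Sketch of lazy transformations vs. eager actions.
class LazyRDD:
    def __init__(self, data, ops=()):
        self._data = data
        self._ops = ops               # recorded operations, not yet run
    def map(self, fn):
        # transformation: remember the function, return a new LazyRDD
        return LazyRDD(self._data, self._ops + (("map", fn),))
    def filter(self, pred):
        return LazyRDD(self._data, self._ops + (("filter", pred),))
    def collect(self):
        # action: only now is the recorded pipeline actually evaluated
        result = self._data
        for kind, fn in self._ops:
            if kind == "map":
                result = [fn(x) for x in result]
            else:
                result = [x for x in result if fn(x)]
        return result

rdd = LazyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# nothing has been computed yet; the action below triggers evaluation
squares = rdd.collect()
```

Deferring evaluation like this is what lets Spark inspect the whole pipeline and optimize it before any data moves.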
12. RDD Persistence
• One of the most important capabilities in Spark is persisting (or
caching) a dataset in memory across operations
• When you persist an RDD, each node stores any partitions of it that it
computes in memory and reuses them in other actions on that
dataset. This allows future actions to be much faster
• Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will
automatically be recomputed using the transformations that
originally created it
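The recompute-on-loss behaviour can be sketched in plain Python (illustrative only; `CachedPartition` is a hypothetical stand-in): a cached partition that is lost is transparently rebuilt from the transformation that created it.

```python
# Sketch of fault-tolerant caching: a lost partition is recomputed
# from the function that originally produced it.
class CachedPartition:
    def __init__(self, compute):
        self._compute = compute       # how to (re)build the data
        self._cache = None
    def get(self):
        if self._cache is None:       # first access, or partition was lost
            self._cache = self._compute()
        return self._cache
    def evict(self):
        # simulate losing the in-memory copy (e.g. an executor failure)
        self._cache = None

part = CachedPartition(lambda: [x * 2 for x in range(5)])
first = part.get()        # computed and cached
part.evict()              # "lost" partition
second = part.get()       # transparently recomputed, same result
```

Because RDDs are immutable, the recomputed partition is guaranteed to equal the lost one, so caching never risks correctness.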
14. Components
• Spark applications run as independent sets of processes on a cluster,
coordinated by the SparkContext object in the main program (called the
driver program)
• The SparkContext can connect to several types of cluster managers (either
Spark’s own standalone cluster manager, Mesos or YARN), which allocate
resources across applications
• Spark acquires executors on nodes in the cluster, which are processes that
run computations and store data for the application
• Next, it sends application code (defined by JAR or Python files passed to
SparkContext) to the executors
• Finally, SparkContext sends tasks to the executors to run
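The driver/executor flow above can be sketched in plain Python (illustrative only; the partitioning and `task` function are hypothetical): the "driver" splits work into one task per partition, ships the tasks to "executors" (here, threads), and collects the results.

```python
# Sketch of the driver coordinating executors.
from concurrent.futures import ThreadPoolExecutor

def task(partition):
    # the "application code" each executor runs on its partition
    return sum(partition)

data = list(range(100))
partitions = [data[i:i + 25] for i in range(0, 100, 25)]   # 4 partitions

# the "driver" acquires executors and sends one task per partition
with ThreadPoolExecutor(max_workers=4) as executors:
    partial_sums = list(executors.map(task, partitions))

total = sum(partial_sums)   # the driver combines the results
```

In real Spark the executors are separate JVM processes on cluster nodes rather than threads, but the shape of the interaction is the same.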
15. Components
There are several useful things to note about this architecture:
• Each application gets its own executor processes, which stay up for
the duration of the whole application and run tasks in multiple
threads
• The driver program must listen for and accept incoming connections
from its executors throughout its lifetime. As such, the driver program
must be network addressable from the worker nodes
• Because the driver schedules tasks on the cluster, it should be run
close to the worker nodes, preferably on the same local area network
16. Spark Streaming
• Spark Streaming is an extension of the core Spark API that enables scalable,
high-throughput, fault-tolerant stream processing of live data streams
• Data can be ingested from many sources like Kafka, Flume, Kinesis, or TCP
sockets, and can be processed using complex algorithms expressed with
high-level functions like map, reduce, join and window
• Finally, processed data can be pushed out to filesystems
17. Spark Streaming
• Spark Streaming works by dividing the live stream of data into
batches (called micro-batches) of a pre-defined interval (N seconds)
and then treating each batch of data as RDDs
• It's important to decide the time interval for Spark Streaming, based
on your use case and data processing requirements
• If the value of N is too low, then the micro-batches will not have
enough data to give meaningful results during the analysis
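The batching by interval can be sketched in plain Python (illustrative only; the `micro_batches` helper and the event data are hypothetical, not Spark Streaming API): events carry a timestamp and fall into consecutive N-second windows.

```python
# Sketch of micro-batching: group timestamped events into N-second windows.
def micro_batches(events, interval):
    """Group (timestamp, value) events into consecutive N-second batches."""
    batches = {}
    for ts, value in events:
        batch_id = int(ts // interval)    # which window this event falls in
        batches.setdefault(batch_id, []).append(value)
    return [batches[k] for k in sorted(batches)]

events = [(0.5, "a"), (1.2, "b"), (2.7, "c"), (3.1, "d"), (5.9, "e")]
batches = micro_batches(events, interval=2)   # N = 2 seconds
```

With N = 2 the five events form three batches; a smaller N would produce more, emptier batches, which is exactly the trade-off the interval choice controls.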
19. Spark Streaming
• Spark Streaming receives live input data streams and divides the data
into batches, which are then processed by the Spark engine to
generate the final stream of results in batches
• Spark Streaming provides a high-level abstraction called discretized
stream or DStream, which represents a continuous stream of data.
Internally, a DStream is represented as a sequence of RDDs
20. Discretized Streams (DStreams)
• It represents a continuous stream of data, either the input data
stream received from a source or the processed data stream generated
by transforming the input stream
• Internally, a DStream is represented by a continuous series of RDDs,
which is Spark’s abstraction of an immutable, distributed dataset
• Each RDD in a DStream contains data from a certain interval
20
21. Spark runtime components
Figure 1: Spark runtime components in cluster deploy mode. Elements of a Spark application are in blue
boxes and an application’s tasks running inside task slots are labeled with a “T”. Unoccupied task slots
are in white boxes.
22. Responsibilities of the client process
component
• The client process starts the driver program
• For example, the client process can be a spark-submit script for
running applications, a spark-shell script, or a custom application
using Spark API
• The client process prepares the class path and all configuration
options for the Spark application
• It also passes application arguments, if any, to the application running
inside the driver
23. Responsibilities of the driver component
• The driver orchestrates and monitors execution of a Spark application
• There’s always one driver per Spark application
• The Spark context and scheduler are responsible for:
• Requesting memory and CPU resources from cluster managers
• Breaking application logic into stages and tasks
• Sending tasks to executors
• Collecting the results
24. Responsibilities of the driver component
Figure 2: Spark runtime components in client deploy mode. The driver is running inside the client’s
JVM process.
25. Responsibilities of the driver component
Two basic ways the driver program can be run are:
• Cluster deploy mode is depicted in figure 1. In this mode, the driver
process runs as a separate JVM process inside a cluster, and the
cluster manages its resources
• Client deploy mode is depicted in figure 2. In this mode, the driver
runs inside the client’s JVM process and communicates with the
executors managed by the cluster
26. Responsibilities of the executors
• The executors, which are JVM processes, accept tasks from the driver,
execute those tasks, and return the results to the driver
• Each executor has several task slots (or CPU cores) for running tasks in
parallel
• Although these task slots are often referred to as CPU cores in Spark,
they’re implemented as threads and don’t need to correspond to the
number of physical CPU cores on the machine
27. Creation of the Spark context
• Once the driver is started, it configures an instance of SparkContext
• When running a standalone Spark application by submitting a jar file,
or by using Spark API from another program, your Spark application
starts and configures the Spark context
• There can be only one Spark context per JVM
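The one-context-per-process rule can be sketched in plain Python (illustrative only; `FakeContext` is a hypothetical stand-in, though real Spark similarly raises an error when a second active SparkContext is created):

```python
# Sketch of the "one active context per process" rule.
class FakeContext:
    _active = None                    # the single active context, if any
    def __init__(self):
        if FakeContext._active is not None:
            raise RuntimeError("Only one context may be active per process")
        FakeContext._active = self
    def stop(self):
        FakeContext._active = None    # releasing allows a new context

ctx = FakeContext()
try:
    FakeContext()                     # second context: rejected
    second_allowed = True
except RuntimeError:
    second_allowed = False
ctx.stop()                            # after stop(), a new context is fine
ctx2 = FakeContext()
```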
28. High-level architecture
• Spark provides a well-defined and layered architecture where all its
layers and components are loosely coupled and integration with
external components/libraries/extensions is performed using
well-defined contracts
29. High-level architecture
• Physical machines: This layer represents the physical or virtual machines/nodes on which Spark jobs are executed. These
nodes collectively represent the total capacity of the cluster with respect to the CPU, memory, and data storage.
• Data storage layer: This layer provides the APIs to store and retrieve the data from the persistent storage area to Spark
jobs/applications. This layer is used by Spark workers to dump data on the persistent storage whenever the cluster
memory is not sufficient to hold the data. Spark is extensible and capable of using any kind of filesystem. RDDs, which hold
the data, are agnostic to the underlying storage layer and can persist the data in various persistent storage areas, such as
local filesystems, HDFS, or other storage systems such as HBase, Cassandra, MongoDB, S3, and Elasticsearch.
• Resource manager: The architecture of Spark abstracts out the deployment of the Spark framework and its associated
applications. Spark applications can leverage cluster managers such as YARN and Mesos for the allocation and deallocation
of various physical resources, such as CPU and memory, for the client jobs. The resource manager layer provides the
APIs that are used to request the allocation and deallocation of available resources across the cluster.
• Spark core libraries: The Spark core library represents the Spark Core engine, which is responsible for the execution of the
Spark jobs. It contains APIs for in-memory distributed data processing and a generalized execution model that supports a
wide variety of applications and languages.
• Spark extensions/libraries: This layer represents the additional frameworks/APIs/libraries developed by extending the
Spark core APIs to support different use cases. For example, Spark SQL is one such extension, which is developed to
perform ad hoc queries and interactive analysis over large datasets.
31. Spark execution model – master worker view
• Spark is built around the concepts of Resilient Distributed Datasets
and the Directed Acyclic Graph (DAG) representing transformations and
dependencies between them
32. Spark execution model – master worker view
• A Spark application (often referred to as the Driver Program or Application
Master) at a high level consists of the SparkContext and user code, which
interacts with it, creating RDDs and performing a series of
transformations to achieve the final result
• These transformations of RDDs are then translated into a DAG and
submitted to the Scheduler to be executed on a set of worker nodes
33. Execution workflow
• User code containing RDD transformations forms a Directed Acyclic Graph,
which is then split into stages of tasks by the DAGScheduler
• Tasks run on workers and the results are then returned to the client
35. Execution workflow
• SparkContext
• represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast
variables on that cluster
• DAGScheduler
• computes a DAG of stages for each job and submits them to TaskScheduler
• determines preferred locations for tasks (based on cache status or shuffle file locations) and finds a minimum
schedule to run the jobs
• TaskScheduler
• responsible for sending tasks to the cluster, running them, retrying if there are failures, and mitigating
stragglers
• SchedulerBackend
• backend interface for scheduling systems that allows plugging in different implementations (Mesos, YARN,
Standalone, local)
• BlockManager
• provides interfaces for putting and retrieving blocks both locally and remotely into various stores (memory,
disk, and off-heap)
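The DAGScheduler's stage-splitting can be sketched in plain Python (illustrative only; `split_into_stages` and the example job are hypothetical, not Spark internals): narrow operations are pipelined into one stage, and each shuffle-style operation starts a new stage.

```python
# Sketch of splitting a job's operations into stages at shuffle boundaries.
SHUFFLE_OPS = {"groupByKey", "reduceByKey", "join"}

def split_into_stages(ops):
    stages, current = [], []
    for op in ops:
        if op in SHUFFLE_OPS and current:
            stages.append(current)        # stage boundary at a shuffle
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages

job = ["map", "filter", "reduceByKey", "map", "join", "map"]
stages = split_into_stages(job)
```

Narrow operations like map and filter never need data from other partitions, so they can run back-to-back in one stage; a shuffle requires redistributing data across the cluster, which is why it forces a stage boundary.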