This presentation on Spark Architecture will give you an idea of what Apache Spark is, the essential features of Spark, and the different Spark components. Here, you will learn about Spark Core, Spark SQL, Spark Streaming, Spark MLlib, and GraphX. You will understand how Spark processes an application and runs it on a cluster with the help of its architecture. Finally, you will see a demo on Apache Spark. So, let's get started with Spark Architecture.
YouTube Video: https://www.youtube.com/watch?v=CF5Ewk0GxiQ
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart an in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
Simplilearn’s Apache Spark and Scala certification training is designed to:
1. Advance your expertise in the Big Data Hadoop Ecosystem
2. Help you master essential Apache Spark skills, such as Spark Streaming, Spark SQL, machine learning programming, GraphX programming, and Spark shell scripting
3. Help you land a Hadoop developer job requiring Apache Spark expertise by giving you a real-life industry project coupled with 30 demos
What skills will you learn?
By completing this Apache Spark and Scala course, you will be able to:
1. Understand the limitations of MapReduce and the role of Spark in overcoming these limitations
2. Understand the fundamentals of the Scala programming language and its features
3. Explain and master the process of installing Spark as a standalone cluster
4. Develop expertise in using Resilient Distributed Datasets (RDD) for creating applications in Spark
5. Master Structured Query Language (SQL) using SparkSQL
6. Gain a thorough understanding of Spark streaming features
7. Master and describe the features of Spark ML programming and GraphX programming
Who should take this Scala course?
1. Professionals aspiring for a career in the field of real-time big data analytics
2. Analytics professionals
3. Research professionals
4. IT developers and testers
5. Data scientists
6. BI and reporting professionals
7. Students who wish to gain a thorough understanding of Apache Spark
Learn more at https://www.simplilearn.com/big-data-and-analytics/apache-spark-scala-certification-training
What's in it for you?
1. What is Spark?
2. Components of Spark
   • Spark Core
   • Spark SQL
   • Spark Streaming
   • Spark MLlib
   • GraphX
3. Apache Spark Architecture
4. Running a Spark Application
What is Apache Spark?

Apache Spark is a top-level open-source cluster computing framework used for real-time processing and analysis of large amounts of data. Its key features are:

• Fast processing: Spark processes data faster since it saves time in reading and writing operations
• Real-time streaming: Spark allows real-time streaming and processing of data
• In-memory computation: Spark has a DAG execution engine that provides in-memory computation
• Fault tolerant: Spark is fault tolerant through RDDs, which are designed to handle the failure of any worker node in the cluster
Spark Core

Spark Core is the base engine for large-scale parallel and distributed data processing. It performs the following:

• Memory management and fault recovery
• Scheduling, distributing, and monitoring jobs on a cluster
• Interacting with storage systems
Spark RDD

Resilient Distributed Datasets (RDDs) are the building blocks of any Spark application. You create an RDD, apply transformations to derive new RDDs, and then run actions to obtain results.

• Transformations are operations (such as map, filter, join, union) that are performed on an RDD and yield a new RDD containing the result
• Actions are operations (such as reduce, first, count) that return a value after running a computation on an RDD
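As a minimal sketch in Scala (assuming a SparkContext sc, as provided by spark-shell), the following chains two transformations and then triggers computation with actions:

val numbers = sc.parallelize(1 to 10)     // create an RDD from a local collection

val evens = numbers.filter(_ % 2 == 0)    // transformation: yields a new RDD, lazily
val doubled = evens.map(_ * 2)            // transformation: yields another new RDD

val total = doubled.reduce(_ + _)         // action: runs the computation, returns 60
val first = doubled.first()               // action: returns 4
val count = doubled.count()               // action: returns 5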
Spark SQL

Spark SQL is Apache Spark's module for working with structured data. Spark SQL features:

• Integrated: You can integrate Spark SQL with Spark programs and query structured data inside Spark programs
• High compatibility: You can run unmodified Hive queries on existing warehouses in Spark SQL; Spark SQL offers full compatibility with existing Hive data, queries, and UDFs
• Scalability: Spark SQL leverages the RDD model, as it supports large jobs and mid-query fault tolerance; moreover, it uses the same engine for both interactive and long queries
• Standard connectivity: You can easily connect Spark SQL through JDBC or ODBC, both of which are industry norms for connectivity to business intelligence tools
Spark SQL architecture: the DataFrame DSL and the Spark SQL and HQL interfaces sit on top of the DataFrame API, which in turn sits on the Data Source API that connects to sources such as CSV, JSON, and JDBC.
Spark SQL has three main layers:

• Language API: Spark is compatible with and even supported by languages like Python, HiveQL, Scala, and Java
• SchemaRDD: As Spark SQL works on schemas, tables, and records, you can use a SchemaRDD or DataFrame as a temporary table
• Data Sources: Data sources for Spark SQL can vary, such as a JSON document, Hive tables, or a Cassandra database
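As a minimal sketch (assuming a SparkSession spark, as provided by spark-shell, and a hypothetical file people.json), a DataFrame can be registered as a temporary table and queried with SQL:

val people = spark.read.json("people.json")   // Data Sources layer: a JSON document

people.createOrReplaceTempView("people")      // use the DataFrame as a temporary table

val teens = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teens.show()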
Spark allows you to define custom SQL functions called User Defined Functions (UDFs). The UDF below removes all whitespace from a string and lowercases all of its characters:

import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._

def lowerRemoveAllWhiteSpaces(s: String): String = {
  s.toLowerCase().replaceAll("\\s", "")
}

val lowerRemoveAllWhiteSpacesUDF = udf[String, String](lowerRemoveAllWhiteSpaces)

val sourceDF = Seq(
  "  WELCOME  ",
  "  SpaRk SqL  "
).toDF("text")

sourceDF.select(
  lowerRemoveAllWhiteSpacesUDF(col("text")).as("clean_text")
).show()

Output:

clean_text
welcome
sparksql
Spark Streaming

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many streaming and static data sources, and the processed data can be pushed out to different filesystems and data storage.
Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches (input data stream → streaming engine → batches of input data → batches of processed data).
Here is an example of a basic RDD operation used to extract individual words from lines of text in an input data stream: a lines DStream holds the lines received in each time interval (time 0-1, 1-2, 2-3, 3-4), and applying the flatMap operation produces a words DStream holding the words from each corresponding interval.
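A minimal sketch of this pipeline (assuming text arrives on a TCP socket; the host and port are hypothetical):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("WordSplit")
val ssc = new StreamingContext(conf, Seconds(1))     // 1-second batch interval

val lines = ssc.socketTextStream("localhost", 9999)  // lines DStream (hypothetical source)
val words = lines.flatMap(_.split(" "))              // words DStream via flatMap
words.print()

ssc.start()             // start receiving and processing data
ssc.awaitTermination()  // wait for the streaming computation to finish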
Spark MLlib

MLlib is Spark's machine learning library. Its goal is to make practical machine learning scalable and easy. At a high level, it provides the following:

• ML Algorithms: classification, regression, clustering, and collaborative filtering
• Featurization: feature extraction, transformation, dimensionality reduction, and selection
• Pipelines: tools for constructing, evaluating, and tuning ML pipelines
• Utilities: linear algebra, statistics, data handling
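As a minimal sketch (assuming a SparkSession spark and a hypothetical dataset.csv with numeric columns x and y), featurization and an ML algorithm can be combined like this:

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler

val data = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("dataset.csv")                  // hypothetical input file

// Featurization: assemble the raw columns into a single feature vector
val assembler = new VectorAssembler()
  .setInputCols(Array("x", "y"))
  .setOutputCol("features")
val features = assembler.transform(data)

// ML algorithm: cluster the points into two groups
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(features)
model.clusterCenters.foreach(println)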
GraphX

GraphX is a component in Spark for graphs and graph-parallel computation. GraphX is used to model relations between objects: a graph has vertices (objects) and edges (relationships). For example, the vertices Mathew and Justin can be connected by an edge representing the relationship "Friends".
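A minimal sketch of this graph in GraphX (assuming a SparkContext sc):

import org.apache.spark.graphx.{Edge, Graph}

// Two vertices (the objects) joined by one edge (the relationship)
val vertices = sc.parallelize(Seq(
  (1L, "Mathew"),
  (2L, "Justin")
))
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, "Friends")
))

val graph = Graph(vertices, edges)
graph.triplets.collect().foreach { t =>
  println(s"${t.srcAttr} -- ${t.attr} --> ${t.dstAttr}")   // Mathew -- Friends --> Justin
}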
GraphX provides a uniform tool for ETL, exploratory data analysis, and interactive graph computations.
The following are applications of GraphX:

1. PageRank
2. Fraud detection
3. Geographic information systems
4. Disaster management
Spark Architecture

Spark Architecture is based on two important abstractions:

• Resilient Distributed Dataset (RDD): RDDs are the fundamental units of data in Apache Spark. They are split into partitions and can be executed on different nodes of a cluster.
• Directed Acyclic Graph (DAG): The DAG is the scheduling layer of the Spark Architecture; it implements stage-oriented scheduling and eliminates the Hadoop MapReduce multistage execution model. For example, parallelize, filter, and map can form Stage 1, while reduceByKey and map form Stage 2.
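For example, a word-count pipeline (assuming a SparkContext sc) forms a two-stage DAG of this shape:

val counts = sc.parallelize(Seq("spark", "", "scala", "spark"))
  .filter(_.nonEmpty)        // Stage 1: narrow dependency, pipelined
  .map(word => (word, 1))    // Stage 1: narrow dependency, pipelined
  .reduceByKey(_ + _)        // Stage 2: the shuffle boundary starts a new stage

counts.collect().foreach(println)   // e.g. (spark,2), (scala,1)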
Apache Spark uses a master-slave architecture that consists of a driver, which runs on a master node, and multiple executors, which run across the worker nodes in the cluster.

• The Master Node hosts the Driver Program
• Your Spark code behaves as the driver program and creates a SparkContext, which is the gateway to all Spark functionalities
• Spark applications run as independent sets of processes on a cluster
• The driver program and SparkContext take care of the job execution within the cluster through the Cluster Manager
• A job is split into multiple tasks that are distributed over the worker nodes
• When an RDD is created in the SparkContext, it can be distributed across various nodes
• Worker nodes are slaves that execute different tasks; each worker node runs an Executor that holds a cache and runs tasks
• The Executor is responsible for the execution of these tasks
• Worker nodes execute the tasks assigned by the Cluster Manager and return the results back to the SparkContext
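A minimal sketch of such a driver program (the standalone master URL spark://master:7077 is a placeholder assumption):

import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    // Creating the SparkContext is what connects the driver to the
    // cluster manager, which in turn allocates executors on worker nodes
    val conf = new SparkConf()
      .setAppName("MyApp")
      .setMaster("spark://master:7077")   // placeholder master URL
    val sc = new SparkContext(conf)

    // Each of the 8 partitions becomes a task scheduled on an executor
    val sum = sc.parallelize(1 to 1000, numSlices = 8).sum()
    println(sum)

    sc.stop()
  }
}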
How does a Spark application run on a cluster?

Spark applications run as independent processes, coordinated by the SparkSession object in the driver program.
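A minimal sketch of creating that coordinating SparkSession (the master setting is an assumption; on a real cluster it is usually supplied by spark-submit):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("MyApp")
  .master("local[*]")   // assumption: run locally; could be a standalone or YARN master
  .getOrCreate()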
The resource manager or cluster manager assigns tasks to workers, one task per partition.
• A task applies its unit of work to the dataset in its partition and outputs a new partition dataset
• Because iterative algorithms apply operations repeatedly to data, they benefit from caching datasets across iterations (as sketched below)
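A minimal sketch of that caching benefit (assuming a SparkContext sc and a hypothetical points.txt with one number per line):

val data = sc.textFile("points.txt").map(_.toDouble).cache()  // keep partitions in executor memory

for (i <- 1 to 10) {
  // each iteration reuses the cached partitions instead of re-reading from disk
  println(s"iteration $i, mean = ${data.mean()}")
}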
Results are sent back to the driver application or can be saved to disk.