Real-time Processing Systems
Apache Spark
Apache Spark
• Apache Spark is a lightning-fast cluster computing framework designed for fast
computation
• It is based on the Hadoop MapReduce model and extends it to efficiently support
more types of computation, including interactive queries and stream processing
• Spark is not a modified version of Hadoop and is not really dependent on
Hadoop, because it has its own cluster management
• Spark can use Hadoop in two ways – one is storage and the other is processing.
Since Spark has its own cluster management and computation engine, it typically
uses Hadoop for storage only
Apache Spark
• The main feature of Spark is its in-memory cluster computing, which
increases the processing speed of an application
• Spark is designed to cover a wide range of workloads, such as batch
applications, iterative algorithms, interactive queries, and streaming
• Besides supporting all these workloads in a single system, it reduces the
management burden of maintaining separate tools
Features of Apache Spark
• Speed − Spark helps run an application in a Hadoop cluster up to 100 times
faster in memory, and 10 times faster when running on disk. This is possible
by reducing the number of read/write operations to disk; intermediate
processing data is stored in memory
• Supports multiple languages − Spark provides built-in APIs in Java, Scala,
and Python, so you can write applications in different languages
• Advanced Analytics − Spark not only supports 'map' and 'reduce'; it also
supports SQL queries, streaming data, machine learning (ML), and graph
algorithms. The word-count sketch below shows the map/reduce style on RDDs
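To make the map/reduce point concrete, here is a minimal Scala word-count sketch. The SparkContext `sc` (as provided by spark-shell, for example) and the "input.txt" path are assumptions for illustration, not part of the original slides:

```scala
// Classic word count expressed as 'map'-style and 'reduce'-style steps on an RDD.
// Assumes an existing SparkContext `sc` and a placeholder input path.
val lines  = sc.textFile("input.txt")
val counts = lines
  .flatMap(line => line.split(" "))   // map side: split each line into words
  .map(word => (word, 1))             // emit (word, 1) pairs
  .reduceByKey(_ + _)                 // reduce side: sum the counts per word
counts.take(10).foreach(println)      // action: triggers the computation
```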
Components of Spark
• The following components make up the Spark stack
Apache Spark Core
• Spark Core is the underlying general execution engine of the Spark platform that all
other functionality is built upon. It provides in-memory computing and the ability to
reference datasets in external storage systems
Components of Spark
Spark SQL
• Spark SQL is a component on top of Spark Core that introduces a data abstraction called
SchemaRDD (renamed DataFrame in later releases), which provides support for structured and
semi-structured data; see the sketch after this list
Spark Streaming
• Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics.
It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets) transformations
on those mini-batches of data
MLlib (Machine Learning Library)
• MLlib is a distributed machine learning framework that runs on top of Spark and benefits
from Spark's distributed, memory-based architecture. Spark MLlib is reported to be up to
nine times as fast as the Hadoop disk-based version of Apache Mahout
GraphX
• GraphX is a distributed graph-processing framework on top of Spark. It provides an API for
expressing graph computations that can model user-defined graphs using the Pregel
abstraction API
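As a hedged illustration of the Spark SQL component, here is a small Scala sketch. It assumes a Spark 1.4-era API (by which point SchemaRDD had been renamed DataFrame), an existing SparkContext `sc`, and a hypothetical "people.json" file with one JSON object per line:

```scala
import org.apache.spark.sql.SQLContext

// Minimal Spark SQL sketch: load semi-structured data and query it with SQL.
val sqlContext = new SQLContext(sc)
val people = sqlContext.read.json("people.json")   // hypothetical input file
people.registerTempTable("people")                 // expose the data to SQL
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.collect().foreach(println)
```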
Spark Architecture
Spark's architecture includes the following three main components:
• Data Storage
• API
• Resource Management
Data Storage:
• Spark uses the HDFS file system for data storage purposes. It works with
any Hadoop-compatible data source, including HDFS, HBase,
Cassandra, etc.
Spark Architecture
API:
• The API lets application developers create Spark-based
applications using a standard API interface. Spark provides APIs for the
Scala, Java, and Python programming languages
Resource Management:
• Spark can be deployed as a stand-alone server, or it can run on a
distributed computing framework such as Mesos or YARN
Resilient Distributed Datasets
• The Resilient Distributed Dataset (RDD) is the core concept in the Spark framework
• Spark stores data in RDDs spread across different partitions
• RDDs help with rearranging computations and optimizing data processing
• They are also fault tolerant, because an RDD knows how to recreate
and recompute its dataset
• RDDs are immutable. You can modify an RDD with a transformation,
but the transformation returns a new RDD while the original
remains the same, as the sketch below shows
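A tiny sketch of that immutability, assuming an existing SparkContext `sc`:

```scala
// A transformation returns a new RDD and leaves the original untouched.
val nums    = sc.parallelize(1 to 5)
val doubled = nums.map(_ * 2)            // new RDD; `nums` is not modified
println(nums.collect().mkString(","))    // 1,2,3,4,5
println(doubled.collect().mkString(",")) // 2,4,6,8,10
```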
Resilient Distributed Datasets
• RDDs provide an API for various transformations and materializations of
data, as well as for control over the caching and partitioning of elements
to optimize data placement
• An RDD can be created either from external storage or from another RDD,
and it stores information about its parents so that a lost partition can be
recomputed in case of failure
Resilient Distributed Datasets
RDDs support two types of operations:
• Transformation: Transformations don't return a single value; they return a
new RDD. Nothing gets evaluated when you call a transformation function;
it just takes an RDD and returns a new RDD
• Some of the transformation functions are map, filter, flatMap, groupByKey,
reduceByKey, aggregateByKey, pipe, and coalesce
• Action: An action evaluates and returns a value. When an
action function is called on an RDD object, all the pending data processing
is computed at that time and the resulting value is returned, as sketched below
• Some of the action operations are reduce, collect, count, first, take,
countByKey, and foreach
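A minimal sketch of this lazy-then-eager behaviour, again assuming an existing SparkContext `sc`:

```scala
// Transformations only record lineage; the action at the end triggers execution.
val data     = sc.parallelize(1 to 1000000)
val filtered = data.filter(_ % 2 == 0)   // transformation: nothing runs yet
val mapped   = filtered.map(_ * 10)      // transformation: still nothing runs
val total    = mapped.count()            // action: the whole pipeline executes now
println(total)                           // 500000
```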
RDD Persistence
• One of the most important capabilities in Spark is persisting (or
caching) a dataset in memory across operations
• When you persist an RDD, each node stores any partitions of it that it
computes in memory and reuses them in other actions on that
dataset. This allows future actions to be much faster
• Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will
automatically be recomputed using the transformations that
originally created it
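A short persistence sketch; the SparkContext `sc` and the "events.log" path are assumptions for illustration:

```scala
import org.apache.spark.storage.StorageLevel

// Cache a filtered RDD so later actions reuse the computed partitions.
val logs   = sc.textFile("events.log")
val errors = logs.filter(_.contains("ERROR"))
errors.persist(StorageLevel.MEMORY_ONLY)  // equivalent to errors.cache()
println(errors.count())  // first action: computes and caches the partitions
println(errors.first())  // second action: served from the in-memory cache
```

When the cached data is no longer needed, errors.unpersist() frees the memory.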
Components
• Spark applications run as independent sets of processes on a cluster,
coordinated by the SparkContext object in the main program (called the driver
program)
• The SparkContext can connect to several types of cluster managers (Spark's
own standalone cluster manager, Mesos, or YARN), which allocate
resources across applications; the master-URL sketch below shows one way each
is addressed
• Spark acquires executors on nodes in the cluster, which are processes that
run computations and store data for the application
• Next, it sends application code (defined by JAR or Python files passed to
SparkContext) to the executors
• Finally, SparkContext sends tasks to the executors to run
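As a rough illustration, the master URLs below show one way a driver can address each of these cluster managers; the host names and ports are placeholders, not real endpoints:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Pointing the SparkContext at different cluster managers.
val conf = new SparkConf().setAppName("ClusterDemo")
conf.setMaster("spark://master-host:7077")    // Spark's standalone cluster manager
// conf.setMaster("mesos://master-host:5050") // Apache Mesos
// conf.setMaster("yarn")                     // YARN (Spark 2.x; details come from the Hadoop config)
val sc = new SparkContext(conf)
```

In practice the master is usually supplied through spark-submit's --master flag rather than hard-coded in the application.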
Components
There are several useful things to note about this architecture:
• Each application gets its own executor processes, which stay up for
the duration of the whole application and run tasks in multiple
threads
• The driver program must listen for and accept incoming connections
from its executors throughout its lifetime. As such, the driver program
must be network addressable from the worker nodes
• Because the driver schedules tasks on the cluster, it should be run
close to the worker nodes, preferably on the same local area network
Spark Streaming
• Spark Streaming is an extension of the core Spark API that enables scalable,
high-throughput, fault-tolerant stream processing of live data streams
• Data can be ingested from many sources like Kafka, Flume, Kinesis, or TCP
sockets, and can be processed using complex algorithms expressed with
high-level functions like map, reduce, join and window
• Finally, processed data can be pushed out to filesystems, databases, and live dashboards
Spark Streaming
• Spark Streaming works by dividing the live stream of data
into batches (called micro-batches) of a pre-defined interval (N
seconds) and then treating each batch of data as an RDD
• It is important to choose the time interval for Spark Streaming based
on your use case and data processing requirements
• If the value of N is too low, the micro-batches will not have
enough data to give meaningful results during the analysis, while a very
large N adds end-to-end latency; the sketch below uses a five-second interval
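A minimal streaming word-count sketch, assuming a local run and a hypothetical TCP text source on localhost:9999; the `Seconds(5)` batch interval is an illustrative choice of N:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// N = 5-second micro-batches over a TCP text source.
// `local[2]` leaves one core free for the receiver thread.
val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(5))      // the batch interval N
val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()          // emit the result of each micro-batch
ssc.start()             // begin receiving and processing micro-batches
ssc.awaitTermination()
```

Note that each micro-batch is processed independently; carrying state across batches requires operations such as updateStateByKey or windowing.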
Spark Streaming
Figure: How Spark Streaming works
Spark Streaming
• Spark Streaming receives live input data streams and divides the data
into batches, which are then processed by the Spark engine to
generate the final stream of results in batches
• Spark Streaming provides a high-level abstraction called discretized
stream or DStream, which represents a continuous stream of data.
Internally, a DStream is represented as a sequence of RDDs
Discretized Streams (DStreams)
• A DStream represents a continuous stream of data: either the input data
stream received from a source, or the processed data stream generated
by transforming the input stream
• Internally, a DStream is represented by a continuous series of RDDs,
which is Spark’s abstraction of an immutable, distributed dataset
• Each RDD in a DStream contains data from a certain interval
Spark runtime components
Figure 1: Spark runtime components in cluster deploy mode. Elements of a Spark application are in blue
boxes and an application’s tasks running inside task slots are labeled with a “T”. Unoccupied task slots
are in white boxes.
Responsibilities of the client process component
• The client process starts the driver program
• For example, the client process can be a spark-submit script for
running applications, a spark-shell script, or a custom application
using the Spark API
• The client process prepares the class path and all configuration
options for the Spark application
• It also passes application arguments, if any, to the application running
inside the driver
Responsibilities of the driver component
• The driver orchestrates and monitors execution of a Spark application
• There’s always one driver per Spark application
• The Spark context and the schedulers are responsible for:
• Requesting memory and CPU resources from cluster managers
• Breaking application logic into stages and tasks
• Sending tasks to executors
• Collecting the results
Responsibilities of the driver component
Figure 2: Spark runtime components in client deploy mode. The driver is running inside the client’s
JVM process.
Responsibilities of the driver component
Two basic ways the driver program can be run are:
• Cluster deploy mode is depicted in figure 1. In this mode, the driver
process runs as a separate JVM process inside a cluster, and the
cluster manages its resources
• Client deploy mode is depicted in figure 2. In this mode, the driver runs
inside the client's JVM process and communicates with the
executors managed by the cluster
Responsibilities of the executors
• The executors, which are JVM processes, accept tasks from the driver,
execute those tasks, and return the results to the driver
• Each executor has several task slots (or CPU cores) for running tasks in
parallel
• Although these task slots are often referred to as CPU cores in Spark,
they’re implemented as threads and don’t need to correspond to the
number of physical CPU cores on the machine
Creation of the Spark context
• Once the driver is started, it configures an instance of SparkContext
• When you run a standalone Spark application by submitting a jar file,
or use the Spark API from another program, your application
starts and configures the Spark context
• There can be only one Spark context per JVM; a minimal sketch follows
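A minimal sketch of that lifecycle; the application name and the `local[*]` master are illustrative choices:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Configure, use, and stop the single Spark context in this JVM.
val conf = new SparkConf().setAppName("ContextDemo").setMaster("local[*]")
val sc   = new SparkContext(conf)
try {
  println(sc.parallelize(1 to 100).sum())  // 5050.0
} finally {
  sc.stop()  // release the context so a new one may be created in this JVM
}
```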
High-level architecture
• Spark provides a well-defined and layered architecture where all its
layers and components are loosely coupled, and integration with
external components/libraries/extensions is performed using
well-defined contracts
High-level architecture
• Physical machines: This layer represents the physical or virtual machines/nodes on which Spark jobs are executed. These
nodes collectively represent the total capacity of the cluster with respect to CPU, memory, and data storage.
• Data storage layer: This layer provides the APIs to store and retrieve data from the persistent storage area for Spark
jobs/applications. It is used by Spark workers to dump data to persistent storage whenever the cluster
memory is not sufficient to hold it. Spark is extensible and capable of using any kind of filesystem. RDDs, which hold
the data, are agnostic to the underlying storage layer and can persist data in various persistent stores, such as
local filesystems, HDFS, or other systems such as HBase, Cassandra, MongoDB, S3, and Elasticsearch.
• Resource manager: The architecture of Spark abstracts out the deployment of the Spark framework and its associated
applications. Spark applications can leverage cluster managers such as YARN and Mesos for the allocation and deallocation
of physical resources, such as CPU and memory, for client jobs. The resource manager layer provides the
APIs used to request the allocation and deallocation of available resources across the cluster.
• Spark core libraries: The Spark core library represents the Spark Core engine, which is responsible for the execution of
Spark jobs. It contains APIs for in-memory distributed data processing and a generalized execution model that supports a
wide variety of applications and languages.
• Spark extensions/libraries: This layer represents the additional frameworks/APIs/libraries developed by extending the
Spark core APIs to support different use cases. For example, Spark SQL is one such extension, developed to
perform ad hoc queries and interactive analysis over large datasets.
Spark execution model – master worker view
Spark execution model – master worker view
• Spark is built around the concepts of Resilient Distributed Datasets
and the Directed Acyclic Graph (DAG), which represents the transformations and
dependencies between them
Spark execution model – master worker view
• A Spark application (often referred to as the Driver Program or Application
Master) at a high level consists of a SparkContext and user code, which
interacts with it by creating RDDs and performing a series of
transformations to achieve the final result
• These transformations of RDDs are then translated into a DAG and
submitted to the scheduler to be executed on a set of worker nodes
Execution workflow
• User code containing RDD transformations forms a Directed Acyclic Graph,
which is then split into stages of tasks by the DAGScheduler; the lineage
sketch below makes this visible
• Tasks run on the workers, and the results are then returned to the client
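One hedged way to see the DAG-to-stages split is RDD.toDebugString, sketched here with a placeholder input path and an assumed SparkContext `sc`:

```scala
// toDebugString prints the lineage DAG that the DAGScheduler splits into
// stages; the indentation steps in its output mark shuffle boundaries.
val counts = sc.textFile("input.txt")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)   // the shuffle here starts a new stage
println(counts.toDebugString)
```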
Execution workflow
• SparkContext
• represents the connection to a Spark cluster, and can be used to create RDDs, accumulators, and broadcast
variables on that cluster (sketched after this list)
• DAGScheduler
• computes a DAG of stages for each job and submits them to TaskScheduler
• determines preferred locations for tasks (based on cache status or shuffle file locations) and finds a minimum
schedule to run the jobs
• TaskScheduler
• responsible for sending tasks to the cluster, running them, retrying if there are failures, and mitigating
stragglers
• SchedulerBackend
• backend interface for scheduling systems that allows plugging in different implementations (Mesos, YARN,
Standalone, local)
• BlockManager
• provides interfaces for putting and retrieving blocks both locally and remotely into various stores (memory,
disk, and off-heap)
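As a small companion to the SparkContext bullet above, here is a sketch of broadcast variables and accumulators; it assumes the Spark 2.x accumulator API, and the lookup table, counter name, and data are illustrative:

```scala
// Broadcast: a read-only value shipped once per executor.
// Accumulator: a counter that tasks update and the driver reads.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
val misses = sc.longAccumulator("lookup-misses")
val codes  = sc.parallelize(Seq("a", "b", "c")).map { k =>
  lookup.value.getOrElse(k, { misses.add(1); -1 })  // count keys missing from the map
}
println(codes.collect().mkString(","))  // 1,2,-1
println(misses.value)                   // 1, once the action above has run
```

Broadcasting avoids re-shipping the map with every task, and accumulator updates flow back to the driver only through completed tasks.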
References
• Data Stream Management Systems: Apache Spark Streaming
• http://freecontent.manning.com/running-spark-an-overview-of-sparks-runtime-architecture/
• https://www.packtpub.com/books/content/spark-%E2%80%93-architecture-and-first-program
• https://0x0fff.com/spark-architecture/
• http://datastrophic.io/core-concepts-architecture-and-internals-of-apache-spark/
• https://github.com/apache/spark
• http://spark.apache.org/docs/latest/
• https://github.com/JerryLead/SparkInternals
THANKS!