What is Apache Spark? Key Features and Advantages of the Popular Big Data Processing Framework


Apache Spark is an open source big data processing framework that is generally faster than Hadoop MapReduce, easier to use, and supports more types of analytics. It provides high-level APIs, can run computations directly in memory for faster performance, and supports a variety of data processing workloads including SQL queries, streaming data, machine learning, and graph processing. Spark also has a large ecosystem of additional libraries and tools that expand its capabilities.


WHAT IS SPARK
Apache Spark is an open source big data
processing framework built around speed, ease
of use, and sophisticated analytics. It was
originally developed in 2009 in UC Berkeley's
AMPLab, open sourced in 2010, and donated to
the Apache Software Foundation in 2013.
B I G D A T A W O R K G R O U P . I R

WHAT IS SPARK
Advantages: In Memory
• Spark can run applications in Hadoop clusters up to 100
times faster than Hadoop MapReduce in memory, and up to 10
times faster even when running on disk. These are best-case
figures; actual speedups depend on the workload.

WHAT IS SPARK
Advantages: Generic API
• Spark lets you quickly write applications in Java, Scala, or
Python. It comes with a built-in set of over 80 high-level
operators, and you can use it interactively to query data from
the shell.

WHAT IS SPARK
Advantages: Many Applications
• Spark provides a comprehensive, unified framework for
big data processing across data sets that are diverse in
nature (text data, graph data, etc.) and in source
(batch vs. real-time streaming data).

WHAT IS SPARK
Advantages: Many Applications
• In addition to Map and Reduce operations, it supports SQL
queries, streaming data, machine learning, and graph data
processing. Developers can use these capabilities stand-alone
or combine them in a single data pipeline.
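To make "Map and Reduce operations" concrete, here is a minimal single-process sketch in plain Python. It is not the Spark API: the names `map_phase`, `reduce_phase`, and `word_count` are illustrative only. The map step emits a (word, 1) pair per word, and the reduce step sums the counts per key.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in every line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def reduce_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Reduce: sum the counts of all pairs that share the same key."""
    counts: Dict[str, int] = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Chain the two phases into one small pipeline."""
    return reduce_phase(map_phase(lines))
```

In a real cluster the pairs would be grouped by key across machines between the two phases; that grouping step is the shuffle.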

HADOOP AND SPARK
Hadoop:
• MapReduce is suitable for one-pass computations.
• Clusters are hard to set up and manage.
• Needs to be integrated with other tools such as Mahout
(machine learning) and Storm (streaming data processing).
Spark:
• Supports multi-step data pipelines using the
directed acyclic graph (DAG) pattern.
• Supports in-memory data sharing across DAGs.
• Serves as an alternative to Hadoop MapReduce.

SPARK FEATURES
• Less expensive shuffles in data processing, along with capabilities
like in-memory data storage.
• Lazy evaluation of big data queries, which helps optimize the
steps in data processing workflows.
• A higher-level API to improve developer productivity, and a consistent
architectural model for big data solutions.
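The lazy evaluation mentioned above can be illustrated with a toy model in plain Python. This is a sketch of the idea, not Spark's implementation; the class name `LazyPipeline` and its methods are assumptions made for the example. Transformations only record a plan, and the work happens in a single pass when an action (`collect`) is called.

```python
from typing import Any, Callable, Iterable, List, Tuple


class LazyPipeline:
    """Toy model of lazy evaluation: transformations are recorded, not run."""

    def __init__(self, source: Iterable[Any],
                 steps: Tuple[Tuple[str, Callable], ...] = ()):
        self._source = source
        self._steps = tuple(steps)  # recorded (kind, function) pairs

    def map(self, fn: Callable[[Any], Any]) -> "LazyPipeline":
        # Record the step and return a new pipeline; nothing executes here.
        return LazyPipeline(self._source, self._steps + (("map", fn),))

    def filter(self, pred: Callable[[Any], bool]) -> "LazyPipeline":
        return LazyPipeline(self._source, self._steps + (("filter", pred),))

    def collect(self) -> List[Any]:
        """Action: only now is the recorded plan executed, in one pass."""
        out = []
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):  # a filter step rejected the item
                    keep = False
                    break
            if keep:
                out.append(item)
        return out
```

Because the whole plan is known before anything runs, an engine is free to reorder or fuse steps; this toy version simply fuses all steps into one loop over the data.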

SPARK FEATURES
• Spark holds intermediate results in memory rather than writing them to
disk.
• Spark can also process datasets that are larger than the aggregate
memory in a cluster, spilling to disk when data does not fit in memory.

SPARK ECOSYSTEM
Spark Streaming
• Micro-batch style of computing and processing (DStream).
Spark SQL
• JDBC API, SQL-like queries, ETL.
Spark MLlib
• Machine learning library including classification, regression, clustering,
collaborative filtering, dimensionality reduction, as well as underlying
optimization primitives.

SPARK ECOSYSTEM
Spark GraphX
• GraphX extends the Spark RDD by introducing the
Resilient Distributed Property Graph.
• Provides a set of fundamental operators (e.g., subgraph,
joinVertices, and aggregateMessages).
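As a rough illustration of what an operator like aggregateMessages does, the sketch below models the idea in plain Python: each edge may send messages to vertices, and messages arriving at the same vertex are merged. This is a simplified, single-machine stand-in, not the GraphX API; the function names and the edge representation are assumptions.

```python
from typing import Callable, Dict, Hashable, List, Tuple, TypeVar

Msg = TypeVar("Msg")
Edge = Tuple[Hashable, Hashable]  # (source vertex id, destination vertex id)


def aggregate_messages(
    edges: List[Edge],
    send: Callable[[Hashable, Hashable], List[Tuple[Hashable, Msg]]],
    merge: Callable[[Msg, Msg], Msg],
) -> Dict[Hashable, Msg]:
    """For each edge, send messages to vertices; merge messages per vertex."""
    inbox: Dict[Hashable, Msg] = {}
    for src, dst in edges:
        for target, msg in send(src, dst):
            inbox[target] = merge(inbox[target], msg) if target in inbox else msg
    return inbox


def in_degrees(edges: List[Edge]) -> Dict[Hashable, int]:
    """In-degree: every edge sends 1 to its destination; messages are summed."""
    return aggregate_messages(edges, lambda s, d: [(d, 1)], lambda a, b: a + b)
```

Vertices that receive no message do not appear in the result, so a vertex with in-degree zero is simply absent from the dictionary returned by `in_degrees`.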

SPARK ECOSYSTEM
BlinkDB
• Trades off query accuracy for response time.
Tachyon
• Caches working set files in memory.
Spark Cassandra Connector
• Accesses data stored in a Cassandra database.
SparkR

SPARK ARCHITECTURE

RESILIENT DISTRIBUTED DATASETS
• Fault tolerance: an RDD knows how to recreate and recompute its
datasets.
• RDDs are immutable.
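The two properties above (recomputation from lineage and immutability) can be sketched with a toy class in plain Python. This is only a model of the idea, not how Spark implements RDDs; `ToyRDD`, `compute`, and `lose_data` are hypothetical names for the example.

```python
from typing import Callable, List, Optional, Tuple


class ToyRDD:
    """Dataset that remembers how it was derived (its lineage)."""

    def __init__(self, source: Tuple[int, ...] = (),
                 parent: Optional["ToyRDD"] = None,
                 fn: Optional[Callable[[int], int]] = None):
        self._source = source   # original input, only used by the root
        self._parent = parent   # lineage: the dataset this one derives from
        self._fn = fn           # lineage: how each element is derived
        self._cache: Optional[List[int]] = None  # computed data, may be lost

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        # Transformations never modify self; they return a new ToyRDD.
        return ToyRDD(parent=self, fn=fn)

    def compute(self) -> List[int]:
        """Return the data, rebuilding it from the lineage if it is missing."""
        if self._cache is None:
            if self._parent is None:
                self._cache = list(self._source)
            else:
                self._cache = [self._fn(x) for x in self._parent.compute()]
        return list(self._cache)

    def lose_data(self) -> None:
        """Simulate losing the in-memory data, e.g. after a node failure."""
        self._cache = None
```

After `lose_data()`, calling `compute()` again yields the same values, because the lineage (parent plus function) is enough to rebuild them. No replica of the data itself is needed.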

RDD OPERATIONS

HOW TO RUN SPARK

HOW TO INTERACT WITH SPARK
spark-shell.cmd

SPARK WEB CONSOLE
http://localhost:4040

SHARED VARIABLES
Broadcast Variables
Accumulators
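A small plain-Python sketch can show the intent behind the two kinds of shared variables: a broadcast variable is a read-only value made available to every task, and an accumulator is an add-only value that tasks update and the driver reads. The code below simulates tasks with threads and is not the Spark API; `Accumulator` and `count_unknown_codes` are names invented for this example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from types import MappingProxyType
from typing import Dict, List, Tuple


class Accumulator:
    """Add-only counter: tasks may only add, the driver reads the total."""

    def __init__(self, initial: int = 0):
        self._value = initial
        self._lock = threading.Lock()

    def add(self, amount: int) -> None:
        with self._lock:  # tasks run concurrently, so guard the update
            self._value += amount

    @property
    def value(self) -> int:
        return self._value


def count_unknown_codes(partitions: List[List[str]],
                        country_names: Dict[str, str]) -> Tuple[List[List[str]], int]:
    # Read-only view shared by all tasks, standing in for a broadcast variable.
    lookup = MappingProxyType(dict(country_names))
    unknown = Accumulator()

    def task(partition: List[str]) -> List[str]:
        resolved = []
        for code in partition:
            if code in lookup:
                resolved.append(lookup[code])
            else:
                unknown.add(1)  # side channel back to the driver
        return resolved

    with ThreadPoolExecutor(max_workers=2) as pool:
        results = list(pool.map(task, partitions))
    return results, unknown.value
```

The lookup table is shared once rather than copied into every task, and the counter is only read after all tasks have finished.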

SPARK ECOSYSTEM
Spark SQL
• JDBC API, SQL-like queries, ETL.

SPARK ECOSYSTEM
Spark Streaming
• Micro-batch style of computing and processing (DStream).
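The micro-batch style can be sketched in plain Python as grouping a time-ordered event stream into fixed-length intervals, each of which is then processed as a small batch. This is an illustration of the idea only, not the Spark Streaming API; the `micro_batches` function and the `(timestamp, payload)` event shape are assumptions.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, payload)


def micro_batches(events: Iterable[Event], interval: float) -> Iterator[List[str]]:
    """Group a time-ordered event stream into fixed-length batches."""
    batch: List[str] = []
    window_end: Optional[float] = None
    for ts, payload in events:
        if window_end is None:
            window_end = ts + interval  # first window starts at the first event
        while ts >= window_end:         # close every interval that has elapsed
            yield batch
            batch = []
            window_end += interval
        batch.append(payload)
    if batch:
        yield batch                     # flush the last, partially filled batch
```

An interval in which no event arrives still produces a batch; it is simply empty. Each yielded list can then be handed to ordinary batch code.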


