Introduction to Apache Spark
Prepared by G Nagarajan
Outline
• What is Spark?
• Why is Spark important in business analytics?
• Which Industries Make Use of Spark?
• Limitations of Spark
• Pros and Cons
• Comparison between Spark and MapReduce
• Conclusion
What is Spark?
• Apache Spark is a fast cluster computing platform designed for
high-performance data processing. It builds on the ideas of Hadoop MapReduce
and extends the MapReduce model to efficiently support other kinds of
computation, such as interactive queries and stream processing. Spark's primary
feature is in-memory cluster computing, which improves an application's
processing speed.
• Spark is designed to support a variety of workloads, including batch
applications, iterative algorithms, interactive queries, and streaming.
Supporting all of these workloads in a single system reduces the administrative
effort of maintaining separate tools.
Why is Spark important in business analytics?
Apache Spark, a unified analytics engine, has been adopted
rapidly by businesses across a wide range of sectors since
its introduction. Internet giants such as Netflix, Yahoo, and
eBay have deployed Spark at massive scale, processing several
petabytes of data across clusters of over 8,000 nodes.
When it comes to business analytics, why is Spark so important?
1. Spark enables use cases “traditional” Hadoop can’t handle
Spark extends the batch model of Hadoop MapReduce with in-memory distributed
computing. This makes possible features such as stream processing, machine
learning, graph computing, and interactive analytics that are impractical with
the batch model alone. As a result, data science applications that were
previously too costly or too slow to run on large data sets are now feasible
in the big data world.
2. Spark is fast
For analytics workloads, Spark can be orders of magnitude faster than existing
Hadoop installations. This improves interactivity, speeds up experimentation,
and raises analyst productivity.
Cont…
3. Spark can use your existing big data investment
When Hadoop came along, businesses had to invest in new compute clusters to
use the technology. That is not the case with Spark: it can run on top of
existing Hadoop deployments to deliver new functionality quickly. Spark is
also highly compatible with the Hadoop ecosystem: it can read data stored in
HDFS and run on top of Hadoop 2.0's YARN. Beyond Hadoop, Spark also works with
Cassandra and Amazon's S3 storage.
4. Spark speaks SQL
SQL is the de facto standard for querying structured data. Spark's SQL module
makes it possible to bring existing data sources, such as Hive, into
computations and to extend existing investments in business intelligence tools
to big data. Spark SQL is still in its infancy compared to other big data SQL
implementations, but it is gaining traction.
5. Spark is developer-friendly
Never underestimate the power of easy-to-use technology. Although Spark is
written in a relatively new programming language, Scala, developers love how
concise and fluent it is. Java, the language of Hadoop, is also supported, as
is Python, the data scientist's favourite.
6. Open source: Spark is free to download and backed by a large Apache community.
7. Fault tolerant: An Apache Spark RDD is an immutable dataset, each Spark…
8. Supports processing a variety of data: structured, semi-structured
Which Industries Make Use of Spark?
• Apache Spark, a unified analytics engine, has been adopted rapidly by
businesses across a wide range of sectors since its introduction. Internet
giants such as Netflix, Yahoo, and eBay have deployed Spark at massive scale,
processing several petabytes of data across clusters of over 8,000 nodes.
• In the gaming sector, Apache Spark is used to detect patterns in real-time
in-game events and respond to them immediately. This opens up profitable
business opportunities such as targeted advertising, automatic adjustment of
game levels based on complexity, player retention, and more.
Limitations of Spark
1. No file management system: Spark has no built-in file system for managing
files. It depends on external storage such as HDFS or Amazon S3.
2. No support for true real-time processing: Spark does not support fully
real-time processing. Spark Streaming handles live data in small batches
(micro-batches), which gives near-real-time results.
3. Manual optimization: Spark jobs must be tuned manually. The default
partitioning is adequate for some datasets, but to control the number of
partitions ourselves we must set it explicitly, by passing a number as the
second argument to parallelize.
4. Fewer algorithms: Spark's machine learning library, MLlib, offers fewer
algorithms than some alternatives. The Tanimoto distance, for example, is not
available.
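The manual partitioning described in item 3 can be sketched in plain Python. This is only an illustration of what choosing a partition count means, not Spark's own slicing logic; the helper name `split_into_partitions` is hypothetical. In PySpark the same choice is expressed as `sc.parallelize(data, numSlices)`, assuming `sc` is an existing SparkContext.

```python
def split_into_partitions(data, num_partitions):
    """Split a list into num_partitions contiguous, roughly equal slices.

    Hypothetical helper for illustration only; Spark's real partitioning
    code is not reproduced here.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    n = len(data)
    # Slice i covers indices [i*n // k, (i+1)*n // k), so sizes differ by at most 1.
    return [
        data[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]


records = list(range(10))
partitions = split_into_partitions(records, 4)
for index, part in enumerate(partitions):
    print(index, part)
```

Each slice could then be handed to a separate worker. Picking too few partitions leaves workers idle, while picking too many adds scheduling overhead, which is why the number often has to be tuned by hand.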
Features of Spark
Apache Spark has the following features.
• Speed: Spark can run an application in a Hadoop cluster up to 100 times
faster in memory and up to 10 times faster on disk. It achieves this by
reducing the number of read/write operations to disk and keeping intermediate
processing data in memory.
• Supports multiple languages: Spark provides built-in APIs in Java, Scala, and
Python, so you can write applications in different languages. Spark also comes
with 80 high-level operators for interactive querying.
• Advanced analytics: Spark supports more than just ‘map’ and ‘reduce’. It also
supports SQL queries, streaming data, machine learning (ML), and graph
algorithms.
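For readers unfamiliar with the ‘map’ and ‘reduce’ operations mentioned above, here is a minimal sketch of the pattern as a word count in plain Python. It uses only the standard library and runs on one machine; the sample sentences are made up for illustration, and nothing here is Spark's API.

```python
from collections import Counter
from functools import reduce

# Made-up input lines, standing in for records spread across a cluster.
lines = [
    "spark extends the mapreduce model",
    "spark keeps intermediate data in memory",
]

# Map step: turn every line into its own partial word count.
partial_counts = map(lambda line: Counter(line.split()), lines)

# Reduce step: merge the partial counts pairwise into one total.
word_counts = reduce(lambda left, right: left + right, partial_counts, Counter())

print(word_counts["spark"])   # 2
print(word_counts["memory"])  # 1
```

In a cluster setting the map step would run in parallel on each partition and only the small partial counts would be combined, which is what makes the pattern scale.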
Cont…
• Stream processing: Spark supports processing data streams in near real time.
The earlier MapReduce framework could only process data that already existed.
• Lazy evaluation: Transformations on Spark RDDs are lazy, meaning they do not
produce results right away but only define new RDDs from existing ones.
Computation is deferred until an action requests a result, which improves
system efficiency.
• Support for multiple languages: Spark supports R, Scala, Python, and Java,
which gives developers flexibility and overcomes the Hadoop limitation of
developing applications only in Java.
• Hadoop integration: Spark also supports the Hadoop YARN cluster manager,
which makes it flexible to deploy.
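The lazy evaluation bullet above can be illustrated with Python's own lazy iterators. This is an analogy under stated assumptions, not Spark code: `map` and `filter` stand in for RDD transformations, `list` stands in for an action, and the `traced_square` function and `log` list are hypothetical names used only to show when work happens.

```python
log = []


def traced_square(x):
    # Record each call so we can see when work actually happens.
    log.append(x)
    return x * x


numbers = [1, 2, 3, 4]

# "Transformations": building the pipeline computes nothing yet.
squared = map(traced_square, numbers)
evens = filter(lambda value: value % 2 == 0, squared)
calls_before_action = len(log)

# "Action": asking for the result forces the whole pipeline to run.
result = list(evens)

print(calls_before_action)  # 0
print(result)               # [4, 16]
print(len(log))             # 4
```

Because nothing runs until the result is requested, the system sees the whole pipeline before executing it and can skip or combine steps.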
Cont…
• Libraries: Spark includes GraphX for graph-parallel execution, Spark SQL,
libraries for machine learning, and more.
• Cost efficiency: Apache Spark is considered a more cost-efficient solution
than Hadoop, since Hadoop requires large amounts of storage and data-center
capacity for data processing and replication.
• Active developer community: Apache Spark has a large developer base involved
in its continuous development. It is often regarded as one of the most
important projects of the Apache community.
Pros and cons of Spark
Pros | Cons
Speed | No automatic optimization process
Ease of use | No file management system
Advanced analytics | Fewer algorithms
Dynamic in nature | Small files issue
Multilingual | Window criteria
Apache Spark is powerful | Not suited to a multi-user environment
Increased access to big data | -
Demand for Spark developers | -
Comparison between Spark and MapReduce
Apache Spark | MapReduce
Processes data in batches as well as in real time | Processes data in batches only
Runs up to 100 times faster than Hadoop MapReduce | Slower when it comes to large-scale data processing
Stores data in RAM (in memory), so it is faster to retrieve | Stores data in HDFS, so it takes longer to retrieve
Provides caching and in-memory data storage | Highly disk-dependent
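The benefit of caching in the comparison above can be sketched in plain Python. This is a single-machine analogy with hypothetical names (`expensive_transform`, `calls`), not Spark code. In Spark itself, the RDD methods `cache()` and `persist()` play the role of the reused variable below.

```python
calls = {"count": 0}


def expensive_transform(values):
    """Stand-in for a costly stage whose output several later steps need."""
    calls["count"] += 1
    return [v * v for v in values]


data = [1, 2, 3, 4, 5]

# Without caching: every downstream step recomputes the intermediate stage.
total_uncached = sum(expensive_transform(data))
largest_uncached = max(expensive_transform(data))
calls_without_cache = calls["count"]

# With caching: compute once, keep the result in memory, and reuse it.
calls["count"] = 0
cached = expensive_transform(data)
total_cached = sum(cached)
largest_cached = max(cached)
calls_with_cache = calls["count"]

print(calls_without_cache, calls_with_cache)  # 2 1
```

The saving grows with the number of steps that reuse the same intermediate data, which is why iterative algorithms benefit most from in-memory storage.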
Conclusion
Apache Spark is a high-performance cluster computing platform that
extends the well-known MapReduce model to efficiently handle additional
kinds of computation, such as interactive queries and stream processing.
Spark also integrates closely with other big data tools, and this tight
integration enables applications that seamlessly combine several computing models.
Introduction to spark

Contenu connexe

Tendances

What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...Edureka!
 
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Edureka!
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Databricks
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architectureSohil Jain
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideWhizlabs
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveSachin Aggarwal
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesWalaa Hamdy Assy
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark InternalsPietro Michiardi
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to sparkDuyhai Doan
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 

Tendances (20)

Apache Spark 101
Apache Spark 101Apache Spark 101
Apache Spark 101
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
 
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
 
Spark core
Spark coreSpark core
Spark core
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
 
Apache spark
Apache sparkApache spark
Apache spark
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Apache Spark PDF
Apache Spark PDFApache Spark PDF
Apache Spark PDF
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 

Similaire à Introduction to spark

Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...Agile Testing Alliance
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionRUHULAMINHAZARIKA
 
Detailed guide to the Apache Spark Framework
Detailed guide to the Apache Spark FrameworkDetailed guide to the Apache Spark Framework
Detailed guide to the Apache Spark FrameworkAegis Software Canada
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsDr. Mirko Kämpf
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsDr. Mirko Kämpf
 
What is Apache spark
What is Apache sparkWhat is Apache spark
What is Apache sparkmanisha1110
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxRahul Borate
 
Apache spark installation [autosaved]
Apache spark installation [autosaved]Apache spark installation [autosaved]
Apache spark installation [autosaved]Shweta Patnaik
 
Exploiting Apache Spark's Potential Changing Enormous Information Investigati...
Exploiting Apache Spark's Potential Changing Enormous Information Investigati...Exploiting Apache Spark's Potential Changing Enormous Information Investigati...
Exploiting Apache Spark's Potential Changing Enormous Information Investigati...rajeshseo5
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014mahchiev
 

Similaire à Introduction to spark (20)

Apache spark
Apache sparkApache spark
Apache spark
 
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_Session
 
Started with-apache-spark
Started with-apache-sparkStarted with-apache-spark
Started with-apache-spark
 
SparkPaper
SparkPaperSparkPaper
SparkPaper
 
spark_v1_2
spark_v1_2spark_v1_2
spark_v1_2
 
Detailed guide to the Apache Spark Framework
Detailed guide to the Apache Spark FrameworkDetailed guide to the Apache Spark Framework
Detailed guide to the Apache Spark Framework
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific Applications
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific Applciations
 
What is Apache spark
What is Apache sparkWhat is Apache spark
What is Apache spark
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
 
Module01
 Module01 Module01
Module01
 
Spark_Part 1
Spark_Part 1Spark_Part 1
Spark_Part 1
 
Big data with java
Big data with javaBig data with java
Big data with java
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
 
Apache spark installation [autosaved]
Apache spark installation [autosaved]Apache spark installation [autosaved]
Apache spark installation [autosaved]
 
Spark 101
Spark 101Spark 101
Spark 101
 
Exploiting Apache Spark's Potential Changing Enormous Information Investigati...
Exploiting Apache Spark's Potential Changing Enormous Information Investigati...Exploiting Apache Spark's Potential Changing Enormous Information Investigati...
Exploiting Apache Spark's Potential Changing Enormous Information Investigati...
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014
 
Apache Spark Notes
Apache Spark NotesApache Spark Notes
Apache Spark Notes
 

Dernier

Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxDenish Jangid
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.MateoGardella
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...christianmathematics
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingTeacherCyreneCayanan
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docxPoojaSen20
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhikauryashika82
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxVishalSingh1417
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfChris Hunter
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.christianmathematics
 

Dernier (20)

Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writing
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 

Introduction to spark

  • 1. I n t r o d u c t i o n t o A p a c h e S p a r k P R E P A R E D B Y G N A G A R A J A N Introduction to Apache Spark
  • 2. O U T L I N E • What is Spark? • Why Spark important in business analytics? • Which Industries Make Use of Spark? • Limitations of Spark • Pros and Cons • Comparison between Spark and MapReduce • Conclusion
  • 3. W h a t i s S p a r k ? • Apache Spark is a lightning-fast cluster computing platform intended for high- performance computing. It is based on Hadoop MapReduce and extends the MapReduce paradigm to be used effectively for other kinds of calculations, such as interactive queries and stream processing. The primary feature of Spark is its in-memory cluster computing, which improves an application's processing performance. • Spark is intended to support various workloads, including batch applications, iterative algorithms, interactive queries, and streaming. It supports all of these workloads in a single system and lowers the administrative effort of maintaining different tools.
  • 4. W h y s p a r k i m p o r t a n t i n b u s i n e s s a n a l y t i c s ? Apache Spark, the unified analytics engine, has experienced fast adoption by businesses across a broad variety of sectors since its introduction. Internet behemoths like Netflix, Yahoo, and eBay have used Spark on a huge scale, processing several petabytes of data across clusters of over 8,000 nodes.
  • 5. W h e n i t c o m e s t o b u s i n e s s a n a l y t i c s , w h y i s s p a r k s o i m p o r t a n t ? 1. Spark enables use cases “traditional” Hadoop can’t handle As an extension of Hadoop MapReduce's batch model, Spark utilizes in- memory distributed computing to offer features like streaming processing, machine learning, graph computing, and interactive analytics that are not possible with the batch model. Because of this, new data science applications that were previously too costly or slow to run on large data sets are now available in the big data world. 2. Spark is fast Spark is orders of magnitude quicker than current Hadoop installations at running analytics. It results in improved interaction, experimentation speed, and analyst productivity.
  • 6. c o n t … 3. Spark can use your existing big data investment When Hadoop came along, businesses invested in new compute clusters to use the technology. That is not the case with Spark: it can be utilised on top of current Hadoop investments to implement new functionality rapidly. Additionally, Spark is very compatible with the Hadoop universe: it can access data stored in HDFS and operate on top of Hadoop 2.0's YARN. Spark is compatible with Cassandra and Amazon's S3 storage in addition to Hadoop. 4. Spark speaks SQL SQL is the de facto standard for structured data. Spark's SQL module enables incorporating current data sources, such as Hive, into computations and the extension of existing investments in business intelligence tools to big data. Spark SQL is still in its infancy compared to other large data SQL implementations, but it is gaining traction.
  • 7. 5. Spark is developer-friendly Never underestimate the power of easy-to-use technology. Despite being built on a new programming language, Scala, developers love how concise and fluid it is. The Hadoop language, Java, is supported, as is Python, the data scientist's favourite. 6. Open Source: Free to download plus large apache community support. 7. Fault Tolerant: Apache spark RDD is an immutable dataset, each spark 8. Supports processing variety of Data: Structured, semi-structured c o n t …
  • 8. W h i c h I n d u s t r i e s M a k e U s e o f S p a r k ? • Apache Spark, the unified analytics engine, has experienced fast adoption by businesses across a broad variety of sectors since its introduction. Internet behemoths like Netflix, Yahoo, and eBay have used Spark on a huge scale, processing several petabytes of data across clusters of over 8,000 nodes. • In the gaming sector, Apache Spark is used to detect patterns in real-time in- game events and react to them in order to harvest profitable economic possibilities such as targeted advertising, auto-adjustment of gaming levels depending on complexity, player retention, and many more.
  • 9. L i m i t a t i o n s o f S p a r k 1. No File Management system : There is no built-in file system for managing files in Spark. 2. No Support for Real-Time Processing: Spark does not support complete Real- Time Processing. 3. Manual Optimization :In Spark, the task must be optimized manually. It is sufficient for some datasets. If we wish to create partitions, we may do it by manually creating multiple spark partitions. To choose independently, we must provide a number as the second argument to parallelize. 4. Less number of Algorithms: There are less algorithms in Apache Spark Machine Learning Spark MLlib. It falls behind a number of available algorithms. As an example, consider the Tanimoto distance..
  • 10. Features of Spark
    Apache Spark has the following features.
    • Speed − Spark can run an application on a Hadoop cluster up to 100 times faster in memory and 10 times faster on disk. It achieves this by storing intermediate processing data in memory, which reduces the number of read/write operations to disk.
    • Supports multiple languages − Spark provides built-in APIs in Java, Scala, and Python, so you can write applications in different languages. Spark also comes with 80 high-level operators for interactive querying.
    • Advanced analytics − Spark supports more than ‘map’ and ‘reduce’. It also supports SQL queries, streaming data, machine learning (ML), and graph algorithms.
  • 11. Cont…
    • Stream processing: Spark supports real-time stream processing. The earlier MapReduce framework could process only data that already existed.
    • Lazy evaluation: Transformations on Spark RDDs are lazy. They do not produce results right away; instead, each one defines a new RDD from an existing RDD, and nothing is computed until an action requests a result. This lets Spark avoid unnecessary work and improves efficiency.
    • Supports multiple languages: Spark supports R, Scala, Python, and Java, which adds flexibility and overcomes Hadoop's limitation of developing applications only in Java.
    • Hadoop integration: Spark also supports the Hadoop YARN cluster manager, which makes it flexible to deploy.
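The lazy-evaluation point on this slide can be illustrated with a toy model in plain Python. The sketch below is not Spark's API or implementation: `LazyDataset` is a hypothetical class, written only to show that chaining `map` and `filter` records the steps without running them, and that work happens only when an action (here `collect`) is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on collect()."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: returns a new dataset, computes nothing yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: returns a new dataset, computes nothing yet.
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: only now are the recorded steps applied to the data.
        results = []
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                results.append(item)
        return results


calls = []  # records every time square() actually runs


def square(x):
    calls.append(x)
    return x * x


numbers = LazyDataset(range(1, 6))
pipeline = numbers.map(square).filter(lambda x: x % 2 == 1)

calls_before_action = len(calls)  # square() has not run yet
result = pipeline.collect()       # the action triggers the computation
print(calls_before_action, result)
```

Each transformation returns a new object and leaves the original untouched, which mirrors the immutability of RDDs mentioned earlier in the deck.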
  • 12. Cont…
    • Supports Spark GraphX for graph-parallel execution, Spark SQL, machine learning libraries, and more.
    • Cost efficiency: Apache Spark is considered more cost-efficient than Hadoop, because Hadoop requires large amounts of storage and data-center capacity for data processing and replication.
    • Active developer community: Apache Spark has a large base of developers involved in its continuous development. It is considered one of the most important projects undertaken by the Apache community.
  • 13. Pros and Cons in Spark
    Pros | Cons
    Speed | No automatic optimization process
    Ease of use | No file management system
    Advanced analytics | Fewer algorithms
    Dynamic in nature | Small files issue
    Multilingual | Window criteria
    Apache Spark is powerful | Not suited to a multi-user environment
    Increased access to big data | -
    Demand for Spark developers | -
  • 14. Comparison between Spark and MapReduce
    Apache Spark | MapReduce
    Spark processes data in batches as well as in real time. | MapReduce processes data in batches only.
    Spark can run up to 100 times faster than Hadoop MapReduce. | Hadoop MapReduce is slower for large-scale data processing.
    Spark stores data in RAM (in memory), so it is quicker to retrieve. | Hadoop MapReduce stores data in HDFS, so retrieving it takes longer.
    Spark provides caching and in-memory data storage. | Hadoop is highly disk-dependent.
  • 15. Conclusion
    Apache Spark is a high-performance cluster computing platform that extends the well-known MapReduce paradigm to handle additional kinds of computation efficiently, such as interactive queries and stream processing. Spark's tight integration with other big data tools enables applications that smoothly combine several computing models.