The Apache Spark project (spark.apache.org) is the next generation of Big Data processing systems. It uses a new architecture and in-memory processing, which can improve performance by orders of magnitude. Some would call it the successor to the Hadoop set of tools. Hadoop is a batch-mode Big Data processor that depends on disk-based files. Spark improves on this by supporting real-time and interactive processing in addition to batch processing.
Table of contents:
1. The Big Data triangle
2. Hadoop stack and its limitations
3. Spark: An Overview
3.a. Spark Streaming
3.b. GraphX: Graph processing
3.c. MLlib: Machine Learning
4. Performance characteristics of Spark
Apache Spark: The Next Gen toolset for Big Data Processing
1. Prajod Vettiyattil
Architect, Open source
Wipro
in.linkedin.com/in/prajod
@prajods
Namitha M S
Architect, Advanced Technologies
Wipro
in.linkedin.com/in/namithams
Open Source India
Nov 2014
Bangalore
2. Agenda
•Big Data
•Hadoop stack and its limitations
•Spark: An overview
•Streaming, GraphX and MLlib
•Performance characteristics of Spark
3. Big Data
•Data too huge for normal systems
•3 Vs: Volume, Variety, Velocity
•Storage challenge
•Analysis challenge
•Query results take hours, days or months
Data disks
4. The Big Data Analysis Triad
Batch
Interactive
Streaming
5. The Hadoop stack
•Distributed data processing
•Fault tolerant
•Process petabyte data sets
•Ecosystem tools
•Hive, HBase
•Pig
•Storm
•Hadoop
•Map
•Reduce
•Shuffle, partition, sort
•HDFS
6. Hadoop: Data flow
•Map side: Input data files -> Map -> Buffer in memory -> Partition for target reducers -> Sort each partition by key -> Potential spill to disk -> Merge all partitions and write to disk
•Reduce side: HTTP fetch from map node -> Merge round 1, Merge round 2, … Merge round N -> Merge sort -> Reduce -> Output
•High disk I/O
•On Map nodes
•On Reduce nodes
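The data flow above can be sketched in plain Python as a word count. This is a minimal single-process sketch, not Hadoop code: `partition_for`, `map_task`, `reduce_task`, `word_count` and `NUM_REDUCERS` are hypothetical names chosen for illustration, and in-memory lists stand in for the spill files and the HTTP fetch.

```python
from collections import defaultdict
from heapq import merge
from itertools import groupby
from operator import itemgetter

NUM_REDUCERS = 2  # hypothetical number of reducers for this sketch


def partition_for(key, num_reducers=NUM_REDUCERS):
    # Deterministic stand-in for a hash partitioner.
    return sum(key.encode("utf-8")) % num_reducers


def map_task(lines):
    """Map side: emit (word, 1), partition for target reducers, sort each partition by key."""
    partitions = defaultdict(list)
    for line in lines:
        for word in line.split():
            partitions[partition_for(word)].append((word, 1))
    # On a real cluster these sorted runs are spilled to and merged on local disk.
    return {p: sorted(pairs) for p, pairs in partitions.items()}


def reduce_task(partition_id, map_outputs):
    """Reduce side: fetch one partition from every map output, merge, then reduce."""
    runs = [output.get(partition_id, []) for output in map_outputs]
    merged = merge(*runs)  # k-way merge of runs that are already sorted
    return {word: sum(count for _, count in group)
            for word, group in groupby(merged, key=itemgetter(0))}


def word_count(input_splits):
    map_outputs = [map_task(split) for split in input_splits]
    result = {}
    for partition_id in range(NUM_REDUCERS):
        result.update(reduce_task(partition_id, map_outputs))
    return result
```

Every arrow in the slide that touches disk (spill, merge, fetch) is a plain list operation here, which is where the high disk I/O on map and reduce nodes comes from in a real job.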
7. Limitations of Hadoop
•Batch mode
•Only the batch layer in the Lambda pattern
•No real time
•Poor fit for repetitive queries
•Iterative algorithms
•Interactive data querying
•Poor support for distributed memory
8. Spark: An overview
•“Over time, fewer projects will use MapReduce, and more will use Spark”
•Doug Cutting, creator of Hadoop
•New architecture: scale better and simplify
•In memory processing for Big Data
•Cached intermediate data sets
•Multi-step DAG based execution
•Resilient Distributed Datasets (RDDs)
•The core innovation in Spark
9. Spark Ecosystem tools
Apache Spark
Spark SQL
Streaming
MLlib
GraphX
SparkR
BlinkDB
Shark
Bagel
10. DAG Execution Engine
Map
Collect
Filter
Map
Reduce
Sort
Collect
DAG = Directed Acyclic Graph
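A multi-step plan of this kind can be sketched as a dataset that only records its steps and runs them when results are collected. This is a toy model in plain Python, not the Spark API: `LazyDataset` and its methods are hypothetical names, and a straight chain of steps is used as the simplest case of a DAG.

```python
class LazyDataset:
    """Toy dataset that records transformations and runs them only on collect()."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)  # recorded plan: a linear chain of steps

    def map(self, fn):
        # Nothing is computed here; the step is only appended to the plan.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def plan(self):
        return [name for name, _ in self._steps]

    def collect(self):
        # The whole chain is executed in one pass over the source data.
        data = iter(self._source)
        for name, fn in self._steps:
            data = map(fn, data) if name == "map" else filter(fn, data)
        return list(data)
```

For example, `LazyDataset(range(6)).filter(lambda x: x % 2 == 0).map(lambda x: x * 10)` does no work until `collect()` is called on it.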
11. RDD
•Resilient Distributed Datasets
•Features
•Read only
•Fault tolerance without replication
•Uses data lineage for recovery
•Low network I/O
•Partitions/Slices
•Parallel tasks
[Figure: Disk -> Transform 1 -> RDD 1 -> Transform 2 -> RDD 2; each RDD is split into data partitions]
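Recovery through data lineage, rather than replication, can be sketched as follows. This is a toy model in plain Python, not Spark's implementation: `ToyRDD`, `from_partitions` and `lose_partition` are hypothetical names, and a dictionary stands in for partitions held in memory.

```python
class ToyRDD:
    """Read-only, partitioned dataset that remembers how it was derived."""

    def __init__(self, num_partitions, compute, parent=None):
        self.num_partitions = num_partitions
        self._compute = compute  # function: partition index -> list of records
        self.parent = parent     # lineage pointer to the dataset this one came from
        self._cache = {}         # partitions currently held in memory

    @classmethod
    def from_partitions(cls, partitions):
        stored = [tuple(p) for p in partitions]  # stands in for data on disk
        return cls(len(stored), lambda i: list(stored[i]))

    def map(self, fn):
        # The child stores only the recipe: apply fn to the parent's partition i.
        return ToyRDD(self.num_partitions,
                      lambda i: [fn(x) for x in self.partition(i)],
                      parent=self)

    def partition(self, i):
        if i not in self._cache:
            # Missing (never computed or lost): rebuild from lineage.
            self._cache[i] = self._compute(i)
        return list(self._cache[i])  # hand out a copy to keep the data read only

    def lose_partition(self, i):
        # Simulates a node failure that drops one in-memory partition.
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]
```

Only the lost partition is recomputed, and each partition can be rebuilt independently, which is what allows recovery to run as parallel tasks.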
15. Why Spark Streaming
•Near real time processing (0.5 – 2 sec latency)
•Parallel recovery of lost nodes and stragglers
•Implementation of Lambda architecture
•Single engine for batch and stream
•Not suited for low-latency requirements (e.g., 100 ms)
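The latency floor follows from micro-batching: records are grouped by arrival time and each group is processed as a small batch job. A minimal sketch in plain Python, assuming events arrive as `(timestamp_seconds, value)` pairs; `micro_batches` and `process_stream` are hypothetical names, not Spark Streaming calls.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp_seconds, value) events into fixed-width time windows."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batch_index = int(timestamp // batch_interval)
        batches[batch_index].append(value)
    # Windows with no events are skipped in this sketch.
    return [batches[i] for i in sorted(batches)]


def process_stream(events, batch_interval, batch_fn):
    """Apply the same batch function to every micro-batch."""
    return [batch_fn(batch) for batch in micro_batches(events, batch_interval)]
```

A record can wait up to one full batch interval before it is processed, so the interval sets a lower bound on end-to-end latency. The same `batch_fn` could equally be run over a historical data set, which is the single-engine point made above.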
16. Apache Storm vs Spark Streaming
Feature | Spark Streaming | Storm
Processing model | Micro-batching | Event stream processing
Message delivery options | Inherently fault tolerant, exactly once delivery | At least once, at most once, exactly once
Flexibility | Coarse-grained transformation | Fine-grained transformation
Implemented in | Scala | Clojure
Development cost | Common platform for both batch and stream | Only stream; separate setup for batch
Applicability | Machine learning, interactive analytics, near real time analytics | Near real time analytics, natural language processing
17. GraphX & MLlib
• Data-parallel vs graph-parallel processing
• Wikipedia search vs Facebook connection search, PageRank
• Spark MLlib implements high-quality machine learning algorithms
• Iterative algorithm paradigm
• Leverages Spark's in-memory data sets
$x^{(t+1)} = f(x^{(t)})$
[Figure: x(t) -> f(x(t)) -> x(t+1) -> f(x(t+1))]
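The iterative paradigm, where each step feeds its output back in as the next input, can be sketched with PageRank as the function f. This is a plain-Python illustration, not GraphX or MLlib code: `iterate` and `pagerank_step` are hypothetical names, the damping factor 0.85 is a conventional choice rather than something stated in the slides, and every page is assumed to have at least one outgoing link.

```python
def iterate(f, x0, steps):
    """Apply x(t+1) = f(x(t)) for a fixed number of steps, starting from x0."""
    x = x0
    for _ in range(steps):
        x = f(x)
    return x


def pagerank_step(links, damping=0.85):
    """Build f for PageRank. links maps each page to its list of outgoing links."""
    n = len(links)

    def f(ranks):
        # Every page starts with the same base share of rank.
        new_ranks = {page: (1.0 - damping) / n for page in links}
        for page, outgoing in links.items():
            share = ranks[page] / len(outgoing)  # assumes outgoing is non-empty
            for target in outgoing:
                new_ranks[target] += damping * share
        return new_ranks

    return f
```

Usage: `iterate(pagerank_step(links), {page: 1 / len(links) for page in links}, 50)`. Each step reads the whole previous state, so keeping that state in memory between steps, instead of writing it to disk after every pass, is where in-memory data sets help.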
22. Summary
Apache Spark
•New architecture
•RDD, DAG
•In-memory processing
•MapReduce and more
•GraphX
•MLlib
•Spark Streaming
Ecosystem tools
•SparkR
•BlinkDB
•Storm
Spark performance
•GBs per second
•RAM to data size
•Inflexion point
23. Questions