This document discusses Resilient Distributed Datasets (RDDs), a fault-tolerant abstraction in Apache Spark for in-memory cluster computing. RDDs allow data to be reused across computations and support transformations such as map, filter, and join. RDDs can be created from stable storage or from other RDDs, and Spark computes them lazily for efficiency. The document shows how RDDs can express programming models such as MapReduce, SQL queries, and graph processing. Benchmarks show Spark is up to 20x faster than Hadoop for iterative algorithms because RDDs enable in-memory data reuse across jobs.
Apache Spark
1. Resilient Distributed Datasets: A Fault-Tolerant
Abstraction for In-Memory Cluster Computing
Mahdi Esmail oghli
Dr. Bagheri
Amirkabir University of Technology
SPARK
http://BigData.ceit.aut.ac.ir
7. Problems with current computing frameworks
Especially MapReduce
They provide an abstraction for accessing
a cluster's computational resources,
but lack an abstraction for distributed
memory.
8. Problems with current computing frameworks
Especially MapReduce
The lack of a distributed-memory abstraction
makes them inefficient for applications that
reuse intermediate results across multiple
computations.
9. SPARK Motivation
Problems with current computing
frameworks (e.g., MapReduce) show up in:
Iterative algorithms
Interactive data mining tools
10. Data reuse examples
Iterative machine learning and graph
algorithms:
PageRank
K-means clustering
Logistic regression
11. Data reuse examples
Interactive data mining (running
multiple queries on the same subset
of the data):
Statistical queries
Fraud detection
Stream queries
12. Current Solution
The only way to reuse data between
computations in current frameworks is to
write it to an external stable storage
system (e.g., a distributed file system).
17. RDD
A read-only, partitioned collection of
records.
Can be created only from data in stable
storage or from other RDDs (using
transformations).
Users can control persistence and
partitioning.
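The slide above can be made concrete with a rough sketch in plain Python. This is a hypothetical toy model, not Spark's actual implementation or API: a `MiniRDD` is a read-only set of partitions, transformations like `map` and `filter` build new RDDs that remember their parent (the lineage), and nothing is computed until an action such as `collect` is called.

```python
class MiniRDD:
    """Toy model of an RDD: read-only partitions plus a lazy transformation chain."""

    def __init__(self, partitions, transform=None, parent=None):
        self._partitions = partitions  # only set for source RDDs
        self._transform = transform    # per-record function, applied lazily
        self._parent = parent          # lineage: the RDD this one was derived from

    @staticmethod
    def from_storage(records, num_partitions=2):
        # Simulate creating an RDD from stable storage by splitting
        # the input into fixed-size partitions.
        size = max(1, len(records) // num_partitions)
        parts = [records[i:i + size] for i in range(0, len(records), size)]
        return MiniRDD(parts)

    def map(self, f):
        # Transformations return a new RDD; nothing is computed yet.
        return MiniRDD(None, transform=lambda rec: [f(rec)], parent=self)

    def filter(self, pred):
        return MiniRDD(None, transform=lambda rec: [rec] if pred(rec) else [],
                       parent=self)

    def collect(self):
        # An action forces evaluation of the whole lineage chain.
        if self._parent is None:
            return [r for part in self._partitions for r in part]
        return [out for rec in self._parent.collect()
                for out in self._transform(rec)]


rdd = MiniRDD.from_storage([1, 2, 3, 4, 5, 6])
evens_squared = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(evens_squared.collect())  # [4, 16, 36]
```

Defining `evens_squared` does no work at all; only the final `collect` walks the lineage and produces results, mirroring the lazy evaluation described above.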
18. RDD
RDDs enable efficient data reuse: they are
parallel data structures that let users
explicitly persist intermediate results
in memory and reuse them across
computations on large clusters in a
fault-tolerant manner.
19. Current fault-tolerance approaches
Replicating data across machines
Logging updates across machines
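In contrast to the replication and log-shipping approaches above, RDDs recover lost data by replaying lineage: each partition records how it was derived, so only the lost partition is recomputed. A hypothetical sketch of that idea (plain Python, not the Spark implementation):

```python
# Lineage-based recovery sketch: instead of replicating data, record
# how each partition is derived and recompute only what was lost.

def load_partition(i):
    # Stand-in for reading partition i of a source dataset from stable storage.
    return list(range(i * 3, i * 3 + 3))

# Lineage: each derived partition is "apply f to source partition i".
lineage = {i: (load_partition, i, lambda x: x * 10) for i in range(3)}

def compute_partition(i):
    load, src, f = lineage[i]
    return [f(x) for x in load(src)]

# A normal run materializes all partitions.
partitions = {i: compute_partition(i) for i in range(3)}

# Simulate losing partition 1 when a node fails ...
del partitions[1]

# ... and recover it by replaying its lineage, touching no other partition
# and storing no replica in advance.
partitions[1] = compute_partition(1)
print(partitions[1])  # [30, 40, 50]
```

The trade-off versus replication is that recovery costs recomputation time, but the common case pays no bandwidth or memory for copies.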
28. The persist Function
Indicates which RDDs we want to
reuse in future actions.
Other persistence strategies include:
Storing the RDD only on disk
Replicating it across machines
Users can also set persistence priorities
on RDDs.
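The effect of persisting can be sketched in plain Python (again a toy model, not Spark's `persist()` API): an unpersisted dataset is recomputed on every action, while a persisted one is computed once and then served from memory.

```python
class LazyDataset:
    """Toy model of persist(): cache results in memory only when asked to."""

    def __init__(self, compute):
        self._compute = compute  # function that produces the records
        self._cache = None
        self._persisted = False
        self.compute_count = 0   # how many times we actually recomputed

    def persist(self):
        # Mark this dataset for in-memory reuse across future actions.
        self._persisted = True
        return self

    def collect(self):
        if self._persisted and self._cache is not None:
            return self._cache           # served from memory, no recompute
        self.compute_count += 1
        result = self._compute()
        if self._persisted:
            self._cache = result
        return result


ds = LazyDataset(lambda: [x * x for x in range(5)])
ds.collect(); ds.collect()
print(ds.compute_count)  # 2: recomputed for every action

ds.persist()
ds.collect(); ds.collect()
print(ds.compute_count)  # 3: computed once more, then served from memory
```

The other strategies on the slide (disk-only storage, replication, priorities) would correspond to different policies inside `collect` for where the cached result lives and when it may be evicted.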
30. What benchmarks
show about SPARK
Up to 20x faster than
Hadoop for iterative
applications (100 GB of
data on 100 nodes)
It can scan a 1 TB
dataset with 5-7 s
latency
38. References
Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified
data processing on large clusters." Communications of the
ACM 51.1 (2008): 107-113.
Zaharia, Matei, et al. "Resilient distributed datasets: A
fault-tolerant abstraction for in-memory cluster computing."
Proceedings of the 9th USENIX Conference on Networked
Systems Design and Implementation. USENIX Association,
2012.
http://Spark.apache.org
https://databricks.com