This document discusses Resilient Distributed Datasets (RDDs), which provide a fault-tolerant abstraction for in-memory cluster computing. RDDs allow data to be partitioned across a cluster and cached in memory for efficient reuse across jobs. The Spark framework exposes the RDD API and uses lineage graphs to recover lost data partitions. Experiments show Spark can be 20x faster than Hadoop for iterative jobs by avoiding serialization and reducing disk I/O through in-memory caching of RDDs.
Motivation
How to perform large-scale data analytics?
● MapReduce and Dryad scale well, but they offer no abstraction for general reuse of intermediate results: data must be written to a distributed file system between jobs, which incurs significant overhead.
● Specialized frameworks such as Pregel support specific reuse patterns (iterative graph computation), but not general reuse.
● How to provide fault tolerance efficiently? Shared-memory abstractions and key-value stores (e.g., Piccolo) allow fine-grained updates, which make fault tolerance expensive (they require replication or logging).
RDDs Overview
● A read-only, partitioned collection of records.
● Created through transformations on data in stable storage or on other RDDs.
● Carries information on the lineage of transformations used to build it.
● Gives users control over partitioning and persistence (e.g., non-serialized in-memory storage).
Spark
Spark exposes RDDs through a language-integrated API.
● Transformations (e.g., map, filter) define new RDDs lazily.
● Actions (e.g., count, collect, save) return a value or export data to a storage system.
● The persist method indicates which RDDs should be kept for reuse (by default, stored in memory).
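As a toy illustration of this API style (a minimal single-process sketch in plain Python, not Spark itself; the class and all names are invented for the example), transformations can be recorded lazily and only evaluated when an action runs:

```python
class ToyRDD:
    """Toy single-process model of a lazy, immutable dataset.

    Transformations (map, filter) only record work to do; actions
    (collect, count) force evaluation. persist() caches the records
    in memory so later actions can reuse them.
    """

    def __init__(self, compute):
        self._compute = compute   # zero-argument function producing the records
        self._persist = False     # whether to keep results in memory
        self._cache = None        # filled on the first action if persisted

    @staticmethod
    def from_list(records):
        return ToyRDD(lambda: list(records))

    def map(self, f):             # transformation: lazy, returns a new dataset
        return ToyRDD(lambda: [f(x) for x in self._materialize()])

    def filter(self, pred):       # transformation: lazy, returns a new dataset
        return ToyRDD(lambda: [x for x in self._materialize() if pred(x)])

    def persist(self):            # mark this dataset for in-memory reuse
        self._persist = True
        return self

    def _materialize(self):
        if self._cache is not None:
            return self._cache
        records = self._compute()
        if self._persist:
            self._cache = records
        return records

    def collect(self):            # action: returns the records
        return list(self._materialize())

    def count(self):              # action: returns the number of records
        return len(self._materialize())


# Usage: build a lazy pipeline, persist the filtered set, reuse it twice.
nums = ToyRDD.from_list(range(10))
evens = nums.filter(lambda x: x % 2 == 0).persist()
even_count = evens.count()                       # first action: computes and caches
squares = evens.map(lambda x: x * x).collect()   # reuses the cached records
```

Note how nothing is computed until `count()` runs; the second action reuses the cached records instead of re-reading the source, which is the behavior the persist method enables.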
Data Sharing in MapReduce
Data sharing between MapReduce jobs goes through stable storage, incurring overhead: replication, serialization, and disk I/O.
Example - Log Mining
Load error messages from a log into memory, then interactively search for patterns.
Scanning 1 TB of in-memory data takes 5-7 s, versus about 170 s for on-disk data.
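The pattern above can be sketched in plain Python (a toy stand-in with made-up log lines, not Spark's API): load the log once, keep the filtered errors in memory, and run several cheap queries against them.

```python
# Hypothetical log lines standing in for a large log in stable storage.
log_lines = [
    "INFO starting job",
    "ERROR mysql: connection refused",
    "WARN disk nearly full",
    "ERROR hdfs: block missing",
]

# Load once: keep only the error messages, held in memory for reuse.
errors = [line for line in log_lines if line.startswith("ERROR")]

# Repeated interactive queries reuse the in-memory result instead of
# rescanning the full log on disk each time.
error_count = len(errors)
mysql_errors = [line for line in errors if "mysql" in line]
```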
Fault Tolerance
RDDs keep track of the transformations used to build them. When a partition is lost, this lineage is used to recompute it from its parents, avoiding costly data replication.
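A rough sketch of lineage-based recovery (a toy single-process model with invented data, not Spark's scheduler): because the transformation is recorded, a lost partition can be recomputed on its own.

```python
# Three source partitions in "stable storage" (invented data).
source = [list(range(0, 5)), list(range(5, 10)), list(range(10, 15))]


def transform(x):       # the recorded transformation (the lineage)
    return x * 2


# Derive a new dataset by applying the transformation per partition.
derived = [[transform(x) for x in part] for part in source]

# Simulate losing partition 1 (e.g., the machine holding it failed).
derived[1] = None

# Recovery: recompute only the lost partition from its parent,
# using the recorded transformation; the other partitions are untouched.
derived[1] = [transform(x) for x in source[1]]
```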
Example - Logistic Regression
The input data is loaded into memory once, then repeated MapReduce-style steps compute the gradient. Many machine learning algorithms are iterative in nature because they run iterative optimization procedures, so they benefit greatly from in-memory reuse.
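A minimal single-machine sketch of this pattern (plain Python over a made-up 1-D dataset, not the actual Spark program): the data stays in memory while the gradient step is repeated.

```python
import math

# Hypothetical tiny 1-D dataset, loaded into memory once and reused
# across every iteration: x < 2 is labeled 0, x > 2 is labeled 1.
points = [(0.0, 0), (1.0, 0), (1.5, 0), (2.5, 1), (3.0, 1), (4.0, 1)]

w, b = 0.0, 0.0        # model parameters
lr = 0.5               # learning rate

for _ in range(200):   # repeated MapReduce-style gradient steps
    grad_w = grad_b = 0.0
    for x, y in points:                               # "map" over cached data
        p = 1.0 / (1.0 + math.exp(-(w * x + b)))      # sigmoid prediction
        grad_w += (p - y) * x                         # "reduce": sum gradients
        grad_b += (p - y)
    w -= lr * grad_w / len(points)
    b -= lr * grad_b / len(points)


def predict(x):
    return 1.0 / (1.0 + math.exp(-(w * x + b))) > 0.5
```

In Hadoop, each of those 200 iterations would re-read the input from disk; keeping `points` in memory is exactly what makes the iterative case fast.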
Representing RDDs
● Narrow dependencies: each child partition depends on a small number of parent partitions, allowing pipelined execution.
● Wide dependencies: a child partition may require data from all parent partitions, so they are harder to recover after a failure.
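The two dependency kinds can be contrasted in a toy sketch (plain Python, invented records; not Spark internals): a per-partition map touches only its own parent partition, while a group-by-key must gather records from all of them.

```python
# Two parent partitions of (key, value) records (made-up data).
partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("b", 4)],
]

# Narrow dependency: map each partition independently. Output partition i
# needs only input partition i, so the stages can be pipelined per node.
doubled = [[(k, v * 2) for k, v in part] for part in partitions]

# Wide dependency: group-by-key needs records from *all* parent
# partitions, forcing a shuffle before any result partition is complete.
grouped = {}
for part in doubled:
    for k, v in part:
        grouped.setdefault(k, []).append(v)
```

This is also why wide dependencies are harder to recover: rebuilding one lost output partition of `grouped` requires recomputing every parent partition, not just one.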
Evaluation - Iteration Times
(Figure: per-iteration times. Annotations: one workload is computation-intensive; Hadoop's overheads include an extra MapReduce job to convert data to binary and its heartbeat protocol.)
Evaluation - Number of Machines
(Figure: iteration times as the number of machines scales; annotated Spark speedups of 1.9x, 3.2x, 20.7x, and 25.3x over the Hadoop variants.)
Conclusion
Spark is up to 20x faster than Hadoop for iterative applications, by avoiding disk I/O and serialization overhead.
It can interactively scan 1 TB of data with 5-7 s latency.
Recovery is quick: lost RDD partitions are rebuilt from lineage.
Specialized models such as Pregel and HaLoop can be built on top of Spark.
RDDs are well suited to batch applications that apply the same operation to all elements of a dataset.
References
● Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (Zaharia et al., NSDI 2012)
● SlideShare: Hadoop Summit, "Spark and Shark"