This document discusses Resilient Distributed Datasets (RDDs), which provide a fault-tolerant abstraction for in-memory cluster computing. RDDs allow data to be partitioned across a cluster and cached in memory for efficient reuse across jobs. The Spark framework exposes the RDD API and uses lineage graphs to recover lost data partitions. Experiments show Spark can be 20x faster than Hadoop for iterative jobs by avoiding serialization and reducing disk I/O through in-memory caching of RDDs.
Motivation
How to perform large-scale data analytics?
● MapReduce and Dryad scale well, but they offer no abstraction for general reuse of intermediate results: data must be written to a distributed file system between jobs, which incurs significant overhead.
● Specialized frameworks such as Pregel support specific reuse patterns (iterative graph computation), but not general reuse.
● How to provide fault tolerance efficiently? Shared-memory abstractions and key-value stores (e.g., Piccolo) allow fine-grained updates, which make fault tolerance expensive (they require replication or logging).
RDDs Overview
● A read-only, partitioned collection of records.
● Created through transformations on data in stable storage or on other RDDs.
● Carries information on the lineage of transformations used to build it.
● Gives users control over partitioning and persistence (e.g., non-serialized in-memory storage).
Spark
Spark exposes RDDs through a language-integrated API.
● Transformations (e.g., map, filter) define new RDDs lazily.
● Actions (e.g., count, collect, save) return a value or export data to a storage system.
● The persist method indicates which RDDs should be kept for reuse (by default, stored in memory).
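As a toy illustration of this API style (a minimal single-process sketch in plain Python, not Spark itself; the class and all names are invented for the example), transformations can be recorded lazily and only evaluated when an action runs:

```python
class ToyRDD:
    """Toy single-process model of a lazy, immutable dataset.

    Transformations (map, filter) only record work to do; actions
    (collect, count) force evaluation. persist() caches the records
    in memory so later actions can reuse them.
    """

    def __init__(self, compute):
        self._compute = compute   # zero-argument function producing the records
        self._persist = False     # whether to keep results in memory
        self._cache = None        # filled on the first action if persisted

    @staticmethod
    def from_list(records):
        return ToyRDD(lambda: list(records))

    def map(self, f):             # transformation: lazy, returns a new dataset
        return ToyRDD(lambda: [f(x) for x in self._materialize()])

    def filter(self, pred):       # transformation: lazy, returns a new dataset
        return ToyRDD(lambda: [x for x in self._materialize() if pred(x)])

    def persist(self):            # mark this dataset for in-memory reuse
        self._persist = True
        return self

    def _materialize(self):
        if self._cache is not None:
            return self._cache
        records = self._compute()
        if self._persist:
            self._cache = records
        return records

    def collect(self):            # action: returns the records
        return list(self._materialize())

    def count(self):              # action: returns the number of records
        return len(self._materialize())


# Usage: build a lazy pipeline, persist the filtered set, reuse it twice.
nums = ToyRDD.from_list(range(10))
evens = nums.filter(lambda x: x % 2 == 0).persist()
even_count = evens.count()                       # first action: computes and caches
squares = evens.map(lambda x: x * x).collect()   # reuses the cached records
```

Note how nothing is computed until `count()` runs; the second action reuses the cached records instead of re-reading the source, which is the behavior the persist method enables.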
Data Sharing in MapReduce
Data sharing between MapReduce jobs goes through stable storage, incurring overhead: replication, serialization, and disk I/O.
Example - Log Mining
Load error messages from a log into memory, then interactively search for patterns.
Scanning 1 TB of in-memory data takes 5-7 s, versus about 170 s for on-disk data.
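The pattern above can be sketched in plain Python (a toy stand-in with made-up log lines, not Spark's API): load the log once, keep the filtered errors in memory, and run several cheap queries against them.

```python
# Hypothetical log lines standing in for a large log in stable storage.
log_lines = [
    "INFO starting job",
    "ERROR mysql: connection refused",
    "WARN disk nearly full",
    "ERROR hdfs: block missing",
]

# Load once: keep only the error messages, held in memory for reuse.
errors = [line for line in log_lines if line.startswith("ERROR")]

# Repeated interactive queries reuse the in-memory result instead of
# rescanning the full log on disk each time.
error_count = len(errors)
mysql_errors = [line for line in errors if "mysql" in line]
```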
Fault Tolerance
RDDs keep track of the transformations used to build them. When a partition is lost, this lineage is used to recompute it from its parents, avoiding costly data replication.
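A rough sketch of lineage-based recovery (a toy single-process model with invented data, not Spark's scheduler): because the transformation is recorded, a lost partition can be recomputed on its own.

```python
# Three source partitions in "stable storage" (invented data).
source = [list(range(0, 5)), list(range(5, 10)), list(range(10, 15))]


def transform(x):       # the recorded transformation (the lineage)
    return x * 2


# Derive a new dataset by applying the transformation per partition.
derived = [[transform(x) for x in part] for part in source]

# Simulate losing partition 1 (e.g., the machine holding it failed).
derived[1] = None

# Recovery: recompute only the lost partition from its parent,
# using the recorded transformation; the other partitions are untouched.
derived[1] = [transform(x) for x in source[1]]
```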
Example - Logistic Regression
The input data is loaded into memory once, then repeated MapReduce-style steps compute the gradient. Many machine learning algorithms are iterative in nature because they run iterative optimization procedures, so they benefit greatly from in-memory reuse.
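A minimal single-machine sketch of this pattern (plain Python over a made-up 1-D dataset, not the actual Spark program): the data stays in memory while the gradient step is repeated.

```python
import math

# Hypothetical tiny 1-D dataset, loaded into memory once and reused
# across every iteration: x < 2 is labeled 0, x > 2 is labeled 1.
points = [(0.0, 0), (1.0, 0), (1.5, 0), (2.5, 1), (3.0, 1), (4.0, 1)]

w, b = 0.0, 0.0        # model parameters
lr = 0.5               # learning rate

for _ in range(200):   # repeated MapReduce-style gradient steps
    grad_w = grad_b = 0.0
    for x, y in points:                               # "map" over cached data
        p = 1.0 / (1.0 + math.exp(-(w * x + b)))      # sigmoid prediction
        grad_w += (p - y) * x                         # "reduce": sum gradients
        grad_b += (p - y)
    w -= lr * grad_w / len(points)
    b -= lr * grad_b / len(points)


def predict(x):
    return 1.0 / (1.0 + math.exp(-(w * x + b))) > 0.5
```

In Hadoop, each of those 200 iterations would re-read the input from disk; keeping `points` in memory is exactly what makes the iterative case fast.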
Representing RDDs
● Narrow dependencies: each child partition depends on a small number of parent partitions, allowing pipelined execution.
● Wide dependencies: a child partition may require data from all parent partitions, so they are harder to recover after a failure.
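The two dependency kinds can be contrasted in a toy sketch (plain Python, invented records; not Spark internals): a per-partition map touches only its own parent partition, while a group-by-key must gather records from all of them.

```python
# Two parent partitions of (key, value) records (made-up data).
partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("b", 4)],
]

# Narrow dependency: map each partition independently. Output partition i
# needs only input partition i, so the stages can be pipelined per node.
doubled = [[(k, v * 2) for k, v in part] for part in partitions]

# Wide dependency: group-by-key needs records from *all* parent
# partitions, forcing a shuffle before any result partition is complete.
grouped = {}
for part in doubled:
    for k, v in part:
        grouped.setdefault(k, []).append(v)
```

This is also why wide dependencies are harder to recover: rebuilding one lost output partition of `grouped` requires recomputing every parent partition, not just one.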
Evaluation - Iteration Times
(Figure: per-iteration times. Annotations: one workload is computation-intensive; Hadoop's overheads include an extra MapReduce job to convert data to binary and its heartbeat protocol.)
Evaluation - Number of Machines
(Figure: iteration times as the number of machines scales; annotated Spark speedups of 1.9x, 3.2x, 20.7x, and 25.3x over the Hadoop variants.)
Conclusion
Spark is up to 20x faster than Hadoop for iterative applications, by avoiding disk I/O and serialization overhead.
It can interactively scan 1 TB of data with 5-7 s latency.
Recovery is quick: lost RDD partitions are rebuilt from lineage.
Specialized models such as Pregel and HaLoop can be built on top of Spark.
RDDs are well suited to batch applications that apply the same operation to all elements of a dataset.
References
● Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (Zaharia et al., NSDI 2012)
● SlideShare: Hadoop Summit, "Spark and Shark"