Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Presentation by Mário Almeida
Outline
Motivation
RDDs overview
Spark
Data Sharing
Example : Log Mining
Fault Tolerance
Example : Logistic Regression
RDD Representation
Evaluation
Conclusion
Motivation
How to perform large-scale data analytics?
● MapReduce
● Dryad

Problem? Overhead!!

● Reuse intermediate results? DFS? Pregel? No abstraction for general reuse!!
● How to provide fault tolerance efficiently? Shared memory? Key-value stores? Piccolo? Fine-grained updates!!
RDDs Overview
Read-only, partitioned collection of records

Created through transformations on data in
stable storage or other RDDs

Has information on the lineage of
transformations

Control over partitioning and persistence
(e.g. non-serialized in-memory storage)
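
The properties above can be sketched in a few lines of plain Python. This is only an illustration, not Spark's implementation: `ToyRDD`, its fields and its methods are hypothetical names invented for this example, and the transformations here run eagerly for simplicity.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class ToyRDD:
    """Hypothetical stand-in for an RDD: read-only partitions plus lineage."""
    partitions: Tuple[Tuple, ...]        # read-only, partitioned records
    parent: Optional["ToyRDD"] = None    # the RDD this one was derived from
    op: str = "source"                   # name of the transformation applied

    def map(self, fn: Callable) -> "ToyRDD":
        # A transformation never mutates; it returns a new RDD that
        # remembers its parent.
        parts = tuple(tuple(fn(x) for x in part) for part in self.partitions)
        return ToyRDD(parts, parent=self, op="map")

    def filter(self, pred: Callable) -> "ToyRDD":
        parts = tuple(tuple(x for x in part if pred(x)) for part in self.partitions)
        return ToyRDD(parts, parent=self, op="filter")

    def lineage(self) -> list:
        # Walk back through the parents to list every step from the source.
        chain, node = [], self
        while node is not None:
            chain.append(node.op)
            node = node.parent
        return list(reversed(chain))


# Two partitions of made-up records standing in for data in stable storage.
source = ToyRDD(partitions=((1, 2, 3), (4, 5, 6)))
evens_squared = source.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

print(evens_squared.partitions)
print(evens_squared.lineage())
```

The frozen dataclass makes the read-only property explicit: any attempt to assign to a field of an existing `ToyRDD` raises an error, so new datasets can only come from transformations.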
Spark
Exposes RDDs through a language-integrated API.

RDDs can be used in actions, which return a value or export data to a storage system
(e.g. count, collect and save).

The persist method indicates which RDDs to reuse
(default: stored in memory).
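
A rough sketch of how actions and persist interact, written in plain Python rather than against the Spark API. `LazyRDD` and its `evaluations` counter are hypothetical names used only for illustration; the assumption shown is that transformations are merely recorded and that work happens when an action runs.

```python
class LazyRDD:
    """Hypothetical sketch: transformations are recorded, actions trigger work."""

    def __init__(self, compute):
        self._compute = compute    # zero-argument function producing the records
        self._persist = False
        self._cache = None
        self.evaluations = 0       # how many times the records were computed

    # Transformations: build a new LazyRDD, compute nothing yet.
    def map(self, fn):
        return LazyRDD(lambda: [fn(x) for x in self._records()])

    def filter(self, pred):
        return LazyRDD(lambda: [x for x in self._records() if pred(x)])

    def persist(self):
        # Mark this dataset for reuse; it is kept in memory after first use.
        self._persist = True
        return self

    def _records(self):
        if self._cache is not None:
            return self._cache
        self.evaluations += 1
        data = self._compute()
        if self._persist:
            self._cache = data
        return data

    # Actions: return a value to the caller.
    def count(self):
        return len(self._records())

    def collect(self):
        return list(self._records())


numbers = LazyRDD(lambda: list(range(10)))

persisted = numbers.map(lambda x: x * x).persist()
evaluations_before_action = persisted.evaluations   # nothing computed yet
persisted_count = persisted.count()
persisted_values = persisted.collect()              # served from memory

recomputed = numbers.map(lambda x: x + 1)           # not persisted
recomputed.count()
recomputed.collect()                                # computed a second time

print(persisted.evaluations, recomputed.evaluations)
```

The persisted dataset is computed once across both actions, while the non-persisted one is recomputed for each action.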
Data Sharing in MapReduce

Overhead: replication, serialization, disk I/O!
Data Sharing in Spark

In-memory data sharing: 10-100x faster than network and disk
Example - Log Mining
Load error messages into memory and search for patterns.

1 TB in 5-7 s
(170 s for on-disk data)
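
The shape of the log mining workflow can be sketched in plain Python. The log lines, their tab-separated layout and the helper names are made up for illustration; the point is only that the error subset is extracted once, kept in memory, and then queried repeatedly.

```python
# Hypothetical tab-separated log lines: level, date, time, message.
log_lines = [
    "INFO\t2012-01-01\t10:00\tstartup complete",
    "ERROR\t2012-01-01\t10:05\tMySQL connection refused",
    "ERROR\t2012-01-01\t10:07\tHDFS block missing",
    "WARN\t2012-01-01\t10:08\tdisk nearly full",
    "ERROR\t2012-01-01\t10:09\tMySQL timeout",
]

# Extract the error messages once and keep them in memory for later queries.
errors = [line for line in log_lines if line.startswith("ERROR")]


def count_matching(pattern):
    # Each interactive query scans only the in-memory error subset.
    return sum(1 for line in errors if pattern in line)


def times_matching(pattern):
    # Return the time field (third column) of the errors mentioning the pattern.
    return [line.split("\t")[2] for line in errors if pattern in line]


total_errors = len(errors)
mysql_errors = count_matching("MySQL")
hdfs_times = times_matching("HDFS")

print(total_errors, mysql_errors, hdfs_times)
```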
Fault Tolerance
RDDs keep track of the transformations
used to build them. This lineage can be used to
recover lost data.
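
Lineage-based recovery can be illustrated with a small plain-Python sketch. The function and variable names are hypothetical, and a dictionary deletion stands in for a node failure; the idea shown is that a lost partition is rebuilt by replaying the recorded transformations on the matching source partition, without touching the partitions that survived.

```python
def build_partition(source_partition, transformations):
    """Replay the recorded transformations on a single source partition."""
    records = list(source_partition)
    for kind, fn in transformations:
        if kind == "map":
            records = [fn(x) for x in records]
        elif kind == "filter":
            records = [x for x in records if fn(x)]
        else:
            raise ValueError("unknown transformation: " + kind)
    return records


# Source data in stable storage (assumed to survive failures), by partition id.
stable_storage = {0: [1, 2, 3, 4], 1: [5, 6, 7, 8]}

# The lineage: an ordered record of how the in-memory dataset was derived.
lineage = [
    ("filter", lambda x: x % 2 == 0),
    ("map", lambda x: x * 10),
]

in_memory = {pid: build_partition(part, lineage)
             for pid, part in stable_storage.items()}

# Simulate losing the node that held partition 1.
del in_memory[1]

# Recover only the lost partition by replaying its lineage.
in_memory[1] = build_partition(stable_storage[1], lineage)

print(in_memory)
```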
Example - Logistic Regression

Data is loaded into memory once!

Repeated MapReduce steps to calculate the gradient

Many machine learning algorithms are iterative in nature
because they run iterative optimization procedures!
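
A minimal sketch of the iterative pattern, in plain Python instead of Spark. The synthetic data, the 0/1 label encoding, the learning rate and the iteration count are all assumptions chosen for illustration. The points are built once and stay in memory, and every iteration runs a map step (per-point gradient) followed by a reduce step (vector sum).

```python
import math
import random
from functools import reduce


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def point_gradient(w, point):
    # Map step: gradient contribution of a single (features, label) point.
    x, y = point
    error = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
    return [error * xi for xi in x]


def add_vectors(a, b):
    # Reduce step: element-wise sum of two gradient vectors.
    return [ai + bi for ai, bi in zip(a, b)]


# Build the dataset once; it stays in memory for every iteration.
random.seed(0)
points = []
for _ in range(200):
    feature = random.uniform(-1.0, 1.0)
    label = 1.0 if feature > 0 else 0.0
    points.append(((1.0, feature), label))   # (bias term, feature), label

w = [0.0, 0.0]
LEARNING_RATE = 1.0
for _ in range(200):
    gradients = [point_gradient(w, p) for p in points]   # map
    total = reduce(add_vectors, gradients)               # reduce
    w = [wi - LEARNING_RATE * gi / len(points) for wi, gi in zip(w, total)]


def predict(x):
    return 1.0 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0.0


accuracy = sum(predict(x) == y for x, y in points) / len(points)
print(w, accuracy)
```

Only the weight vector changes between iterations; the dataset itself is read repeatedly, which is why keeping it in memory pays off.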
Logistic Regression Performance

30 GB dataset
20 machines, 4 cores and 15 GB of RAM each
Hadoop: 127 s/iteration
Spark: first iteration 174 s, subsequent iterations 6 s
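
The per-iteration figures above can be turned into total running times with some simple arithmetic. The sketch below assumes that every Hadoop iteration costs the same and that every Spark iteration after the first costs 6 s; the function names are made up for this example.

```python
HADOOP_PER_ITERATION = 127   # seconds, from the slide
SPARK_FIRST_ITERATION = 174  # seconds, includes loading the data into memory
SPARK_LATER_ITERATION = 6    # seconds, data already in memory


def hadoop_total(iterations):
    return HADOOP_PER_ITERATION * iterations


def spark_total(iterations):
    if iterations == 0:
        return 0
    return SPARK_FIRST_ITERATION + SPARK_LATER_ITERATION * (iterations - 1)


for n in (1, 2, 10, 30):
    speedup = hadoop_total(n) / spark_total(n)
    print(n, hadoop_total(n), spark_total(n), round(speedup, 2))
```

With a single iteration Spark is slower (174 s against 127 s), but it is already ahead at two iterations (180 s against 254 s). Over 10 iterations the totals are 228 s against 1270 s, and the per-iteration ratio approaches 127 / 6, about 21x, as the number of iterations grows.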
Representing RDDs
Narrow dependencies allow pipelined execution.
Wide dependencies require data from all parents.
Wide dependencies are harder to recover!
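
The difference between the two kinds of dependency can be sketched in plain Python. The representation below (a dictionary from child partition id to the parent partition ids it reads) and the function names are hypothetical; the classification rule assumed here is that a dependency is narrow when each parent partition is used by at most one child partition, and wide otherwise.

```python
from collections import Counter


def classify(child_to_parents):
    """Return 'narrow' if no parent partition feeds more than one child."""
    uses = Counter(
        parent
        for parents in child_to_parents.values()
        for parent in set(parents)
    )
    return "narrow" if all(n <= 1 for n in uses.values()) else "wide"


def parents_needed_for_recovery(child_to_parents, lost_child):
    # Parent partitions that must be available to rebuild one lost child.
    return sorted(set(child_to_parents[lost_child]))


# map/filter style: each child partition reads exactly one parent partition.
map_deps = {0: [0], 1: [1], 2: [2]}

# groupByKey style: each child partition reads from every parent partition.
group_deps = {0: [0, 1, 2], 1: [0, 1, 2]}

print(classify(map_deps), parents_needed_for_recovery(map_deps, 1))
print(classify(group_deps), parents_needed_for_recovery(group_deps, 1))
```

Rebuilding a lost partition behind a narrow dependency needs a single parent partition, while a wide dependency needs all of them, which is why recovery is more expensive in the wide case.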
Evaluation - Iteration times
[Figure annotations: computation intensive; extra MR job to convert to binary; heartbeat protocol]
Evaluation - Number of machines
[Figure annotations: 25.3x & 20.7x; 1.9x & 3.2x]
Evaluation - Partitioning
PageRank on a 54 GB dataset, from which a
link graph of 4 million articles is built.
Evaluation - Failures
100 GB Working set
Conclusion
Spark is up to 20x faster than Hadoop for
iterative applications. (IO and serialization)

Can interactively scan 1 TB (5-7s latency).

Quick recovery (rebuilds lost RDD partitions).

Pregel/HaLoop can be built on top of Spark.

Good for batch applications that apply the
same operation to all elements of a dataset.
References
● Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
● slideshare :/Hadoop_Summit/spark-and-shark
