RDD


Tien-Yang (Aiden) Wu

Reference: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing


Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Matei Zaharia, Mosharaf Chowdhury...
2012 University of California, Berkeley

OUTLINE
• Introduction
• Resilient Distributed Datasets (RDDs)
• Representing RDDs
• Evaluation
• Conclusion

Introduction
Cluster computing frameworks like MapReduce perform poorly on
iterative machine learning and graph algorithms, because reusing data
between computations incurs data replication, disk I/O, and serialization overhead.

Introduction
Pregel is a system for iterative graph computations that
keeps intermediate data in memory, while HaLoop
offers an iterative MapReduce interface.
However, these frameworks support only specific computation patterns.
They do not provide abstractions for more general
reuse.

Introduction
RDDs define a programming interface that can
provide fault tolerance efficiently.
RDDs vs. distributed shared memory:
• RDDs: coarse-grained transformations
(e.g., map, filter and join), with fault tolerance through lineage
• Distributed shared memory: fine-grained updates to mutable state

Resilient Distributed
Datasets (RDDs)
Transformations on RDDs are lazy operations that define a
new RDD, while actions launch a computation to
return a value to the program or write data to external
storage.
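The difference between lazy transformations and actions can be illustrated with a small toy model. This is only a sketch in plain Scala, not Spark's API: `ToyRDD`, `fromSeq` and the `inspected` counter are hypothetical names introduced here for illustration.

```scala
// Toy model (NOT Spark's API): transformations only record what to do,
// actions actually run the computation.
object LazySketch {
  // Counts how many records the filter predicate has inspected so far.
  var inspected = 0

  // A ToyRDD wraps a function that produces its records on demand.
  final class ToyRDD[A](compute: () => Seq[A]) {
    // Transformations: define a new ToyRDD, no work is done yet.
    def map[B](f: A => B): ToyRDD[B] = new ToyRDD(() => compute().map(f))
    def filter(p: A => Boolean): ToyRDD[A] = new ToyRDD(() => compute().filter(p))
    // Actions: run the computation and return a value to the caller.
    def count(): Long = compute().size.toLong
    def collect(): Seq[A] = compute()
  }

  def fromSeq[A](data: Seq[A]): ToyRDD[A] = new ToyRDD(() => data)

  // Hypothetical in-memory input standing in for a file in stable storage.
  val lines = fromSeq(Seq("ERROR disk full", "INFO started", "ERROR timeout"))

  // Defining the filter does not inspect any record.
  val errors = lines.filter { l => inspected += 1; l.startsWith("ERROR") }
}
```

In this sketch, `inspected` stays at zero until `errors.count()` is called, which mirrors how an action triggers the work that transformations only describe.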

Resilient Distributed
Datasets (RDDs)
An RDD is a read-only, partitioned collection of records.
It can only be created through deterministic operations on (1) data in stable storage or (2) other
RDDs.
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
errors.count()

Resilient Distributed
Datasets (RDDs)
RDD1: lines = spark.textFile("hdfs://...")
RDD2: errors = lines.filter(_.startsWith("ERROR"))
Long: number = errors.count()
RDD1 -> (transformation) -> RDD2 -> (action) -> Long

Resilient Distributed
Datasets (RDDs)
DEMO

Resilient Distributed
Datasets (RDDs)
RDD1: lines = spark.textFile("hdfs://...")
RDD2: errors = lines.filter(_.startsWith("ERROR"))
RDD3: error = errors.persist() (or errors.cache())
RDD3 (error) will be kept in memory once it has been computed.
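The effect of persisting a dataset can be sketched with a toy model in plain Scala. This is not Spark's implementation: `ToyRDD`, `records` and the `computations` counter are hypothetical names, and the input is a small in-memory sequence rather than an HDFS file.

```scala
// Toy model (NOT Spark's API): persist() marks a dataset to be kept in memory
// after the first action computes it.
object PersistSketch {
  var computations = 0 // how many times the source data was produced

  final class ToyRDD[A](compute: () => Seq[A]) {
    private var cached: Option[Seq[A]] = None
    private var keep = false

    // Mark this dataset for reuse; nothing is computed at this point.
    def persist(): ToyRDD[A] = { keep = true; this }

    // Return the records, reusing the in-memory copy when one exists.
    private def records(): Seq[A] = cached.getOrElse {
      val out = compute()
      if (keep) cached = Some(out)
      out
    }

    def filter(p: A => Boolean): ToyRDD[A] = new ToyRDD(() => records().filter(p))
    def count(): Long = records().size.toLong
  }

  // Hypothetical source; every evaluation increments the counter.
  val lines = new ToyRDD(() => { computations += 1; Seq("ERROR a", "INFO b", "ERROR c") })
  val errors = lines.filter(_.startsWith("ERROR")).persist()
}
```

Calling `errors.count()` twice reads the source only once in this sketch, because the second call is served from the in-memory copy.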

Resilient Distributed
Datasets (RDDs)
Lineage: fault tolerance
RDD1 -> (transformation) -> RDD2 -> (action) -> Long
If RDD2 is lost, Spark re-runs the transformation on RDD1 (recomputing RDD1 from stable storage if needed) to produce a new RDD2.
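Lineage-based recovery can be sketched as follows, assuming a toy setting where the input lives in an always-available `source` and the derived partitions are held in a mutable map. All names here (`source`, `lineage`, `inMemory`, `partition`) are hypothetical and the code does not use Spark.

```scala
// Toy model (NOT Spark's API): a lost partition is rebuilt by re-applying
// the recorded transformation to its parent partition.
object LineageSketch {
  // Stable storage: input partitions that can always be re-read.
  val source: Vector[Vector[String]] = Vector(
    Vector("ERROR a", "INFO b"),
    Vector("INFO c", "ERROR d", "ERROR e")
  )

  // Lineage of the derived dataset: the function applied to each parent partition.
  val lineage: Vector[String] => Vector[String] = _.filter(_.startsWith("ERROR"))

  // In-memory copies of the derived partitions, keyed by partition index.
  val inMemory = scala.collection.mutable.Map[Int, Vector[String]]()

  // Return partition i, recomputing it from its parent if it is missing.
  def partition(i: Int): Vector[String] =
    inMemory.getOrElseUpdate(i, lineage(source(i)))

  // Action: count the records across all partitions.
  def count(): Long = source.indices.map(i => partition(i).size.toLong).sum
}
```

Removing an entry from `inMemory` simulates a node failure; the next `count()` rebuilds only the missing partition, without replicating the derived data in advance.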

Resilient Distributed
Datasets (RDDs)
Spark provides the RDD abstraction through a
language-integrated API in Scala,
a functional programming language for the Java VM.

Representing RDDs
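Dependencies between RDDs
• Narrow dependencies: each partition of the parent RDD is used by at most one partition of the child RDD. They allow for pipelined execution on
one cluster node.
• Wide dependencies: multiple child partitions may depend on one parent partition. They require data from all parent
partitions to be available and to be shuffled across the
nodes using a MapReduce-like operation.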
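The two kinds of dependency can be contrasted with a small sketch over local collections, where a dataset is modelled as a vector of partitions. This is an illustration under that assumption, not Spark code; `mapPartitions` and `groupByKey` below are hypothetical helper functions that only borrow familiar names.

```scala
// Toy model (NOT Spark's API): a dataset is a Vector of partitions.
object DependencySketch {
  type Partition[A] = Vector[A]

  // Narrow dependency: each output partition is computed from exactly one
  // parent partition, so partitions can be processed independently.
  def mapPartitions[A, B](parent: Vector[Partition[A]])(f: A => B): Vector[Partition[B]] =
    parent.map(_.map(f))

  // Wide dependency: an output partition may need records from every parent
  // partition, so records are redistributed (shuffled) by key first.
  def groupByKey[K, V](parent: Vector[Partition[(K, V)]],
                       numPartitions: Int): Vector[Map[K, Vector[V]]] = {
    val all = parent.flatten // reads every parent partition
    (0 until numPartitions).toVector.map { p =>
      all.filter { case (k, _) => Math.floorMod(k.hashCode, numPartitions) == p }
        .groupBy(_._1)
        .map { case (k, kvs) => k -> kvs.map(_._2) }
    }
  }
}
```

The narrow case never looks outside its own partition, while the wide case has to flatten all parent partitions before it can place each key, which is the step that becomes a network shuffle on a cluster.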

Representing RDDs
[Figure: narrow dependencies (computed on the same node) vs. wide dependencies (data from different nodes)]

Representing RDDs
How Spark computes job stages
[Figure legend: partition, RDD, RDD in memory]

Resilient Distributed
Datasets (RDDs)
Each stage contains as many pipelined transformations
with narrow dependencies as possible,
because this avoids shuffling data across the nodes.
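The idea of cutting a job into stages at wide dependencies can be sketched for a simple linear chain of operations. This is a simplified model, assuming each operation has a single parent; Spark's scheduler works on a general dependency graph, and the names `Op`, `Narrow`, `Wide` and `stages` are hypothetical.

```scala
// Toy model (NOT Spark's scheduler): split a linear chain of operations
// into stages, starting a new stage at every wide dependency.
object StageSketch {
  sealed trait Dep
  case object Narrow extends Dep
  case object Wide extends Dep

  // One transformation, tagged with the kind of dependency on its parent.
  final case class Op(name: String, dep: Dep)

  def stages(chain: List[Op]): List[List[String]] =
    chain.foldLeft(List.empty[List[String]]) {
      // The first operation always opens the first stage.
      case (Nil, op) => List(List(op.name))
      // A wide dependency needs a shuffle, so it opens a new stage.
      case (done, Op(name, Wide)) => List(name) :: done
      // A narrow dependency is pipelined into the current stage.
      case (current :: done, Op(name, Narrow)) => (name :: current) :: done
    }.map(_.reverse).reverse
}
```

For a chain such as map, filter, groupByKey, map, join (with groupByKey and join marked as wide), this sketch yields three stages: the first two narrow operations together, then groupByKey with the following map, then join.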

Evaluation
Amazon EC2: m1.xlarge nodes with 4 cores and
15 GB of RAM. We used HDFS for storage, with
256 MB blocks.

Evaluation
Logistic regression and k-means: 10 iterations on 100 GB datasets using 25–100
machines.
Logistic regression is less compute-intensive and thus more
sensitive to time spent in deserialization and I/O.

Evaluation
HadoopBinMem: a Hadoop deployment that converts the input data to a binary format in the first iteration and stores it in an in-memory HDFS instance.

Evaluation
PageRank
54 GB Wikipedia dump, 4 million articles.
Iterations: 10
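As a reminder of why PageRank is iterative, here is a minimal sketch over local Scala collections. It uses a common normalized formulation with a damping factor of 0.85, which is an assumption and not necessarily the exact variant or constants used in the benchmark; `step` and `run` are hypothetical names, and every page is assumed to have at least one outgoing link.

```scala
// Toy PageRank on local collections (NOT Spark code).
// Assumes every page appears as a key in `links` and has at least one out-link.
object PageRankSketch {
  // One iteration: each page sends rank / outDegree to every page it links to.
  def step(links: Map[String, Seq[String]],
           ranks: Map[String, Double],
           damping: Double): Map[String, Double] = {
    val contribs: Seq[(String, Double)] = links.toSeq.flatMap { case (page, outs) =>
      outs.map(dest => dest -> (ranks(page) / outs.size))
    }
    // Sum the contributions received by each page.
    val sums = contribs.groupBy(_._1).map { case (p, cs) => p -> cs.map(_._2).sum }
    links.keys.map { p =>
      p -> ((1 - damping) / links.size + damping * sums.getOrElse(p, 0.0))
    }.toMap
  }

  // Repeat the step a fixed number of times, starting from uniform ranks.
  def run(links: Map[String, Seq[String]],
          iterations: Int,
          damping: Double = 0.85): Map[String, Double] = {
    val initial = links.keys.map(p => p -> (1.0 / links.size)).toMap
    (1 to iterations).foldLeft(initial)((ranks, _) => step(links, ranks, damping))
  }
}
```

Each iteration reuses the same `links` structure and the ranks from the previous round, which is the data-reuse pattern that benefits from keeping datasets in memory.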

Evaluation
Fault recovery
k-means: 100 GB data, 75 nodes, 10 iterations
One node fails at the start of the 6th iteration.

Evaluation
k-means: 100 GB data, 75 nodes, 10 iterations

Evaluation
Behavior with insufficient memory
Logistic regression: 100 GB data, 25 machines

Conclusion
RDDs are an efficient, general-purpose and fault-tolerant
abstraction for sharing data in cluster applications.
RDDs offer an API based on coarse-grained
transformations that lets them recover data efficiently
using lineage.
Spark is up to 20× faster than Hadoop in iterative applications and
can be used interactively to query hundreds of gigabytes
of data.
