Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Matei Zaharia, Mosharaf Chowdhury...
2012, University of California, Berkeley
OUTLINE 
• Introduction 
• Resilient Distributed Datasets (RDDs) 
• Representing RDDs 
• Evaluation 
• Conclusion
Introduction 
Cluster computing frameworks like MapReduce perform poorly on
iterative machine learning and graph algorithms, because reusing data
between steps incurs data replication, disk I/O, and serialization.
Introduction 
Pregel is a system for iterative graph computations that 
keeps intermediate data in memory, while HaLoop 
offers an iterative MapReduce interface. 
However, these frameworks support only specific computation patterns.
They do not provide abstractions for more general reuse.
Introduction 
RDDs define a programming interface that can
provide fault tolerance efficiently.
RDDs vs. distributed shared memory:
• RDDs: coarse-grained transformations
(e.g., map, filter and join)
• Distributed shared memory: fine-grained updates to mutable state
• RDDs recover lost data using lineage
Resilient Distributed 
Datasets (RDDs) 
RDD transformations are lazy operations that define a
new RDD, while actions launch a computation to
return a value to the program or write data to external
storage.
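The distinction between lazy transformations and actions can be illustrated without Spark. The following is a minimal sketch that uses a plain Scala collection view as a stand-in for an RDD; the object name `LazyVsAction`, the `demo` helper, and the sample log lines are all hypothetical and are not part of the Spark API.

```scala
object LazyVsAction {
  // Returns (records inspected before the action, error count, records inspected after).
  def demo(lines: Seq[String]): (Int, Int, Int) = {
    var inspected = 0

    // "Transformation": defining the filtered view does not run the predicate yet.
    val errors = lines.view.filter { line =>
      inspected += 1
      line.startsWith("ERROR")
    }
    val beforeAction = inspected

    // "Action": asking for the size forces evaluation and returns a value.
    val count = errors.size

    (beforeAction, count, inspected)
  }
}
```

Calling `demo` on a small log shows that no record is inspected until the action runs, which is the behavior the slide describes for RDD transformations.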
Resilient Distributed 
Datasets (RDDs) 
An RDD is a read-only, partitioned collection of records.
It can only be created from (1) data in stable storage or (2) other
RDDs.
val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))
errors.count()
Resilient Distributed 
Datasets (RDDs) 
RDD1: val lines = spark.textFile("hdfs://...")
RDD2: val errors = lines.filter(_.startsWith("ERROR"))
Long: val number = errors.count()
RDD1 -> (transformation) -> RDD2 -> (action) -> Long
Resilient Distributed 
Datasets (RDDs) 
DEMO
Resilient Distributed 
Datasets (RDDs) 
RDD1: val lines = spark.textFile("hdfs://...")
RDD2: val errors = lines.filter(_.startsWith("ERROR"))
RDD3: val error = errors.persist() (or errors.cache())
RDD3 (error) is kept in memory once it has been computed.
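The effect of persisting a dataset can be sketched with a small stand-in class. This is an illustration under stated assumptions, not Spark's implementation: `Dataset`, `computations`, and the sample log are hypothetical names, and only the idea that `persist()` marks a dataset for in-memory reuse comes from the slide.

```scala
object PersistSketch {
  // Hypothetical stand-in for an RDD: `compute` rebuilds the records from scratch.
  final class Dataset[A](compute: () => Vector[A]) {
    private var persistRequested = false
    private var cached: Option[Vector[A]] = None

    // Like persist()/cache(): only marks the dataset; nothing is computed yet.
    def persist(): this.type = { persistRequested = true; this }

    // Action: reuse the in-memory copy if there is one, otherwise recompute.
    def collect(): Vector[A] = cached.getOrElse {
      val result = compute()
      if (persistRequested) cached = Some(result)
      result
    }

    def count(): Long = collect().size.toLong
  }

  // Runs two actions on a filtered log and reports how often the filter was recomputed.
  def computations(persist: Boolean): Int = {
    var runs = 0
    val log = Vector("ERROR a", "INFO b", "ERROR c")
    val base = new Dataset[String](() => { runs += 1; log.filter(_.startsWith("ERROR")) })
    val errors = if (persist) base.persist() else base
    errors.count()
    errors.count()
    runs
  }
}
```

Without `persist()` each action recomputes the filter; with it, the second action reads the copy kept in memory.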
Resilient Distributed 
Datasets (RDDs) 
Lineage: fault tolerance
RDD1 -> (transformation) -> RDD2 -> (action) -> Long
If RDD2 is lost, recompute it from RDD1 by reapplying the
transformation, producing a new RDD2.
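Lineage-based recovery can be sketched as follows. This is a simplified model, assuming a dataset remembers its parent and the function that derived it; `Source`, `Filtered`, `lose`, and `isInMemory` are hypothetical names chosen for this illustration.

```scala
object LineageSketch {
  // Hypothetical model: a dataset exposes its partitions by index.
  sealed trait Node { def partition(i: Int): Vector[String] }

  // Data in stable storage: it can always be read again.
  final case class Source(parts: Vector[Vector[String]]) extends Node {
    def partition(i: Int): Vector[String] = parts(i)
  }

  // A filtered dataset keeps materialized partitions in memory,
  // plus its lineage: the parent and the predicate used to derive it.
  final class Filtered(parent: Node, keep: String => Boolean) extends Node {
    private val inMemory = scala.collection.mutable.Map.empty[Int, Vector[String]]

    // Serve from memory, or rebuild just this partition from the parent.
    def partition(i: Int): Vector[String] =
      inMemory.getOrElseUpdate(i, parent.partition(i).filter(keep))

    // Simulates losing the in-memory copy of one partition (e.g. a node failure).
    def lose(i: Int): Unit = { inMemory.remove(i); () }

    def isInMemory(i: Int): Boolean = inMemory.contains(i)
  }
}
```

Only the lost partition is rebuilt, and no replica of the derived data is needed, because the recipe for recomputing it is stored instead.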
Resilient Distributed 
Datasets (RDDs) 
Spark provides the RDD abstraction through a
language-integrated API in Scala,
a functional programming language for the Java VM.
Representing RDDs 
Dependencies between RDDs:
• Narrow dependencies: allow for pipelined execution on
one cluster node
• Wide dependencies: require data from all parent
partitions to be available and to be shuffled across the
nodes using a MapReduce-like operation
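The two kinds of dependency can be contrasted in a small sketch that models an RDD as a vector of partitions. The names `mapPartitions` and `reduceByKey` here are hypothetical helpers defined in this block; they are not the Spark methods of the same name, and hash partitioning by key is an assumption made for illustration.

```scala
object DependencySketch {
  type Partition[A] = Vector[A]

  // Narrow: each output partition depends on exactly one parent partition,
  // so the function can run where that partition already lives.
  def mapPartitions[A, B](parts: Vector[Partition[A]])(f: A => B): Vector[Partition[B]] =
    parts.map(_.map(f))

  // Wide: an output partition may need records from every parent partition,
  // so records are first redistributed (shuffled) by key.
  def reduceByKey(parts: Vector[Partition[(String, Int)]],
                  numOut: Int): Vector[Partition[(String, Int)]] = {
    val shuffled = parts.flatten.groupBy { case (k, _) => Math.floorMod(k.hashCode, numOut) }
    Vector.tabulate(numOut) { p =>
      shuffled.getOrElse(p, Vector.empty[(String, Int)])
        .groupBy(_._1)
        .map { case (k, kvs) => k -> kvs.map(_._2).sum }
        .toVector
        .sortBy(_._1)
    }
  }
}
```

In the narrow case the partition layout is unchanged, while in the wide case all records with the same key end up in the same output partition regardless of where they started.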
Representing RDDs 
[Figure: dependencies within the same node vs. across different nodes]
Representing RDDs 
How Spark computes job stages
[Figure legend: partition; RDD; RDD in memory]
Resilient Distributed 
Datasets (RDDs) 
Each stage contains as many pipelined transformations
with narrow dependencies as possible,
because this avoids shuffling data across the nodes.
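The stage rule above can be sketched for the simplest case, a linear chain of operations. This is a deliberately reduced model and not Spark's scheduler: `Op`, `Narrow`, `Wide`, and `stages` are hypothetical names, and placing the wide operation at the start of the next stage is a simplifying assumption.

```scala
object StageSketch {
  sealed trait Dep
  case object Narrow extends Dep
  case object Wide extends Dep

  // One operation in a linear plan, tagged with its dependency on the previous one.
  final case class Op(name: String, dep: Dep)

  // Splits the plan into stages: a wide dependency starts a new stage,
  // narrow ones are pipelined into the current stage.
  def stages(ops: List[Op]): List[List[String]] =
    ops.foldLeft(List.empty[List[String]]) {
      case (Nil, op)                            => List(List(op.name))
      case (acc, Op(name, Wide))                => List(name) :: acc
      case (current :: done, Op(name, Narrow))  => (name :: current) :: done
    }.map(_.reverse).reverse
}
```

For a plan such as textFile, filter, map, reduceByKey, mapValues, the first three operations form one pipelined stage and the shuffle introduces a second.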
Evaluation 
Setup: Amazon EC2 m1.xlarge nodes with 4 cores and
15 GB of RAM. We used HDFS for storage, with
256 MB blocks.
Evaluation 
10 iterations on 100 GB datasets using 25–100
machines.
Workloads: logistic regression and k-means.
Logistic regression is less compute-intensive and thus more
sensitive to time spent in deserialization and I/O.
Evaluation 
HadoopBinMem: a Hadoop deployment that converts the input data to a binary format and keeps it in memory.
Evaluation 
PageRank
54 GB Wikipedia dump, 4 million articles.
10 iterations
Evaluation 
PageRank: 10 iterations
Evaluation 
Fault recovery
k-means
100 GB of data, 75 nodes, 10 iterations
One node fails at the start of the 6th iteration.
Evaluation 
k-means: 100 GB of data, 75 nodes, 10 iterations
Evaluation 
Behavior with Insufficient Memory 
Logistic regression
100 GB of data, 25 machines
Evaluation 
k-means: 100 GB of data, 25 machines
Conclusion 
RDDs are an efficient, general-purpose and fault-tolerant
abstraction for sharing data in cluster applications.
RDDs offer an API based on coarse-grained
transformations that lets them recover data efficiently
using lineage.
Spark is up to 20× faster than Hadoop in iterative applications and
can be used interactively to query hundreds of gigabytes
of data.