SlideShare une entreprise Scribd logo
1  sur  16
RDD – Overview 
(Resilient Distributed Datasets*) 
{ 
Nov 1st 2014 
Oakland CA 
By Taposh Dutta Roy 
* Source: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
Contents 
• What is RDD 
• Motivation Behind RDD 
• Use Cases for RDD 
• Challenges for RDD 
• RDD: Solve
What is RDD 
“RDDs are fault tolerant, parallel data structures 
that let users explicitly persist intermediate 
results in memory, control their partitioning to 
optimize data placement, and manipulate them 
using a rich set of operations. “ 
In a nutshell RDDs are a level of abstraction 
that enable efficient data reuse in a broad 
range of applications
Motivation behind RDD 
Current frameworks like MapReduce & 
Dyrad provide a numerous abstractions for 
accessing a cluster’s computational resources 
but lack abstractions for leveraging the 
distributed memory !!! 
Data reuse is common in many iterative 
machine learning algorithms such as – Page 
Rank, K-means Clustering & Logistic 
Regression.
Motivation behind RDD 
Another use case is when an user runs 
multiple adhoc queries on the same subset of 
data. 
Unfortunately in current frameworks, the 
only way to reuse data between 
computations i.e between two jobs is to write 
to an external storage system e.g. a 
distributed file system such as Amazon S3.
Use cases for RDD 
1. Solving Iterative problems 
Existing Solution – Slow, needs high I/O 
RDD - Fast, in memory
Use cases for RDD 
Example: Suppose I have to look at the 
webserver access logs and look for an 
error_code or certain text.
Use cases for RDD 
Example (cont’d) : I run the above code on server 
which returns a set of files with the words 
looked for grepped, closes the cluster and puts 
the file into an Amazon S3 location specified in 
the script. 
Now we look at the result files and need to 
extract some other text from this file, we will 
need to write or use another set of map-reduce 
code. This might take extra time to fetch the files, 
process and provide the results.
Use cases for RDD 
RDD solves this problem by storing the data 
in memory and providing a ability for the 
user to requery the subset.
Use cases for RDD 
2. Solving Interactive Problems 
The second use case is its usage in interactive 
algorithms such as logistic regression which need 
the data to be re-used.
Challenge for RDD 
The main challenge in designing RDD is 
defining a programming interface that can 
provide fault tolerance efficiently.
Challenges for RDD 
Existing solutions such as distributed shared 
memory, key value stores, & databases offer 
an interface based on fine-grained updates. 
With such systems, the only way to get 
fault tolerance is to replicate the data across 
machines or to log updates across machines. Both 
of these approaches are data intensive. They 
need high bandwidth to move the data over 
the cluster network and large storage.
RDD: Solve 
RDD solves these probems by providing an 
interface based on coarse grained 
transformations such as map, filter and join. 
These transformations apply the same 
operations to many data items. 
This allows them to efficiently provide fault 
tolerance by logging the transformations 
used to build a dataset (i.e. lineage) rather 
than actual data. If a partition of RDD is lost, 
the RDD has enough information about how 
it ..
RDD: Solve 
(Cont’d) was derived from other RDD to 
recompute just that partition. The lost data 
can be recovered quickly, without costly 
replication.
Applications not suitable : RDD 
RDDs would be less suitable for applications 
that make asynchronous fine grained updates 
to shared state, such as a storage system for a 
web application or an incremental web 
crawller. For such applications traditional 
update logging and data checkpointing 
such as databases.
Conclusion RDD 
RDD's goal is to provide an efficient 
programming model for batch 
analytics. 
RDD has been implemented in a system 
called SPARK.

Contenu connexe

Tendances

Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Simplilearn
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
Dvir Volk
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
HostedbyConfluent
 

Tendances (20)

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
NoSql
NoSqlNoSql
NoSql
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
 
7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
Intro to HBase
Intro to HBaseIntro to HBase
Intro to HBase
 
Cassandra Database
Cassandra DatabaseCassandra Database
Cassandra Database
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
Introduction to NOSQL databases
Introduction to NOSQL databasesIntroduction to NOSQL databases
Introduction to NOSQL databases
 
Mastering PostgreSQL Administration
Mastering PostgreSQL AdministrationMastering PostgreSQL Administration
Mastering PostgreSQL Administration
 
Hadoop Architecture and HDFS
Hadoop Architecture and HDFSHadoop Architecture and HDFS
Hadoop Architecture and HDFS
 
Hadoop hdfs
Hadoop hdfsHadoop hdfs
Hadoop hdfs
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudi
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
Apache doris (incubating) introduction
Apache doris (incubating) introductionApache doris (incubating) introduction
Apache doris (incubating) introduction
 
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
 
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 

En vedette

Simple Log Analysis and Trending
Simple Log Analysis and TrendingSimple Log Analysis and Trending
Simple Log Analysis and Trending
Mike Brittain
 

En vedette (20)

Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Resilient Distributed Datasets
Resilient Distributed DatasetsResilient Distributed Datasets
Resilient Distributed Datasets
 
Simple Log Analysis and Trending
Simple Log Analysis and TrendingSimple Log Analysis and Trending
Simple Log Analysis and Trending
 
Anatomy of Data Source API : A deep dive into Spark Data source API
Anatomy of Data Source API : A deep dive into Spark Data source APIAnatomy of Data Source API : A deep dive into Spark Data source API
Anatomy of Data Source API : A deep dive into Spark Data source API
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
 
Resilient Distributed Datasets
Resilient Distributed DatasetsResilient Distributed Datasets
Resilient Distributed Datasets
 
BDAS RDD study report v1.2
BDAS RDD study report v1.2BDAS RDD study report v1.2
BDAS RDD study report v1.2
 
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
 East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Writing your own RDD for fun and profit
Writing your own RDD for fun and profitWriting your own RDD for fun and profit
Writing your own RDD for fun and profit
 
Transformations and actions a visual guide training
Transformations and actions a visual guide trainingTransformations and actions a visual guide training
Transformations and actions a visual guide training
 
IBM Spark Meetup - RDD & Spark Basics
IBM Spark Meetup - RDD & Spark BasicsIBM Spark Meetup - RDD & Spark Basics
IBM Spark Meetup - RDD & Spark Basics
 
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerDeep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
 
Spark RDD : Transformations & Actions
Spark RDD : Transformations & ActionsSpark RDD : Transformations & Actions
Spark RDD : Transformations & Actions
 
หนังสือภาษาไทย Spark Internal
หนังสือภาษาไทย Spark Internalหนังสือภาษาไทย Spark Internal
หนังสือภาษาไทย Spark Internal
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
BDM26: Spark Summit 2014 Debriefing
BDM26: Spark Summit 2014 DebriefingBDM26: Spark Summit 2014 Debriefing
BDM26: Spark Summit 2014 Debriefing
 
BDM32: AdamCloud Project - Part II
BDM32: AdamCloud Project - Part IIBDM32: AdamCloud Project - Part II
BDM32: AdamCloud Project - Part II
 

Similaire à Resilient Distributed DataSets - Apache SPARK

Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Sameer Farooqui
 
Iaetsd mapreduce streaming over cassandra datasets
Iaetsd mapreduce streaming over cassandra datasetsIaetsd mapreduce streaming over cassandra datasets
Iaetsd mapreduce streaming over cassandra datasets
Iaetsd Iaetsd
 
Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce
cscpconf
 
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENTLARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
ijwscjournal
 

Similaire à Resilient Distributed DataSets - Apache SPARK (20)

Spark cluster computing with working sets
Spark cluster computing with working setsSpark cluster computing with working sets
Spark cluster computing with working sets
 
A big-data architecture for real-time analytics
A big-data architecture for real-time analyticsA big-data architecture for real-time analytics
A big-data architecture for real-time analytics
 
Big Data: RDBMS vs. Hadoop vs. Spark
Big Data: RDBMS vs. Hadoop vs. SparkBig Data: RDBMS vs. Hadoop vs. Spark
Big Data: RDBMS vs. Hadoop vs. Spark
 
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
 
Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive
 
Study Notes: Apache Spark
Study Notes: Apache SparkStudy Notes: Apache Spark
Study Notes: Apache Spark
 
Iaetsd mapreduce streaming over cassandra datasets
Iaetsd mapreduce streaming over cassandra datasetsIaetsd mapreduce streaming over cassandra datasets
Iaetsd mapreduce streaming over cassandra datasets
 
Cloud Computing Ambiance using Secluded Access Control Method
Cloud Computing Ambiance using Secluded Access Control MethodCloud Computing Ambiance using Secluded Access Control Method
Cloud Computing Ambiance using Secluded Access Control Method
 
Database
DatabaseDatabase
Database
 
Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System
 
Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce
 
Redis vs Memcached
Redis vs MemcachedRedis vs Memcached
Redis vs Memcached
 
Hadoop by kamran khan
Hadoop by kamran khanHadoop by kamran khan
Hadoop by kamran khan
 
MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATION
MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATIONMAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATION
MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATION
 
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
 
Harnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution TimesHarnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution Times
 
D04501036040
D04501036040D04501036040
D04501036040
 
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENTLARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
 
Big Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – HadoopBig Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – Hadoop
 
G017143640
G017143640G017143640
G017143640
 

Plus de Taposh Roy

Airline industry analysis - Boeing & Airbus
Airline industry analysis - Boeing & AirbusAirline industry analysis - Boeing & Airbus
Airline industry analysis - Boeing & Airbus
Taposh Roy
 
Energy industry report
Energy industry reportEnergy industry report
Energy industry report
Taposh Roy
 
Consumer electronics bm_retail
Consumer electronics bm_retailConsumer electronics bm_retail
Consumer electronics bm_retail
Taposh Roy
 
Competitor Analysis for RSG Consulting
Competitor Analysis for RSG ConsultingCompetitor Analysis for RSG Consulting
Competitor Analysis for RSG Consulting
Taposh Roy
 
Sprint softbank (Merger Analysis)
Sprint softbank (Merger Analysis)Sprint softbank (Merger Analysis)
Sprint softbank (Merger Analysis)
Taposh Roy
 
M a analysis_roche_genentech
M a analysis_roche_genentechM a analysis_roche_genentech
M a analysis_roche_genentech
Taposh Roy
 
Land rover north america (HBS 9-596036)
Land rover north america (HBS 9-596036)Land rover north america (HBS 9-596036)
Land rover north america (HBS 9-596036)
Taposh Roy
 
Understandingplatform
UnderstandingplatformUnderstandingplatform
Understandingplatform
Taposh Roy
 

Plus de Taposh Roy (20)

Image annotation - Segmentation & Annotation
Image annotation - Segmentation & AnnotationImage annotation - Segmentation & Annotation
Image annotation - Segmentation & Annotation
 
Wal mart health_care_2017_dec
Wal mart health_care_2017_decWal mart health_care_2017_dec
Wal mart health_care_2017_dec
 
Predictive modeling healthcare
Predictive modeling healthcarePredictive modeling healthcare
Predictive modeling healthcare
 
Basic elements-of-strategy-framework
Basic elements-of-strategy-frameworkBasic elements-of-strategy-framework
Basic elements-of-strategy-framework
 
Kaggle bikeshare Competition - Part 1
Kaggle bikeshare Competition  - Part 1Kaggle bikeshare Competition  - Part 1
Kaggle bikeshare Competition - Part 1
 
Airline industry analysis - Boeing & Airbus
Airline industry analysis - Boeing & AirbusAirline industry analysis - Boeing & Airbus
Airline industry analysis - Boeing & Airbus
 
Energy industry report
Energy industry reportEnergy industry report
Energy industry report
 
Consumer electronics bm_retail
Consumer electronics bm_retailConsumer electronics bm_retail
Consumer electronics bm_retail
 
Multi Asset Endowment Investment Strategy
Multi Asset Endowment Investment StrategyMulti Asset Endowment Investment Strategy
Multi Asset Endowment Investment Strategy
 
Competitor Analysis for RSG Consulting
Competitor Analysis for RSG ConsultingCompetitor Analysis for RSG Consulting
Competitor Analysis for RSG Consulting
 
Financial Analysis boeing airbus
Financial Analysis boeing airbusFinancial Analysis boeing airbus
Financial Analysis boeing airbus
 
Sprint softbank (Merger Analysis)
Sprint softbank (Merger Analysis)Sprint softbank (Merger Analysis)
Sprint softbank (Merger Analysis)
 
M a analysis_roche_genentech
M a analysis_roche_genentechM a analysis_roche_genentech
M a analysis_roche_genentech
 
Land rover north america (HBS 9-596036)
Land rover north america (HBS 9-596036)Land rover north america (HBS 9-596036)
Land rover north america (HBS 9-596036)
 
American airlines - Value Pricing 1992
American airlines - Value Pricing 1992American airlines - Value Pricing 1992
American airlines - Value Pricing 1992
 
Strategy frameworks-and-models
Strategy frameworks-and-modelsStrategy frameworks-and-models
Strategy frameworks-and-models
 
Tesla in UAE (Financial Strategy)
Tesla in UAE (Financial Strategy)Tesla in UAE (Financial Strategy)
Tesla in UAE (Financial Strategy)
 
Understandingplatform
UnderstandingplatformUnderstandingplatform
Understandingplatform
 
Disney hbs9 701-035
Disney hbs9 701-035Disney hbs9 701-035
Disney hbs9 701-035
 
Best buy-analysis
Best buy-analysisBest buy-analysis
Best buy-analysis
 

Dernier

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
ssuser89054b
 
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
dharasingh5698
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
dharasingh5698
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
dollysharma2066
 
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoorTop Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
dharasingh5698
 

Dernier (20)

Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
 
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
 
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced LoadsFEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torque
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leap
 
Minimum and Maximum Modes of microprocessor 8086
Minimum and Maximum Modes of microprocessor 8086Minimum and Maximum Modes of microprocessor 8086
Minimum and Maximum Modes of microprocessor 8086
 
DC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationDC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equation
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdf
 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS Lambda
 
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoorTop Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
Top Rated Call Girls In chittoor 📱 {7001035870} VIP Escorts chittoor
 

Resilient Distributed DataSets - Apache SPARK

  • 1. RDD – Overview (Resilient Distributed Datasets*) { Nov 1st 2014 Oakland CA By Taposh Dutta Roy * Source: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
  • 2. Contents • What is RDD • Motivation Behind RDD • Use Cases for RDD • Challenges for RDD • RDD: Solve
  • 3. What is RDD “RDDs are fault tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operations. “ In a nutshell RDDs are a level of abstraction that enable efficient data reuse in a broad range of applications
  • 4. Motivation behind RDD Current frameworks like MapReduce & Dyrad provide a numerous abstractions for accessing a cluster’s computational resources but lack abstractions for leveraging the distributed memory !!! Data reuse is common in many iterative machine learning algorithms such as – Page Rank, K-means Clustering & Logistic Regression.
  • 5. Motivation behind RDD Another use case is when an user runs multiple adhoc queries on the same subset of data. Unfortunately in current frameworks, the only way to reuse data between computations i.e between two jobs is to write to an external storage system e.g. a distributed file system such as Amazon S3.
  • 6. Use cases for RDD 1. Solving Iterative problems Existing Solution – Slow, needs high I/O RDD - Fast, in memory
  • 7. Use cases for RDD Example: Suppose I have to look at the webserver access logs and look for an error_code or certain text.
  • 8. Use cases for RDD Example (cont’d) : I run the above code on server which returns a set of files with the words looked for grepped, closes the cluster and puts the file into an Amazon S3 location specified in the script. Now we look at the result files and need to extract some other text from this file, we will need to write or use another set of map-reduce code. This might take extra time to fetch the files, process and provide the results.
  • 9. Use cases for RDD RDD solves this problem by storing the data in memory and providing a ability for the user to requery the subset.
  • 10. Use cases for RDD 2. Solving Interactive Problems The second use case is its usage in interactive algorithms such as logistic regression which need the data to be re-used.
  • 11. Challenge for RDD The main challenge in designing RDD is defining a programming interface that can provide fault tolerance efficiently.
  • 12. Challenges for RDD Existing solutions such as distributed shared memory, key value stores, & databases offer an interface based on fine-grained updates. With such systems, the only way to get fault tolerance is to replicate the data across machines or to log updates across machines. Both of these approaches are data intensive. They need high bandwidth to move the data over the cluster network and large storage.
  • 13. RDD: Solve RDD solves these probems by providing an interface based on coarse grained transformations such as map, filter and join. These transformations apply the same operations to many data items. This allows them to efficiently provide fault tolerance by logging the transformations used to build a dataset (i.e. lineage) rather than actual data. If a partition of RDD is lost, the RDD has enough information about how it ..
  • 14. RDD: Solve (Cont’d) was derived from other RDD to recompute just that partition. The lost data can be recovered quickly, without costly replication.
  • 15. Applications not suitable : RDD RDDs would be less suitable for applications that make asynchronous fine grained updates to shared state, such as a storage system for a web application or an incremental web crawller. For such applications traditional update logging and data checkpointing such as databases.
  • 16. Conclusion RDD RDD's goal is to provide an efficient programming model for batch analytics. RDD has been implemented in a system called SPARK.