Anatomy of RDD
A deep dive into the RDD data structure
https://github.com/phatak-dev/anatomy-of-rdd
● Madhukara Phatak
● Big data consultant and trainer at datamantra.io
● Consults in Hadoop, Spark and Scala
● www.madhukaraphatak.com
Agenda
● What is RDD?
● Immutable and Distributed
● Partitions
● Laziness
● Caching
● Extending Spark API
What is RDD?
Resilient Distributed Dataset
- A big collection of data with the following properties
- Immutable
- Distributed
- Lazily evaluated
- Type inferred
- Cacheable
Immutable and Distributed
Partitions
● Logical division of data
● Derived from Hadoop Map/Reduce
● All input, intermediate and output data is represented as partitions
● Partitions are the basic unit of parallelism
● RDD data is just a collection of partitions
Partition from Input Data
Data in HDFS
Chunk 1 Chunk 2 Chunk 3
Input Format
RDD
Partition 1 Partition 2 Partition 3
Partition example
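The partition example slide is not reproduced in this export; a minimal sketch of inspecting an RDD's partitions, assuming a local Spark setup (the app name and partition count are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("partitions").setMaster("local[2]"))
    // Explicitly ask for 4 partitions; textFile would instead derive them
    // from the input format's splits (e.g. HDFS chunks)
    val rdd = sc.parallelize(1 to 100, 4)
    // An RDD is just a collection of partitions
    println(rdd.partitions.length) // 4
    sc.stop()
  }
}
```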
Partition and Immutability
● All partitions are immutable
● Every transformation generates new partitions
● Partition immutability is driven by the underlying storage, such as HDFS
● Partition immutability allows for fault recovery
Partitions and Distribution
● Partitions derived from HDFS are distributed by default
● Partitions are also location aware
● Location awareness of partitions allows for data locality
● For computed data, caching lets us distribute partitions in memory as well
Accessing partitions
● We can access a whole partition at a time rather than a single row
● The mapPartitions API of RDD allows us to do that
● Accessing a partition at a time lets us do partition-wise operations which cannot be done by accessing a single row at a time
Map partition example
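The map partition example slide is not reproduced here; a small sketch of the idea, assuming the spark-shell where `sc` is predefined:

```scala
// Sum each partition in a single pass: one result per partition,
// something a plain row-by-row map cannot express
val rdd = sc.parallelize(1 to 100, 4)
val perPartitionSums = rdd.mapPartitions(iter => Iterator(iter.sum))
println(perPartitionSums.collect().toList) // 4 partial sums, one per partition
```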
Partition for transformed Data
● Partitioning will be different for key/value pairs that are generated by a shuffle operation
● Partitioning is driven by the partitioner specified
● By default, HashPartitioner is used
● You can also use your own partitioner
Hash Partitioning
Input RDD
Partition 1 Partition 2 Partition 3
Hash(key) % no. of partitions
Output RDD Partition 1 Partition 2
Hash partition example
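The hash partition example slide is not reproduced here; the rule from the diagram can be sketched in plain Scala (Spark's HashPartitioner uses a non-negative modulo of the key's hashCode), with the Spark usage shown as a comment:

```scala
// Plain-Scala illustration of: partition = Hash(key) % no. of partitions
def hashPartition(key: Any, numPartitions: Int): Int = {
  val raw = key.hashCode % numPartitions
  // % can be negative in Scala/Java; fold back into 0..numPartitions-1
  if (raw < 0) raw + numPartitions else raw
}

println(hashPartition("spark", 3)) // a stable partition index in 0..2

// In the spark-shell, the same idea via the built-in partitioner:
// val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
// val repartitioned = pairs.partitionBy(new org.apache.spark.HashPartitioner(2))
```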
Custom Partitioner
● Partition the data according to your data structure
● Custom partitioning allows control over the number of partitions and the distribution of data across partitions when grouping or reducing is done
Custom partition example
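The custom partition example slide is not reproduced here; a hypothetical partitioner (the name and scheme are illustrative, not from the original slides) that routes keys by their first letter:

```scala
import org.apache.spark.Partitioner

// Keys that start with the same letter land in the same partition
class FirstLetterPartitioner(parts: Int) extends Partitioner {
  override def numPartitions: Int = parts
  override def getPartition(key: Any): Int =
    // Char codes are non-negative, so the modulo is a valid partition index
    key.toString.headOption.getOrElse(' ').toInt % parts
}

// Usage in the spark-shell:
// val pairs = sc.parallelize(Seq(("apple", 1), ("banana", 2), ("avocado", 3)))
// pairs.partitionBy(new FirstLetterPartitioner(2)) // "a" words grouped together
```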
Look up operation
● Partitioning allows faster lookups
● The lookup operation finds the values for a given key
● Using the partitioner, lookup determines which partition to look in
● Then it only needs to look in that partition
● If no partitioner is specified, it falls back to a filter over all partitions
Lookup example
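The lookup example slide is not reproduced here; a minimal sketch, assuming the spark-shell where `sc` is predefined:

```scala
import org.apache.spark.HashPartitioner

// With a partitioner set, lookup scans only the partition that owns the key;
// without one, it degrades to a filter over every partition
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  .partitionBy(new HashPartitioner(2))
println(pairs.lookup("a")) // the values stored under key "a"
```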
Laziness
Parent (Dependency)
● Each RDD has access to its parent RDD
● Nil is the value of the parent for the first RDD
● Before computing its value, an RDD always computes its parent
● This chain of computation allows for laziness
Sub classing
● Each Spark operator creates an instance of a specific subclass of RDD
● The map operator results in a MappedRDD, flatMap in a FlatMappedRDD, etc.
● Subclassing allows an RDD to remember the operation performed in the transformation
RDD transformations
val dataRDD = sc.textFile(args(1))
val splitRDD = dataRDD.flatMap(value => value.split(" "))

splitRDD: FlatMappedRDD -> dataRDD: MappedRDD -> Hadoop RDD -> Nil
Laziness example
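The laziness example slide is not reproduced here; a small sketch, assuming the spark-shell where `sc` is predefined:

```scala
// No work happens until an action is called
val data = sc.parallelize(1 to 10)
val doubled = data.map { x => println(s"computing $x"); x * 2 } // prints nothing yet
println(doubled.toDebugString) // shows the parent chain down to the first RDD
doubled.collect()              // only now does the map function actually run
```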
Compute
● compute is the function for evaluating each partition of an RDD
● compute is an abstract method of RDD
● Each subclass of RDD, like MappedRDD or FilteredRDD, has to override this method
RDD actions
val dataRDD = sc.textFile(args(1))
val flatMapRDD = dataRDD.flatMap(value => value.split(" "))
flatMapRDD.collect()

runJob -> compute (FlatMap RDD) -> compute (Mapped RDD) -> compute (Hadoop RDD) -> Nil
runJob API
● The runJob API of RDD is the API used to implement actions
● runJob takes each partition and lets you evaluate it
● All Spark actions internally use the runJob API
Run job example
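The run job example slide is not reproduced here; a minimal sketch, assuming the spark-shell where `sc` is predefined:

```scala
// Count the rows in each partition with runJob,
// the same low-level API that built-in actions use
val rdd = sc.parallelize(1 to 100, 4)
val counts: Array[Int] = sc.runJob(rdd, (iter: Iterator[Int]) => iter.size)
println(counts.toList) // one count per partition
```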
Caching
● cache internally uses the persist API
● persist sets a specific storage level for a given RDD
● The Spark context tracks persistent RDDs
● On first evaluation, partitions are put into memory by the block manager
Block manager
● Handles all in-memory data in Spark
● Responsible for
○ Cached data (BlockRDD)
○ Shuffle data
○ Broadcast data
● A partition is stored in a block with id (RDD.id, partition_index)
How caching works?
● The partition iterator checks the storage level
● If a storage level is set, it calls cacheManager.getOrCompute(partition)
● As the iterator runs for every RDD evaluation, caching is transparent to the user
Caching example
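The caching example slide is not reproduced here; a small sketch, assuming the spark-shell and an input file path of your own (`input.txt` is a placeholder):

```scala
import org.apache.spark.storage.StorageLevel

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
val words = sc.textFile("input.txt").flatMap(_.split(" "))
words.persist(StorageLevel.MEMORY_ONLY)
words.count() // first action: partitions are computed and stored by the block manager
words.count() // subsequent actions read the cached partitions from memory
```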
Extending Spark API
Why?
● Domain-specific operators
○ Allow developers to express domain-specific calculations in a cleaner way
○ Improve code readability
○ Easy to maintain
● Domain-specific RDDs
○ A better way of expressing domain data
○ Control over partitioning and distribution
DSL Example
● salesRecordRDD: RDD[SalesRecord]
● To compute the sum of sales
○ In plain Spark
■ salesRecord.map(_.itemValue).sum
○ In our DSL
■ salesRecord.totalSales
● Our DSL hides the internal representation and improves readability
How to Extend
● Custom operators on RDDs
○ Domain-specific operators for specific RDDs
○ Uses the Scala implicit mechanism
○ Feels and works like a built-in operator
● Custom RDDs
○ Extend the RDD API to create our own RDD
○ Combined with custom operators, this is very powerful
Implicits in Scala
● A way of extending types on the fly
● Implicits are also used to pass parameters to functions which are read from the environment
● In our example, we just use the type extension facility
● All implicits are checked at compile time
Implicits example
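The implicits example slide is not reproduced here; a plain-Scala illustration of type extension via an implicit class (the names are illustrative):

```scala
object ImplicitsExample {
  // Adds an isEven "method" to Int without touching Int itself
  implicit class RichInt(val n: Int) extends AnyVal {
    def isEven: Boolean = n % 2 == 0
  }
}

import ImplicitsExample._
println(4.isEven) // the compiler rewrites this as new RichInt(4).isEven
```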
Adding operators to RDD’s
● We use the Scala implicit facility to add custom operators to our RDDs
● These operators only show up on our RDDs
● All implicit conversions are handled by Scala, not by Spark
● Spark internally uses similar tricks for PairRDDs
Custom operator Example
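The custom operator example slide is not reproduced here; a sketch of adding the totalSales operator from the DSL slide via an implicit class (SalesRecord's fields are assumed from the earlier example):

```scala
import org.apache.spark.rdd.RDD

case class SalesRecord(id: String, itemValue: Double)

object SalesOps {
  // Scala, not Spark, performs this conversion at compile time
  implicit class SalesRDDFunctions(rdd: RDD[SalesRecord]) {
    def totalSales: Double = rdd.map(_.itemValue).sum()
  }
}

// Usage in the spark-shell:
// import SalesOps._
// salesRecordRDD.totalSales  // same as salesRecordRDD.map(_.itemValue).sum
```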
Extending RDD
● Extending the RDD API allows us to create our own custom RDD structure
● Custom RDDs allow control over computation
● You can change partitioning, locality and evaluation depending on your requirements
Discount RDD example
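The discount RDD example slide is not reproduced here; a sketch of a custom RDD that applies a discount to every record, along the lines of the example in the linked anatomy-of-rdd repository (field names are assumed):

```scala
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

case class SalesRecord(id: String, itemValue: Double)

class DiscountRDD(prev: RDD[SalesRecord], discountPercentage: Double)
  extends RDD[SalesRecord](prev) {

  // Keep the parent's partitioning unchanged
  override def getPartitions: Array[Partition] =
    firstParent[SalesRecord].partitions

  // compute is invoked once per partition when the RDD is evaluated
  override def compute(split: Partition, context: TaskContext): Iterator[SalesRecord] =
    firstParent[SalesRecord].iterator(split, context).map { record =>
      record.copy(itemValue = record.itemValue * (1 - discountPercentage))
    }
}
```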
