RDD Deep Dive
• RDD Basics
• How to create
• RDD Operations
• Lineage
• Partitions
• Shuffle
• Types of RDDs
• Extending RDD
• Caching in RDD
RDD Basics
• RDD (Resilient Distributed Dataset)
• Distributed collection of objects
• Resilient – ability to re-compute missing partitions (node failure)
• Distributed – split across multiple partitions
• Dataset – can contain any type: Python/Java/Scala objects or user-defined objects
• The fundamental unit of data in Spark
RDD Basics – How to create
Two ways:
 Loading external datasets
 Spark supports a wide range of sources
 Accesses HDFS data through Hadoop's InputFormat & OutputFormat
 Supports custom Input/Output formats
 Parallelizing a collection in the driver program
val lineRDD = sc.textFile("hdfs:///path/to/Readme.md")
textFile("/my/directory/*") or textFile("/my/directory/*.gz")
SparkContext.wholeTextFiles returns (filename, content) pairs
val listRDD = sc.parallelize(List("spark", "meetup", "deepdive"))
RDD Operations
 Two types of operations
 Transformation
 Action
 Transformations are lazy; nothing actually happens until an action is called.
 An action triggers the computation.
 Actions return values to the driver or write data to external storage.
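A minimal sketch of the two kinds of operations, assuming a running SparkContext sc (the output path is illustrative):

val nums = sc.parallelize(1 to 10)      // create an RDD
val doubled = nums.map(_ * 2)           // transformation: recorded, nothing runs yet
val big = doubled.filter(_ > 10)        // another lazy transformation
val result = big.collect()              // action: triggers the computation, returns Array[Int] to the driver
big.saveAsTextFile("hdfs:///tmp/big")   // action: writes data to external storage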
Lazy Evaluation
 Transformations on an RDD don't get performed immediately
 Spark internally records metadata to track the operations
 Loading data into an RDD is also lazily evaluated
 Lazy evaluation reduces the number of passes over the data by grouping operations
 MapReduce – the burden is on the developer to merge operations into a complex map.
 Without persisting an RDD, its complete lineage is re-computed every time it is reused.
RDD In Action
sc.textFile("hdfs://file.txt")
.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
.collect()

Input:
I scream you scream lets all scream for icecream!
I wish I were what I was when I wished I were what I am.

After flatMap (first line): I, scream, you, scream, lets, all, scream, for, icecream
After map: (I,1), (scream,1), (you,1), (scream,1), (lets,1), (all,1), (scream,1), (for,1), (icecream,1)
After reduceByKey: (I,1), (scream,3), (you,1), (lets,1), (all,1), (for,1), (icecream,1)
Lineage Demo
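A small sketch of what such a demo might show: toDebugString prints the lineage Spark has recorded for an RDD (exact output varies by version).

val counts = sc.textFile("hdfs:///path/to/Readme.md")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
println(counts.toDebugString)
// e.g. ShuffledRDD <- MapPartitionsRDD <- MapPartitionsRDD <- HadoopRDD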
RDD Partition
 Partition definition
 Fragments of an RDD
 Fragmentation allows Spark to execute in parallel
 Partitions are distributed across the cluster (Spark workers)
 Partitioning
 Impacts parallelism
 Impacts performance
Importance of Partition Tuning
 Too few partitions
 Less concurrency, unused cores
 More susceptible to data skew
 Increased memory pressure for groupBy, reduceByKey, sortByKey, etc.
 Too many partitions
 Framework overhead (more scheduling latency than the time needed for the actual task)
 More CPU context switching
 Need a "reasonable number" of partitions (see the sketch below)
 Commonly between 100 and 10,000 partitions
 Lower bound: at least ~2x the number of cores in the cluster
 Upper bound: ensure tasks take at least 100ms
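A quick way to check these bounds in practice, assuming an existing rdd (a sketch, not a tuning recipe):

val cores = sc.defaultParallelism      // roughly the total cores available
println(rdd.partitions.length)         // current number of partitions
val tuned = rdd.repartition(cores * 2) // apply the ~2x-cores lower bound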
How Spark Partitions Data
 Input data partitioning
 Shuffle transformations
 Custom Partitioner
Partition – Input Data
 Spark uses the same classes as Hadoop to perform input/output
 sc.textFile("hdfs://…") invokes Hadoop's TextInputFormat
 The knobs below define the number of partitions:
 dfs.block.size – default 128MB (Hadoop 2.0)
 numPartitions – can be used to increase the number of partitions; default is 0, which means 1 partition
 mapreduce.input.fileinputformat.split.minsize – default 1KB
 Partition size = Max(minSize, Min(goalSize, blockSize))
 goalSize = totalInputSize / Max(numPartitions, 1)
 Block 32MB, numPartitions 0, minSize 1KB, 640MB total – defaults:
 Max(1KB, Min(640MB, 32MB)) = 32MB splits → 20 partitions
 Block 32MB, numPartitions 30, minSize 1KB, 640MB total – want more partitions:
 goalSize ≈ 21.3MB, Max(1KB, Min(21.3MB, 32MB)) ≈ 21.3MB splits → 30 partitions
 Block 32MB, numPartitions 5, minSize 1KB – Max(1KB, Min(128MB, 32MB)) = 32MB splits → 20 partitions (the block size caps the split, so you still get more, smaller partitions than requested)
 Block 32MB, numPartitions 0, minSize 64MB – Max(64MB, Min(640MB, 32MB)) = 64MB splits → 10 bigger partitions
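The split-size rule above can be sketched directly (this mirrors Hadoop's FileInputFormat logic; it is not a Spark API):

// splitSize = max(minSize, min(goalSize, blockSize)), goalSize = totalSize / numPartitions
def splitSize(minSize: Long, blockSize: Long, totalSize: Long, numPartitions: Int): Long = {
  val goalSize = totalSize / math.max(numPartitions, 1)
  math.max(minSize, math.min(goalSize, blockSize))
}
val kb = 1024L; val mb = 1024 * kb
640 * mb / splitSize(1 * kb, 32 * mb, 640 * mb, 0)  // = 20 partitions (defaults)
640 * mb / splitSize(64 * mb, 32 * mb, 640 * mb, 0) // = 10 partitions (bigger minSize)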
Partition – Shuffle Transformations
 All shuffle transformations take a parameter for the desired number of partitions
 Default behavior – Spark uses HashPartitioner
 If spark.default.parallelism is set, takes that as the number of partitions
 If spark.default.parallelism is not set, uses the largest upstream RDD's number of partitions
 Reduces the chance of out-of-memory errors

Shuffle transformations:
1. groupByKey
2. reduceByKey
3. aggregateByKey
4. sortByKey
5. join
6. cogroup
7. cartesian
8. coalesce
9. repartition
10. repartitionAndSortWithinPartitions
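Each of these accepts an explicit partition count, for example (a sketch):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val counts = pairs.reduceByKey(_ + _, 8) // request 8 output partitions
println(counts.partitions.length)        // 8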
Partition – Repartitioning
 RDD provides two operators
 repartition(numPartitions)
 Can increase/decrease the number of partitions
 Internally does a shuffle
 Expensive due to the shuffle
 For decreasing partitions, use coalesce
 coalesce(numPartitions, shuffle: [true/false])
 Decreases partitions
 Uses narrow dependencies
 Avoids a shuffle
 A drastic reduction may trigger a shuffle
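A sketch of the two operators side by side:

val rdd = sc.parallelize(1 to 1000, 100)        // start with 100 partitions
val more = rdd.repartition(200)                 // shuffle: can grow or shrink
val fewer = rdd.coalesce(10)                    // narrow dependency, no shuffle
val rebalanced = rdd.coalesce(10, shuffle = true) // force a shuffle to rebalance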
Custom Partitioner
 Partitions the data according to the use case & data structure
 Custom partitioning allows control over the number of partitions and the distribution of data
 Extend the Partitioner class; need to implement getPartition & numPartitions
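A hypothetical custom partitioner, as a sketch only: it pins keys starting with "spark" to partition 0 and hashes everything else across the remaining partitions.

import org.apache.spark.Partitioner

class DomainPartitioner(partitions: Int) extends Partitioner {
  require(partitions > 1, "need at least 2 partitions")
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = key match {
    case s: String if s.startsWith("spark") => 0 // domain rule: hot keys together
    case other => // non-negative hash into the remaining partitions
      val mod = other.hashCode % (partitions - 1)
      1 + (if (mod < 0) mod + (partitions - 1) else mod)
  }
}
// usage: pairRDD.partitionBy(new DomainPartitioner(8))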
Partitioning Demo
Shuffle – GroupByKey vs ReduceByKey
val wordCountsWithGroup = rdd
  .groupByKey()
  .map(t => (t._1, t._2.sum))
  .collect()
Shuffle – GroupByKey vs ReduceByKey
val wordPairsRDD = rdd.map(word => (word, 1))
val wordCountsWithReduce = wordPairsRDD
  .reduceByKey(_ + _)
  .collect()

reduceByKey combines values on the map side before the shuffle, so far less data crosses the network than with groupByKey, which shuffles every (word, 1) pair.
The Shuffle
 Redistribution of data among partitions between stages
 Most of the performance, reliability, and scalability issues in Spark occur within the shuffle
 Like MapReduce, the Spark shuffle uses a pull model
 Has consistently evolved and is still an area of active work in Spark
Shuffle Overview
• Spark runs a job stage by stage.
• Stages are built up by the DAGScheduler according to the RDDs' ShuffleDependency
• e.g. ShuffledRDD / CoGroupedRDD will have a ShuffleDependency
• Many operators create a ShuffledRDD / CoGroupedRDD under the hood
• repartition / combineByKey / groupBy / reduceByKey / cogroup
• Many other operators call further into the above operators
• e.g. the various join operators call cogroup.
• Each ShuffleDependency maps to one stage in a Spark job and leads to a shuffle.
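This is easy to observe, assuming a SparkContext sc (a sketch):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
val grouped = pairs.groupByKey()   // builds a ShuffledRDD under the hood
println(grouped.dependencies.head) // a ShuffleDependency => stage boundary
println(grouped.toDebugString)     // lineage shows the stage split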
You have seen this
[Figure: the classic Spark DAG – RDDs A through G linked by map, union, groupBy, and join, with shuffle boundaries splitting the graph into Stage 1, Stage 2, and Stage 3]
Shuffle is Expensive
• When doing a shuffle, data no longer stays only in memory; it gets written to disk.
• For Spark, the shuffle process might involve:
• Data partitioning: which might involve very expensive data sorting work etc.
• Data ser/deser: to enable data to be transferred through the network or across processes.
• Data compression: to reduce IO bandwidth etc.
• Disk IO: probably multiple times on one single data block
• e.g. shuffle spill, merge combine
Shuffle History
 The shuffle module in Spark has evolved over time.
 Spark 0.6-0.7 – same code path as RDD's persist method; MEMORY_ONLY and DISK_ONLY options available.
 Spark 0.8-0.9
 - Separate code path for shuffle: ShuffleBlockManager & BlockObjectWriter for shuffle only.
 - Shuffle optimization – consolidated shuffle write.
 Spark 1.0 – introduced a pluggable shuffle framework.
 Spark 1.1 – sort-based shuffle implementation.
 Spark 1.2 – Netty transfer implementation; sort-based shuffle is now the default.
 Spark 1.2+ – external shuffle service etc.
Understanding Shuffle
 Input aggregation
 Types of shuffle
 Hash based
 Basic Hash Shuffle
 Consolidate Hash Shuffle
 Sort-Based Shuffle
Input Aggregation
 Like MapReduce, Spark aggregates (combines) on the map side.
 Aggregation is done in ShuffleMapTask using:
 AppendOnlyMap (in-memory hash table combiner)
 Keys are never removed; values get updated
 ExternalAppendOnlyMap (in-memory and on-disk hash table combiner)
 An append-only hash map that spills data to disk when memory is insufficient
 Shuffle file in-memory buffer – the shuffle writes to an in-memory buffer before writing to a shuffle file.
Shuffle Types – Basic Hash Shuffle
 Hash-based shuffle (spark.shuffle.manager) hash-partitions the data for the reducers
 Each map task writes each bucket to a file
 #Map tasks = M
 #Reduce tasks = R
 #Shuffle files = M*R, #In-memory buffers = M*R
Shuffle Types – Basic Hash Shuffle
 Problem
 Let's use 100KB as the buffer size
 We have 10,000 reducers
 10 mapper tasks per executor
 In-memory buffer size = 100KB * 10,000 * 10
 Buffer needed: 10GB per executor
 This huge amount of buffer is not acceptable; this implementation can't support 10,000 reducers.
Shuffle Types – Consolidate Hash Shuffle
 Solution to decrease the in-memory buffer size and the number of files.
 Within an executor, map tasks write each bucket to a segment of a shared file.
 #Shuffle files per executor = #Reducers
 #In-memory buffers per executor = #Reducers
Shuffle Types – Sort-Based Shuffle
 Consolidate Hash Shuffle still needs one file for each reducer.
 - Total C*R intermediate files, where C = # of executors running map tasks
 Still too many files (e.g. ~10k reducers)
 Needs significant memory for compression & serialization buffers
 Too-many-open-files issues
 Sort-based shuffle is similar to the map-side shuffle of MapReduce
 Introduced in Spark 1.1; now the default shuffle
Shuffle Types – Sort-Based Shuffle
 Map output records from each task are kept in memory as long as they fit.
 Once memory is full, the data gets sorted by partition and spilled to a single file.
 Each map task generates one data file and one index file
 Utilizes an external sorter to do the sort work
 If a map-side combiner is required, data is sorted by key and partition; otherwise only by partition
 With #reducers <= 200 and no sorting needed, a hash approach is used: generate a file per reducer and merge them into a single file
Shuffle Reader
 On the read side, both sort and hash shuffle use the hash shuffle reader
 On the reducer side, a set of threads fetches remote map output blocks
 Once a block arrives, its records are de-serialized and passed into a result queue
 Records are passed to ExternalAppendOnlyMap; for ordering operations like sortByKey, records are passed to ExternalSorter
[Figure: map-side buckets fetched and fed through per-task aggregators into the reduce tasks]
Types of RDDs – The RDD Interface
Base for all RDDs (RDD.scala), consisting of:
 A set of partitions ("splits" in Hadoop)
 A list of dependencies on parent RDDs
 A function to compute a partition from its parents
 Optional preferred locations for each partition
 An optional Partitioner that defines the partitioning strategy (hash/range)
 Basic operations like map, filter, persist, etc.
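Abridged, that interface in RDD.scala looks roughly like this (signatures simplified):

abstract class RDD[T] {
  protected def getPartitions: Array[Partition]                      // the set of partitions
  protected def getDependencies: Seq[Dependency[_]]                  // parents in the lineage
  def compute(split: Partition, context: TaskContext): Iterator[T]   // per-partition work
  protected def getPreferredLocations(split: Partition): Seq[String] // locality hints
  val partitioner: Option[Partitioner]                               // hash/range strategy, if any
}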
Example: HadoopRDD
 partitions = one per HDFS block
 dependencies = none
 compute(partition) = read the corresponding block
 preferredLocations(partition) = HDFS block locations
 partitioner = none
Example: MapPartitionsRDD
 partitions = same as the parent's partitions
 dependencies = "one-to-one" on the parent RDD
 compute(partition) = apply the map function to the parent's records
 preferredLocations(partition) = none (ask the parent)
 partitioner = none
Example: CoGroupedRDD
 partitions = one per reduce task
 dependencies = can be narrow or wide
 compute(partition) = read and join the shuffled data
 preferredLocations(partition) = none
 partitioner = HashPartitioner(numTasks)
Extending RDDs
Extend RDDs to:
 Add domain-specific transformations/actions
 Lets developers express domain-specific calculations in a cleaner way
 Improves code readability
 Easier to maintain
 Build a domain-specific RDD
 Better way to express domain-specific data
 Better control over partitioning and distribution
 Add a new input data source
How to Extend
 Add custom operators to RDD
 Use Scala implicits
 Feels and works like a built-in operator
 You can add operators to a specific RDD type or to all RDDs
 Custom RDD
 Extend the RDD API to create our own RDD
 Implement the compute & getPartitions abstract methods
Implicit Class
 Creates extension methods on an existing type
 Introduced in Scala 2.10
 Implicits are compile-time checked; an implicit class gets desugared into a class definition plus an implicit conversion
 We will use an implicit class to add a new method to RDD
Adding a New Operator to RDD
 We will use Scala's implicit feature to add a new operator to an existing RDD (see the sketch below)
 This operator will show up only on our RDD
 The implicit conversion is handled by the Scala compiler
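A sketch of such an operator; countWords is a hypothetical name, added only to RDD[String]:

import org.apache.spark.rdd.RDD

object CustomOps {
  // Implicit class: wraps an RDD[String] and exposes the new operator
  implicit class RichStringRDD(rdd: RDD[String]) {
    def countWords(): RDD[(String, Int)] =
      rdd.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
  }
}
// usage: import CustomOps._ ; sc.textFile("...").countWords()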
Custom RDD Implementation
 Extending RDD allows you to create your own custom RDD structure
 A custom RDD gives control over the computation and lets you change partitioning & locality information
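A minimal sketch, using a hypothetical UpperCaseRDD that upper-cases every record of its parent:

import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

class UpperCaseRDD(parent: RDD[String]) extends RDD[String](parent) {
  // compute: transform the parent's records for this partition
  override def compute(split: Partition, context: TaskContext): Iterator[String] =
    parent.iterator(split, context).map(_.toUpperCase)
  // getPartitions: reuse the parent's partitioning unchanged
  override protected def getPartitions: Array[Partition] = parent.partitions
}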
Caching in RDD
 Spark allows caching/persisting an entire dataset in memory
 Persisting an RDD in the cache
 The first time it is computed, it is kept in memory
 The cached partitions are reused in the next set of operations
 Fault-tolerant; recomputed in case of failure
 Caching is a key tool for interactive and iterative algorithms
 persist supports different storage levels
 Storage level – in memory, disk, or both; Tachyon
 Serialized vs deserialized
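A sketch of persisting with an explicit storage level (the input path is illustrative):

import org.apache.spark.storage.StorageLevel

val lengths = sc.textFile("hdfs:///path/to/data").map(_.length)
lengths.persist(StorageLevel.MEMORY_AND_DISK) // cache() would mean MEMORY_ONLY
lengths.count() // first action: computes and caches the partitions
lengths.sum()   // second action: served from the cache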
Caching in RDD
 The SparkContext tracks persistent RDDs
 The BlockManager puts partitions in memory when first evaluated
 Caching is lazy; no caching happens without an action
 The shuffle also keeps its output around after shuffle operations
 We still need to cache shuffled RDDs explicitly
Caching Demo