IBM Spark Meetup - RDD & Spark Basics
Satya Narayan
Deep dive into Spark RDD
1.
© 2015 IBM Corporation
RDD Deep Dive
• RDD Basics
• How to create
• RDD Operations
• Lineage
• Partitions
• Shuffle
• Types of RDDs
• Extending RDD
• Caching in RDD
2.
RDD Basics
• RDD (Resilient Distributed Dataset): a distributed collection of objects
• Resilient: missing partitions can be recomputed (e.g. after a node failure)
• Distributed: split across multiple partitions
• Dataset: can contain any type, whether Python/Java/Scala objects or user-defined objects
• Fundamental unit of data in Spark
3.
RDD Basics – How to create
Two ways:
• Loading external datasets
  - Spark supports a wide range of sources
  - HDFS data is accessed through Hadoop's InputFormat and OutputFormat; custom input/output formats are supported
      val lineRDD = sc.textFile("hdfs:///path/to/Readme.md")
  - Directories, wildcards, and compressed files also work: textFile("/my/directory/*") or textFile("/my/directory/*.gz")
  - SparkContext.wholeTextFiles returns (filename, content) pairs
• Parallelizing a collection in the driver program
      val listRDD = sc.parallelize(List("spark", "meetup", "deepdive"))
4.
RDD Operations
Two types of operations:
• Transformation
• Action
Transformations are lazy: nothing actually happens until an action is called.
An action triggers the computation.
An action returns values to the driver or writes data to external storage.
5.
Lazy Evaluation
• Transformations on an RDD are not performed immediately
• Spark internally records metadata to track the operations
• Loading data into an RDD is also lazily evaluated
• Lazy evaluation reduces the number of passes over the data by grouping operations
• In MapReduce, the burden is on the developer to merge operations, which leads to complex map functions
• If an RDD is not persisted, its complete lineage is recomputed every time it is used
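The behaviour above can be imitated without a cluster. The following is only an analogy in plain Scala collections, not Spark code: a collection view stands in for an RDD, `map` for a transformation, and a fold for an action. The object name `LazyAnalogy` and the counter are made up for this sketch.

```scala
object LazyAnalogy {
  // Returns how many times the mapping function had run:
  // (after defining the map, after the first traversal, after the second).
  def run(): (Int, Int, Int) = {
    var evaluations = 0
    val data = List(1, 2, 3, 4)

    // Like a transformation: defining the mapped view runs nothing.
    val doubled = data.view.map { x => evaluations += 1; x * 2 }
    val afterMap = evaluations

    // Like an action: traversing the view finally applies the function.
    val total = doubled.foldLeft(0)(_ + _)
    val afterFirst = evaluations

    // Nothing was persisted, so a second traversal recomputes every element.
    val again = doubled.foldLeft(0)(_ + _)
    val afterSecond = evaluations

    require(total == 20 && again == 20)
    (afterMap, afterFirst, afterSecond)
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The function runs zero times when the view is defined, four times for the first traversal, and four more for the second, which mirrors the last bullet: without persisting, every use pays for the whole chain again.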
6.
RDD In Action
    sc.textFile("hdfs://file.txt")
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collect()
Input lines:
    I scream you scream lets all scream for icecream!
    I wish I were what I was when I wished I were what I am.
Words (first line):
    I scream you scream lets all scream for icecream
Pairs:
    (I,1) (scream,1) (you,1) (scream,1) (lets,1) (all,1) (scream,1) (icecream,1)
Reduced counts:
    (icecream,1) (scream,3) (you,1) (lets,1) (I,1) (all,1)
7.
Lineage Demo
8.
RDD Partition
Partition definition
• Fragments of an RDD
• Fragmentation allows Spark to execute in parallel
• Partitions are distributed across the cluster (Spark workers)
Partitioning
• Impacts parallelism
• Impacts performance
9.
Importance of Partition Tuning
Too few partitions
• Less concurrency, unused cores
• More susceptible to data skew
• Increased memory pressure for groupBy, reduceByKey, sortByKey, etc.
Too many partitions
• Framework overhead (scheduling latency exceeds the time needed for the actual task)
• A lot of CPU context switching
Need a "reasonable number" of partitions
• Commonly between 100 and 10,000 partitions
• Lower bound: at least ~2x the number of cores in the cluster
• Upper bound: ensure tasks take at least 100 ms
10.
How Spark Partitions Data
• Input data partition
• Shuffle transformations
• Custom Partitioner
11.
Partition - Input Data
• Spark uses the same classes as Hadoop to perform input/output
• sc.textFile("hdfs://…") invokes Hadoop's TextInputFormat
Knobs that define the number of partitions:
• dfs.block.size: default 128MB (Hadoop 2.0)
• numPartitions: can be used to increase the number of partitions; the default is 0, which means 1 partition
• mapreduce.input.fileinputformat.split.minsize: default 1KB
Partition size = Max(minSize, Min(goalSize, blockSize)), where goalSize = totalInputSize / numPartitions
Examples (block size, numPartitions, minSize), each with 640MB of total input:
• 32MB, 0, 1KB (defaults): Max(1KB, Min(640MB, 32MB)) = 32MB, giving 20 partitions
• 32MB, 30, 1KB (want more partitions): Max(1KB, Min(21.3MB, 32MB)) = 21.3MB, giving about 30 partitions
• 32MB, 5, 1KB (want bigger partitions): Max(1KB, Min(128MB, 32MB)) = 32MB, still 20 partitions because the block size caps the split
• 32MB, 0, 64MB (bigger partitions via minSize): Max(64MB, Min(640MB, 32MB)) = 64MB, giving 10 partitions
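The split-size rule on this slide can be checked with a few lines of plain Scala. This is a sketch of the slide's formula for a single input file, not Hadoop's or Spark's actual code; the object and method names are invented here, and the 1.1 tolerance used when counting the last split is an assumption modelled on Hadoop's FileInputFormat.

```scala
object SplitSizing {
  val KB: Long = 1024L
  val MB: Long = 1024L * KB

  // goalSize = totalInputSize / numPartitions, treating 0 as 1 as the slide describes.
  def goalSize(totalSize: Long, numPartitions: Int): Long =
    totalSize / math.max(numPartitions, 1)

  // Partition size = Max(minSize, Min(goalSize, blockSize)).
  def splitSize(totalSize: Long, blockSize: Long, numPartitions: Int, minSize: Long): Long =
    math.max(minSize, math.min(goalSize(totalSize, numPartitions), blockSize))

  // Cuts the input into splits of splitSize; a final remainder of up to
  // 1.1x the split size stays in one split (assumed tolerance).
  def numSplits(totalSize: Long, blockSize: Long, numPartitions: Int, minSize: Long): Int = {
    val size = splitSize(totalSize, blockSize, numPartitions, minSize)
    var remaining = totalSize
    var count = 0
    while (remaining.toDouble / size > 1.1) {
      remaining -= size
      count += 1
    }
    if (remaining > 0) count += 1
    count
  }

  def main(args: Array[String]): Unit = {
    val total = 640 * MB
    val block = 32 * MB
    println(numSplits(total, block, 0, KB))       // defaults
    println(numSplits(total, block, 30, KB))      // ask for more partitions
    println(numSplits(total, block, 5, KB))       // ask for fewer: block size still caps the split
    println(numSplits(total, block, 0, 64 * MB))  // raise minSize for bigger splits
  }
}
```

Under these assumptions the four calls reproduce the slide's four scenarios: 20, 30, 20, and 10 partitions.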
12.
Partition - Shuffle Transformations
• All shuffle transformations provide a parameter for the desired number of partitions
• Default behavior: Spark uses HashPartitioner
  - If spark.default.parallelism is set, that value is used as the number of partitions
  - If spark.default.parallelism is not set, the partition count of the largest upstream RDD is used, which reduces the chance of running out of memory
Shuffle transformations:
1. groupByKey
2. reduceByKey
3. aggregateByKey
4. sortByKey
5. join
6. cogroup
7. cartesian
8. coalesce
9. repartition
10. repartitionAndSortWithinPartitions
13.
Partition - Repartitioning
RDD provides two operators:
• repartition(numPartitions)
  - Can increase or decrease the number of partitions
  - Internally performs a shuffle, so it is expensive
  - To decrease the number of partitions, prefer coalesce
• coalesce(numPartitions, shuffle = true/false)
  - Decreases the number of partitions
  - Uses narrow dependencies
  - Avoids a shuffle
  - For a drastic reduction, a shuffle (shuffle = true) may be needed
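The difference between the two operators can be pictured with a toy model in plain Scala, where an "RDD" is just a vector of partitions. This is not how Spark implements either operator (its grouping and distribution strategies differ); the names are invented for illustration.

```scala
object RepartitionModel {
  type Partition[A] = Vector[A]

  // Narrow dependency: each parent partition is assigned whole to one child,
  // so no individual record has to be redistributed.
  def coalesce[A](parts: Vector[Partition[A]], n: Int): Vector[Partition[A]] = {
    require(n > 0 && n <= parts.size, "this model only decreases the partition count")
    parts.zipWithIndex
      .groupBy { case (_, i) => i * n / parts.size }
      .toVector
      .sortBy(_._1)
      .map { case (_, group) => group.sortBy(_._2).flatMap(_._1) }
  }

  // Wide dependency: every record may move; round-robin stands in for the shuffle.
  def repartition[A](parts: Vector[Partition[A]], n: Int): Vector[Partition[A]] = {
    require(n > 0)
    val all = parts.flatten.zipWithIndex
    Vector.tabulate(n)(p => all.collect { case (x, i) if i % n == p => x })
  }

  def main(args: Array[String]): Unit = {
    val parts = Vector(Vector(1, 2), Vector(3), Vector(4, 5), Vector(6))
    println(coalesce(parts, 2))    // parents 0 and 1 merge, parents 2 and 3 merge
    println(repartition(parts, 3)) // records are spread across 3 new partitions
  }
}
```

In the coalesce result every original partition survives intact inside one child, which is why no shuffle is needed; in the repartition result records from the same parent end up in different children.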
14.
Custom Partitioner
• Partition the data according to the use case and data structure
• Custom partitioning gives control over the number of partitions and the distribution of data
• Extend the Partitioner class and implement getPartition and numPartitions
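A minimal sketch of the two members a partitioner has to provide. To keep it runnable without Spark on the classpath, the abstract class below is a local stand-in with the same two members as Spark's Partitioner; `SimpleHashPartitioner` and `FirstLetterPartitioner` are hypothetical examples, the latter using a made-up routing rule.

```scala
object PartitionerSketch {
  // Local stand-in mirroring the two abstract members of Spark's Partitioner.
  abstract class Partitioner {
    def numPartitions: Int
    def getPartition(key: Any): Int
  }

  // Hash partitioning: non-negative hashCode modulo the partition count.
  class SimpleHashPartitioner(val numPartitions: Int) extends Partitioner {
    def getPartition(key: Any): Int = key match {
      case null => 0
      case k =>
        val mod = k.hashCode % numPartitions
        if (mod < 0) mod + numPartitions else mod
    }
  }

  // Custom rule (hypothetical): keys starting with a-m, n-z, and everything else.
  class FirstLetterPartitioner extends Partitioner {
    val numPartitions: Int = 3
    def getPartition(key: Any): Int =
      Option(key).map(_.toString).flatMap(_.headOption).map(_.toLower) match {
        case Some(c) if c >= 'a' && c <= 'm' => 0
        case Some(c) if c >= 'n' && c <= 'z' => 1
        case _                               => 2
      }
  }

  def main(args: Array[String]): Unit = {
    val byLetter = new FirstLetterPartitioner
    Seq("spark", "meetup", "deepdive", "42").foreach { k =>
      println(s"$k -> partition ${byLetter.getPartition(k)}")
    }
  }
}
```

With Spark available, the same two methods would go into a class extending the real Partitioner, and the instance would be handed to an operator such as partitionBy on a pair RDD.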
15.
Partitioning Demo
16.
Shuffle - groupByKey vs reduceByKey
    // groupByKey needs a pair RDD, so build (word, 1) pairs first
    val wordPairsRDD = rdd.map(word => (word, 1))
    val wordCountsWithGroup = wordPairsRDD
      .groupByKey()
      .map(t => (t._1, t._2.sum))
      .collect()
17.
Shuffle - groupByKey vs reduceByKey
    val wordPairsRDD = rdd.map(word => (word, 1))
    val wordCountsWithReduce = wordPairsRDD
      .reduceByKey(_ + _)
      .collect()
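Both snippets give the same counts; what differs is how much data crosses the shuffle. The plain-Scala model below counts the records leaving each map-side partition with and without a local combine step. It is a toy illustration with invented names, not Spark's implementation.

```scala
object CombineModel {
  type Pair = (String, Int)

  // groupByKey-style: every pair leaves its partition as-is.
  def shuffledWithoutCombine(parts: Seq[Seq[Pair]]): Int =
    parts.map(_.size).sum

  // reduceByKey-style: pairs with the same key are summed inside each partition first.
  def combineLocally(part: Seq[Pair]): Map[String, Int] =
    part.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

  def shuffledWithCombine(parts: Seq[Seq[Pair]]): Int =
    parts.map(p => combineLocally(p).size).sum

  // Final counts, computed from the locally combined output.
  def wordCounts(parts: Seq[Seq[Pair]]): Map[String, Int] =
    combineLocally(parts.flatMap(p => combineLocally(p).toSeq))

  def main(args: Array[String]): Unit = {
    val parts = Seq(
      Seq("scream" -> 1, "scream" -> 1, "you" -> 1),
      Seq("scream" -> 1, "I" -> 1, "I" -> 1)
    )
    println(s"records shuffled without combine: ${shuffledWithoutCombine(parts)}")
    println(s"records shuffled with combine:    ${shuffledWithCombine(parts)}")
    println(wordCounts(parts))
  }
}
```

In this sample, six records would be shuffled without the combine step and four with it; the gap grows with the number of repeated keys per partition.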
18.
The Shuffle
• Redistribution of data among partitions between stages
• Most performance, reliability, and scalability issues in Spark occur within the shuffle
• Like MapReduce, the Spark shuffle uses a pull model
• It has evolved steadily and is still an area of research in Spark
19.
Shuffle Overview
• Spark runs a job stage by stage
• Stages are built by the DAGScheduler according to each RDD's ShuffleDependency
  - e.g. ShuffledRDD / CoGroupedRDD have a ShuffleDependency
• Many operators create a ShuffledRDD / CoGroupedRDD under the hood
  - repartition / combineByKey / groupBy / reduceByKey / cogroup
• Many other operators call into the operators above
  - e.g. the various join operators call cogroup
• Each ShuffleDependency maps to one stage in the Spark job and leads to a shuffle
20.
You have seen this
[Figure: DAG of RDDs A to G grouped into Stage 1, Stage 2, and Stage 3, with groupBy, map, union, and join operations]
21.
Shuffle is Expensive
• During a shuffle, data no longer stays in memory only; it gets written to disk
• In Spark, the shuffle process might involve:
  - Data partitioning: which might involve very expensive data sorting work
  - Data ser/deser: to enable data to be transferred over the network or across processes
  - Data compression: to reduce I/O bandwidth
  - Disk I/O: probably multiple times for a single data block, e.g. shuffle spill, merge combine
22.
Shuffle History
The shuffle module in Spark has evolved over time.
• Spark 0.6-0.7: same code path as RDD's persist method; MEMORY_ONLY and DISK_ONLY options available
• Spark 0.8-0.9: separate code for shuffle, with ShuffleBlockManager and BlockObjectWriter used for shuffle only; shuffle optimization: consolidated shuffle write
• Spark 1.0: introduced the pluggable shuffle framework
• Spark 1.1: sort-based shuffle implementation
• Spark 1.2: Netty transfer implementation; sort-based shuffle is now the default
• Spark 1.2+: external shuffle service, etc.
23.
Understanding Shuffle
• Input aggregation
• Types of shuffle
  - Hash-based
      Basic hash shuffle
      Consolidated hash shuffle
  - Sort-based shuffle
24.
Input Aggregation
• Like MapReduce, Spark aggregates (combiner) on the map side
• Aggregation is done in ShuffleMapTask using:
  - AppendOnlyMap (in-memory hash table combiner): keys are never removed, values get updated
  - ExternalAppendOnlyMap (in-memory and on-disk hash table combiner): an append-only hash map that spills data to disk when memory is insufficient
• Shuffle file in-memory buffer: the shuffle writes to an in-memory buffer before writing to a shuffle file
25.
Shuffle Types – Basic Hash Shuffle
• Hash-based shuffle (spark.shuffle.manager)
• Hash-partitions the data for the reducers
• Each map task writes each bucket to a file
• #Map tasks = M, #Reduce tasks = R
• #Shuffle files = M*R, #In-memory buffers = M*R
26.
Shuffle Types – Basic Hash Shuffle Problem
• Let's use 100KB as the buffer size
• We have 10,000 reducers
• 10 mapper tasks per executor
• In-memory buffer size = 100KB * 10,000 * 10
• The buffers need about 10GB per executor
• This amount of buffer memory is not acceptable, so this implementation cannot support 10,000 reducers
27.
Shuffle Types – Consolidated Hash Shuffle
• A solution to decrease the in-memory buffer size and the number of files
• Within an executor, map tasks write each bucket to a segment of a shared file
• #Shuffle files per executor = #Reducers (R); #In-memory buffers per executor = R
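The buffer arithmetic for the two hash-shuffle variants can be written out as a short calculation. It follows the slides' own counting (one buffer per map task and reducer in the basic variant, one per reducer per executor in the consolidated variant) and uses decimal units so the result matches the quoted 10GB; the names are invented for this sketch.

```scala
object HashShuffleBuffers {
  val KB: Long = 1000L
  val GB: Long = 1000L * 1000L * 1000L

  // Basic hash shuffle: one buffer per (map task, reducer) pair on the executor.
  def basicBufferBytes(bufferBytes: Long, reducers: Int, mapTasksPerExecutor: Int): Long =
    bufferBytes * reducers * mapTasksPerExecutor

  // Consolidated hash shuffle, as counted on the slide: one buffer per reducer per executor.
  def consolidatedBufferBytes(bufferBytes: Long, reducers: Int): Long =
    bufferBytes * reducers

  // Basic hash shuffle writes one file per (map task, reducer) pair.
  def basicShuffleFiles(mapTasks: Int, reducers: Int): Long =
    mapTasks.toLong * reducers

  def main(args: Array[String]): Unit = {
    val buffer = 100 * KB
    println(basicBufferBytes(buffer, 10000, 10) / GB)    // GB per executor, basic
    println(consolidatedBufferBytes(buffer, 10000) / GB) // GB per executor, consolidated
    println(basicShuffleFiles(10, 10000))                // files for 10 map tasks
  }
}
```

With the slide's numbers, the basic variant needs 10GB of buffers per executor and the consolidated variant 1GB, while 10 map tasks alone already produce 100,000 shuffle files in the basic scheme.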
28.
Shuffle Types – Sort-Based Shuffle
• Consolidated hash shuffle needs one file for each reducer
  - Total of C*R intermediate files, where C = number of executors running map tasks
• Still too many files (e.g. ~10k reducers); needs significant memory for compression and serialization buffers
• "Too many open files" issue
• Sort-based shuffle is similar to the map-side shuffle in MapReduce
• Introduced in Spark 1.1; it is now the default shuffle
29.
Shuffle Types – Sort-Based Shuffle
• Map output records from each task are kept in memory as long as they fit
• Once memory is full, the data is sorted by partition and spilled to a single file
• Each map task generates one data file and one index file
• An external sorter does the sort work
• If a map-side combiner is required, data is sorted by key and partition; otherwise only by partition
• If #reducers <= 200, no sorting is done: the hash approach is used, generating one file per reducer and then merging them into a single file
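The "one data file plus one index file" idea can be sketched in memory. In this toy model the data file is a vector of records ordered by target partition and the index holds record offsets (a real index file would hold byte offsets); spilling, combining, and serialization are left out, and all names are invented.

```scala
object SortShuffleModel {
  // Non-negative hash partitioning of a key.
  def partitionOf(key: Any, numPartitions: Int): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }

  // Returns (records ordered by partition id, index of start offsets).
  // index(p) is where partition p starts; index(numPartitions) marks the end.
  def writeMapOutput[K, V](records: Seq[(K, V)], numPartitions: Int): (Vector[(K, V)], Vector[Int]) = {
    // Stable sort by partition only, so no ordering by key is imposed.
    val data = records.sortBy(r => partitionOf(r._1, numPartitions)).toVector
    val counts = Array.fill(numPartitions)(0)
    data.foreach(r => counts(partitionOf(r._1, numPartitions)) += 1)
    (data, counts.scanLeft(0)(_ + _).toVector)
  }

  // A reducer reads only its own slice of the single data "file".
  def readPartition[K, V](data: Vector[(K, V)], index: Vector[Int], p: Int): Vector[(K, V)] =
    data.slice(index(p), index(p + 1))

  def main(args: Array[String]): Unit = {
    val records = Seq(3 -> "c", 0 -> "a", 4 -> "d", 1 -> "b", 2 -> "e")
    val (data, index) = writeMapOutput(records, 2)
    println(data)
    println(index)
    println(readPartition(data, index, 1))
  }
}
```

One map task therefore produces a single output regardless of the number of reducers, and each reducer locates its records through the index instead of opening its own file.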
30.
Shuffle Reader
• On the reader side, both sort and hash shuffle use the hash shuffle reader
• On the reducer side, a set of threads fetches remote map output blocks
• Once a block arrives, its records are deserialized and passed into a result queue
• Records are passed to ExternalAppendOnlyMap; for ordering operations like sortByKey, records are passed to ExternalSorter
[Figure: buckets feeding the aggregators of four reduce tasks]
31.
Types of RDDs - RDD Interface
Base for all RDDs (RDD.scala), consisting of:
• A set of partitions ("splits" in Hadoop)
• A list of dependencies on parent RDDs
• A function to compute a partition from its parents
• Optional preferred locations for each partition
• A Partitioner that defines the partitioning strategy (hash/range)
• Basic operations like map, filter, persist, etc.
[Figure: partitions, dependencies, and compute provide lineage; preferredLocations and partitioner enable optimized execution; map, filter, and persist are the operations]
32.
Example: HadoopRDD
• partitions = one per HDFS block
• dependencies = none
• compute(partition) = read corresponding block
• preferredLocations(part) = HDFS block location
• partitioner = none
33.
Example: MapPartitionsRDD
• partitions = same as the parent RDD
• dependencies = "one-to-one" on the parent RDD
• compute(partition) = apply map on parent
• preferredLocations(part) = none (ask parent)
• partitioner = none
34.
Example: CoGroupedRDD
• partitions = one per reduce task
• dependencies = could be narrow or wide
• compute(partition) = read and join shuffled data
• preferredLocations(part) = none
• partitioner = HashPartitioner(numTasks)
35.
Extending RDDs
Extend RDDs to:
• Add domain-specific transformations/actions
  - Lets developers express domain-specific calculations in a cleaner way
  - Improves code readability
  - Easy to maintain
• Create a domain-specific RDD
  - Better way to express domain-specific data
  - Better control over partitioning and distribution
• Add a new input data source
36.
How to Extend
• Add custom operators to RDD
  - Use Scala implicits
  - Feels and works like a built-in operator
  - You can add an operator to a specific RDD type or to all RDDs
• Custom RDD
  - Extend the RDD API to create your own RDD
  - Implement the compute and getPartitions abstract methods
37.
Implicit Class
• Creates an extension method on an existing type
• Introduced in Scala 2.10
• Implicits are checked at compile time
• An implicit class is resolved into a class definition plus an implicit conversion
• We will use an implicit class to add a new method to RDD
38.
Adding a New Operator to RDD
• We will use Scala's implicit feature to add a new operator to an existing RDD
• This operator will show up only on our RDD type
• Implicit conversions are handled by Scala
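The mechanism looks like this in plain Scala. To stay runnable without Spark, the sketch attaches the operator to `Seq[Sale]` instead of an RDD; `Sale`, `SaleOps`, and `totalSales` are hypothetical names chosen for the example.

```scala
object ImplicitOperatorSketch {
  // Hypothetical domain record.
  final case class Sale(item: String, amount: Double)

  // Adds totalSales only to sequences of Sale; a Seq[Int] does not get it.
  implicit class SaleOps(private val sales: Seq[Sale]) {
    def totalSales: Double = sales.map(_.amount).sum
  }

  def demo(): Double = {
    val sales = Seq(Sale("a", 10.0), Sale("b", 2.5))
    // Reads like a built-in method; the compiler inserts the conversion to SaleOps.
    sales.totalSales
  }

  def main(args: Array[String]): Unit = println(demo())
}
```

With Spark on the classpath, the same pattern would wrap an RDD of the domain type in place of the Seq, and the new operator would then appear only on RDDs of that element type.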
39.
Custom RDD Implementation
• Extending RDD allows you to create your own custom RDD structure
• A custom RDD gives control over computation, partitioning, and locality information
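The shape of a custom RDD can be sketched with a small stand-in hierarchy. `MiniRDD` below is not Spark's RDD class; it only borrows the names getPartitions, compute, and preferredLocations to show which pieces a subclass supplies. `RangeRDD` plays the role of a new data source and `MappedRDD` the role of a derived RDD with a one-to-one dependency on its parent.

```scala
object MiniRddSketch {
  // Local stand-ins; these are not Spark's classes.
  final case class Part(index: Int)

  abstract class MiniRDD[T] {
    def getPartitions: Vector[Part]
    def compute(split: Part): Iterator[T]
    def preferredLocations(split: Part): Seq[String] = Nil

    // An "action": computes every partition and gathers the results.
    def collect(): Vector[T] = getPartitions.flatMap(p => compute(p))
    // A "transformation": only wraps the parent, nothing is computed yet.
    def map[U](f: T => U): MiniRDD[U] = new MappedRDD(this, f)
  }

  // Source RDD: the numbers start until end, cut into numSlices partitions.
  class RangeRDD(start: Int, end: Int, numSlices: Int) extends MiniRDD[Int] {
    def getPartitions: Vector[Part] = Vector.tabulate(numSlices)(Part(_))
    def compute(split: Part): Iterator[Int] = {
      val len = end - start
      val lo = start + split.index * len / numSlices
      val hi = start + (split.index + 1) * len / numSlices
      (lo until hi).iterator
    }
  }

  // Derived RDD: same partitions as the parent, compute applies f to the parent's data.
  class MappedRDD[T, U](parent: MiniRDD[T], f: T => U) extends MiniRDD[U] {
    def getPartitions: Vector[Part] = parent.getPartitions
    def compute(split: Part): Iterator[U] = parent.compute(split).map(f)
    override def preferredLocations(split: Part): Seq[String] =
      parent.preferredLocations(split)
  }

  def main(args: Array[String]): Unit = {
    val rdd = new RangeRDD(0, 10, 3)
    println(rdd.compute(Part(1)).toList)
    println(rdd.map(_ * 2).collect())
  }
}
```

A real custom RDD would extend Spark's RDD class and work with Spark's own partition and task-context types, but the division of labour is the same: getPartitions decides how the data is cut, and compute produces the records of one partition.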
40.
Caching in RDD
• Spark allows caching/persisting an entire dataset in memory
• Persisting an RDD in cache
  - The first time it is computed, it is kept in memory
  - The cached partitions are reused in the next set of operations
  - Fault-tolerant: recomputed in case of failure
• Caching is a key tool for interactive and iterative algorithms
• persist supports different storage levels
  - Storage level: in memory, on disk, or both; Tachyon
  - Serialized vs deserialized
41.
Caching in RDD
• SparkContext tracks persistent RDDs
• The Block Manager puts a partition in memory when it is first evaluated
• Caching is lazy: nothing is cached until an action runs
• Shuffle also keeps its data in cache after shuffle operations
• We still need to cache shuffled RDDs
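The lazy nature of caching can be modelled in a few lines of plain Scala. `CachedDataset` is an invented toy class, not Spark's API: marking it as persisted stores nothing by itself, the first "action" fills the cache partition by partition, and later actions reuse it.

```scala
object CacheModel {
  final class CachedDataset[T](compute: Int => Vector[T], numPartitions: Int) {
    private val cache = scala.collection.mutable.Map.empty[Int, Vector[T]]
    private var persisted = false
    // Number of times a partition was actually computed.
    var computeCalls = 0

    // Only marks the dataset; nothing is computed or stored here.
    def persist(): this.type = { persisted = true; this }

    private def partition(i: Int): Vector[T] =
      if (persisted) cache.getOrElseUpdate(i, { computeCalls += 1; compute(i) })
      else { computeCalls += 1; compute(i) }

    // The "action": evaluates (or reuses) every partition.
    def collect(): Vector[T] = (0 until numPartitions).toVector.flatMap(partition)
  }

  def main(args: Array[String]): Unit = {
    val ds = new CachedDataset[Int](i => Vector(i, i * 10), 2).persist()
    println(ds.computeCalls) // persist alone computed nothing
    ds.collect()
    ds.collect()
    println(ds.computeCalls) // the second collect reused the cached partitions
  }
}
```

Without the persist() call the same two collects would compute every partition twice, which is the recomputation cost that caching is meant to remove.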
42.
Caching Demo