SlideShare une entreprise Scribd logo
1  sur  33
When DataFrames fail,
resort to mapPartitions
aka “What to do when Spark can’t handle it”
Berlin Buzzwords, June 2018
Matija Gobec
matija.gobec@smartcat.io
@mad_max0204
Why this talk?
Complex problem with an easy solution
(or not)
2k+ lines of resulting dataframe tree schema
Logically correct vs Correct
Logically correct solution
● Data analysis process (data exploration)
● Get to the correct output ASAP
● Validate result
● Run on subset (if possible)
○ Quick iterations
○ Tends to eliminate data issues
Correct solution
● Correct result
● Repeatable performance
● Scalable and optimized
● Proper data handling
Let’s dig in
spark.read.jdbc
What in the world?
Specify columnName for proper distribution
numPartitions defines parallelism
Query for bounds (upper and lower)
val df = (spark.read.jdbc(url=jdbcUrl,
table="some_table",
columnName="some_column",
lowerBound=1L,
upperBound=100000L,
numPartitions=100,
connectionProperties=connectionProperties))
(Re)partitioning
df.repartition(100) - repartition to a number of partitions
df.repartition(100, col("some_column")) - repartition and split on column
df.coalesce(10) - coalesce (narrow dependency)
(Re)partitioning
Partition by a join field
Repartition is expensive but can help down the road
coalesce vs repartition
HashPartitioner
Number of partitions divisible by available cores
(usually from 2 to 10 x cores)
Data locality
PROCESS_LOCAL - data is in the same JVM
NODE_LOCAL - data is on the same node (HDFS, local executor)
RACK_LOCAL, ANY - data is not on the executors
NO_PREF - no locality preference
Data locality
Work distribution
Work distribution
Broadcasting variables
Keeps a copy of data on each worker
Reduces the size of a serialized task as data is on the worker already
Optimizes join operations
Checkpointing
Truncate lineage graph
Reliable - HDFS storage (sc.setCheckpointDir)
Local - node local temp storage
Metrics
Spark WebUI is your friend
Use Spark History Server
JSON metrics over REST API
Collect JMX metrics (Dropwizard)
Event timeline
Storage tab
RDD storage details
But what if all fails?
What is Apache Spark?
Let’s use it
Spark is a distributed computing framework
Spark provides resource management
Spark provides controllable work distribution
Spark is highly configurable
Why not?
MapPartitions function
/**
* Return a new RDD by applying a function to each partition of this RDD.
*
* `preservesPartitioning` indicates whether the input function preserves the partitioner, which
* should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
*/
def mapPartitions[U: ClassTag](
f: Iterator[T] => Iterator[U],
preservesPartitioning: Boolean = false): RDD[U]
MapPartitions function - indexed
/**
* Return a new RDD by applying a function to each partition of this RDD, while tracking the index
* of the original partition.
*
* `preservesPartitioning` indicates whether the input function preserves the partitioner, which
* should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
*/
def mapPartitionsWithIndex[U: ClassTag](
f: (Int, Iterator[T]) => Iterator[U],
preservesPartitioning: Boolean = false): RDD[U]
MapPartitions example
def mapFunction(args): Iterator[Row] => Iterator[Row] = iterator => {
// Processing based on the partition data
// Example:
// val ids = iterator.toList.map(row => row.getInt(0))
// Processor.doWork(ids)
iterator
}
Other usages
Applying custom processing code
Running ML pipeline
Using third party libraries
Misc
How about them numbers?
Results
Processing time dropped from ~96 hours to ~3.5hours (28 times)
Adding new entities doesn’t significantly impact complexity/performance
Additional control which we leveraged down the road
We managed to use all the resources
We managed to use all the resources
Raised code complexity
PROs
CONs
Q&A
Matija Gobec
matija.gobec@smartcat.io
@mad_max0204
Thank you

Contenu connexe

Tendances

Scaling up data science applications
Scaling up data science applicationsScaling up data science applications
Scaling up data science applicationsKexin Xie
 
Hive query optimization infinity
Hive query optimization infinityHive query optimization infinity
Hive query optimization infinityShashwat Shriparv
 
SQL Joins and Query Optimization
SQL Joins and Query OptimizationSQL Joins and Query Optimization
SQL Joins and Query OptimizationBrian Gallagher
 
CS 542 -- Query Optimization
CS 542 -- Query OptimizationCS 542 -- Query Optimization
CS 542 -- Query OptimizationJ Singh
 
A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat DetectionJen Aman
 
ETL and pivoting in spark
ETL and pivoting in sparkETL and pivoting in spark
ETL and pivoting in sparkSubhasish Guha
 
Database , 8 Query Optimization
Database , 8 Query OptimizationDatabase , 8 Query Optimization
Database , 8 Query OptimizationAli Usman
 
Overview of query evaluation
Overview of query evaluationOverview of query evaluation
Overview of query evaluationavniS
 
Guacamole Fiesta: What do avocados and databases have in common?
Guacamole Fiesta: What do avocados and databases have in common?Guacamole Fiesta: What do avocados and databases have in common?
Guacamole Fiesta: What do avocados and databases have in common?ArangoDB Database
 
Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...Julian Hyde
 
Monitoring Postgres at Scale | PostgresConf US 2018 | Lukas Fittl
Monitoring Postgres at Scale | PostgresConf US 2018 | Lukas FittlMonitoring Postgres at Scale | PostgresConf US 2018 | Lukas Fittl
Monitoring Postgres at Scale | PostgresConf US 2018 | Lukas FittlCitus Data
 
Basic Tutorial of Association Mapping by Avjinder Kaler
Basic Tutorial of Association Mapping by Avjinder KalerBasic Tutorial of Association Mapping by Avjinder Kaler
Basic Tutorial of Association Mapping by Avjinder KalerAvjinder (Avi) Kaler
 
2. R-basics, Vectors, Arrays, Matrices, Factors
2. R-basics, Vectors, Arrays, Matrices, Factors2. R-basics, Vectors, Arrays, Matrices, Factors
2. R-basics, Vectors, Arrays, Matrices, Factorskrishna singh
 
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...CloudxLab
 
Как приготовить тестовые данные для Big Data проекта. Пример из практики
Как приготовить тестовые данные для Big Data проекта. Пример из практикиКак приготовить тестовые данные для Big Data проекта. Пример из практики
Как приготовить тестовые данные для Big Data проекта. Пример из практикиSQALab
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Spark Summit
 
Map reduce: beyond word count
Map reduce: beyond word countMap reduce: beyond word count
Map reduce: beyond word countJeff Patti
 

Tendances (20)

Scaling up data science applications
Scaling up data science applicationsScaling up data science applications
Scaling up data science applications
 
Semi join
Semi joinSemi join
Semi join
 
Hive query optimization infinity
Hive query optimization infinityHive query optimization infinity
Hive query optimization infinity
 
SQL Joins and Query Optimization
SQL Joins and Query OptimizationSQL Joins and Query Optimization
SQL Joins and Query Optimization
 
CS 542 -- Query Optimization
CS 542 -- Query OptimizationCS 542 -- Query Optimization
CS 542 -- Query Optimization
 
A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat Detection
 
ETL and pivoting in spark
ETL and pivoting in sparkETL and pivoting in spark
ETL and pivoting in spark
 
Hadoop map reduce concepts
Hadoop map reduce conceptsHadoop map reduce concepts
Hadoop map reduce concepts
 
Database , 8 Query Optimization
Database , 8 Query OptimizationDatabase , 8 Query Optimization
Database , 8 Query Optimization
 
Overview of query evaluation
Overview of query evaluationOverview of query evaluation
Overview of query evaluation
 
Guacamole Fiesta: What do avocados and databases have in common?
Guacamole Fiesta: What do avocados and databases have in common?Guacamole Fiesta: What do avocados and databases have in common?
Guacamole Fiesta: What do avocados and databases have in common?
 
Unit 3 lecture-2
Unit 3 lecture-2Unit 3 lecture-2
Unit 3 lecture-2
 
Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...
 
Monitoring Postgres at Scale | PostgresConf US 2018 | Lukas Fittl
Monitoring Postgres at Scale | PostgresConf US 2018 | Lukas FittlMonitoring Postgres at Scale | PostgresConf US 2018 | Lukas Fittl
Monitoring Postgres at Scale | PostgresConf US 2018 | Lukas Fittl
 
Basic Tutorial of Association Mapping by Avjinder Kaler
Basic Tutorial of Association Mapping by Avjinder KalerBasic Tutorial of Association Mapping by Avjinder Kaler
Basic Tutorial of Association Mapping by Avjinder Kaler
 
2. R-basics, Vectors, Arrays, Matrices, Factors
2. R-basics, Vectors, Arrays, Matrices, Factors2. R-basics, Vectors, Arrays, Matrices, Factors
2. R-basics, Vectors, Arrays, Matrices, Factors
 
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
 
Как приготовить тестовые данные для Big Data проекта. Пример из практики
Как приготовить тестовые данные для Big Data проекта. Пример из практикиКак приготовить тестовые данные для Big Data проекта. Пример из практики
Как приготовить тестовые данные для Big Data проекта. Пример из практики
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
 
Map reduce: beyond word count
Map reduce: beyond word countMap reduce: beyond word count
Map reduce: beyond word count
 

Similaire à Berlin buzzwords 2018

Big Data Analytics with Apache Spark
Big Data Analytics with Apache SparkBig Data Analytics with Apache Spark
Big Data Analytics with Apache SparkMarcoYuriFujiiMelo
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkDatabricks
 
Learning spark ch04 - Working with Key/Value Pairs
Learning spark ch04 - Working with Key/Value PairsLearning spark ch04 - Working with Key/Value Pairs
Learning spark ch04 - Working with Key/Value Pairsphanleson
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkPatrick Wendell
 
04 spark-pair rdd-rdd-persistence
04 spark-pair rdd-rdd-persistence04 spark-pair rdd-rdd-persistence
04 spark-pair rdd-rdd-persistenceVenkat Datla
 
Databricks spark-knowledge-base-1
Databricks spark-knowledge-base-1Databricks spark-knowledge-base-1
Databricks spark-knowledge-base-1Rahul Kumar
 
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...Databricks
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesWalaa Hamdy Assy
 
Optimizations in Spark; RDD, DataFrame
Optimizations in Spark; RDD, DataFrameOptimizations in Spark; RDD, DataFrame
Optimizations in Spark; RDD, DataFrameKnoldus Inc.
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Massimo Schenone
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax EnablementVincent Poncet
 
Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Holden Karau
 
Quick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skillsQuick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skillsRavindra kumar
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)Paul Chao
 
Bigdata processing with Spark - part II
Bigdata processing with Spark - part IIBigdata processing with Spark - part II
Bigdata processing with Spark - part IIArjen de Vries
 
Introduction to Spark - DataFactZ
Introduction to Spark - DataFactZIntroduction to Spark - DataFactZ
Introduction to Spark - DataFactZDataFactZ
 
Drizzles Approach To Improving Performance Of The Server
Drizzles  Approach To  Improving  Performance Of The  ServerDrizzles  Approach To  Improving  Performance Of The  Server
Drizzles Approach To Improving Performance Of The ServerPerconaPerformance
 
Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015cdmaxime
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkDatabricks
 

Similaire à Berlin buzzwords 2018 (20)

Big Data Analytics with Apache Spark
Big Data Analytics with Apache SparkBig Data Analytics with Apache Spark
Big Data Analytics with Apache Spark
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Learning spark ch04 - Working with Key/Value Pairs
Learning spark ch04 - Working with Key/Value PairsLearning spark ch04 - Working with Key/Value Pairs
Learning spark ch04 - Working with Key/Value Pairs
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
04 spark-pair rdd-rdd-persistence
04 spark-pair rdd-rdd-persistence04 spark-pair rdd-rdd-persistence
04 spark-pair rdd-rdd-persistence
 
Databricks spark-knowledge-base-1
Databricks spark-knowledge-base-1Databricks spark-knowledge-base-1
Databricks spark-knowledge-base-1
 
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
Optimizations in Spark; RDD, DataFrame
Optimizations in Spark; RDD, DataFrameOptimizations in Spark; RDD, DataFrame
Optimizations in Spark; RDD, DataFrame
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
 
Apache Spark and DataStax Enablement
Apache Spark and DataStax EnablementApache Spark and DataStax Enablement
Apache Spark and DataStax Enablement
 
Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016
 
Quick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skillsQuick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skills
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
Bigdata processing with Spark - part II
Bigdata processing with Spark - part IIBigdata processing with Spark - part II
Bigdata processing with Spark - part II
 
Introduction to Spark - DataFactZ
Introduction to Spark - DataFactZIntroduction to Spark - DataFactZ
Introduction to Spark - DataFactZ
 
Drizzles Approach To Improving Performance Of The Server
Drizzles  Approach To  Improving  Performance Of The  ServerDrizzles  Approach To  Improving  Performance Of The  Server
Drizzles Approach To Improving Performance Of The Server
 
Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache Spark
 

Dernier

Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfAzure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfryanfarris8
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrandmasabamasaba
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...Nitya salvi
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisamasabamasaba
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesVictorSzoltysek
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...Health
 
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456KiaraTiradoMicha
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionOnePlan Solutions
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdfPearlKirahMaeRagusta1
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024Mind IT Systems
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is insideshinachiaurasa2
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech studentsHimanshiGarg82
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfonteinmasabamasaba
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...SelfMade bd
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...kalichargn70th171
 
BUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptxBUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptxalwaysnagaraju26
 

Dernier (20)

Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfAzure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...Chinsurah Escorts ☎️8617697112  Starting From 5K to 15K High Profile Escorts ...
Chinsurah Escorts ☎️8617697112 Starting From 5K to 15K High Profile Escorts ...
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
 
BUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptxBUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptx
 

Berlin buzzwords 2018

  • 1.
  • 2. When DataFrames fail, resort to mapPartitions aka “What to do when Spark can’t handle it” Berlin Buzzwords, June 2018 Matija Gobec matija.gobec@smartcat.io @mad_max0204
  • 4. Complex problem with an easy solution (or not)
  • 5. 2k+ lines of resulting dataframe tree schema
  • 7. Logically correct solution ● Data analysis process (data exploration) ● Get to the correct output ASAP ● Validate result ● Run on subset (if possible) ○ Quick iterations ○ Tends to eliminate data issues
  • 8. Correct solution ● Correct result ● Repeatable performance ● Scalable and optimized ● Proper data handling
  • 10. spark.read.jdbc What in the world? Specify columnName for proper distribution numPartitions defines parallelism Query for bounds (upper and lower) val df = (spark.read.jdbc(url=jdbcUrl, table="some_table", columnName="some_column", lowerBound=1L, upperBound=100000L, numPartitions=100, connectionProperties=connectionProperties))
  • 11. (Re)partitioning df.repartition(100) - repartition to a number of partitions df.repartition(100, col("some_column")) - repartition and split on column df.coalesce(10) - coalesce (narrow dependency)
  • 12. (Re)partitioning Partition by a join field Repartition is expensive but can help down the road coalesce vs repartition HashPartitioner Number of partitions divisible by available cores (usually from 2 to 10 x cores)
  • 13. Data locality PROCESS_LOCAL - data is in the same JVM NODE_LOCAL - data is on the same node (HDFS, local executor) RACK_LOCAL, ANY - data is not on the executors NO_PREF - no locality preference
  • 17. Broadcasting variables Keeps a copy of data on each worker Reduces the size of a serialized task as data is on the worker already Optimizes join operations
  • 18. Checkpointing Truncate lineage graph Reliable - HDFS storage (sc.setCheckpointDir) Local - node local temp storage
  • 19. Metrics Spark WebUI is your friend Use Spark History Server JSON metrics over REST API Collect JMX metrics (Dropwizard)
  • 23. But what if all fails?
  • 24. What is Apache Spark?
  • 25. Let’s use it Spark is a distributed computing framework Spark provides resource management Spark provides controllable work distribution Spark is highly configurable Why not?
  • 26. MapPartitions function /** * Return a new RDD by applying a function to each partition of this RDD. * * `preservesPartitioning` indicates whether the input function preserves the partitioner, which * should be `false` unless this is a pair RDD and the input function doesn't modify the keys. */ def mapPartitions[U: ClassTag]( f: Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]
  • 27. MapPartitions function - indexed /** * Return a new RDD by applying a function to each partition of this RDD, while tracking the index * of the original partition. * * `preservesPartitioning` indicates whether the input function preserves the partitioner, which * should be `false` unless this is a pair RDD and the input function doesn't modify the keys. */ def mapPartitionsWithIndex[U: ClassTag]( f: (Int, Iterator[T]) => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]
  • 28. MapPartitions example def mapFunction(args): Iterator[Row] => Iterator[Row] = iterator => { // Processing based on the partition data // Example: // val ids = iterator.toList.map(row => row.getInt(0)) // Processor.doWork(ids) iterator }
  • 29. Other usages Applying custom processing code Running ML pipeline Using third party libraries Misc
  • 30. How about them numbers?
  • 31. Results Processing time dropped from ~96 hours to ~3.5hours (28 times) Adding new entities doesn’t significantly impact complexity/performance Additional control which we leveraged down the road We managed to use all the resources We managed to use all the resources Raised code complexity PROs CONs
  • 32. Q&A